So with the ps4 using gddr5 as system memory and vram, it got me thinking.

Would it be possible for an APU to have HBM as system ram and vram too?

What would the limitations be? What would the advantages be?

It could be partitioned
>4GB VRAM
>12GB system RAM

16GB total memory.

Maybe even for gaming laptops.

IIRC AMD is going to do HBM on some APUs eventually as a replacement for L3$/L4$

Imagine 4GB of L3$...

Holy shit

>Imagine 4GB of L3$

Yes
limitations:
you're capped at whatever's there
adding any onboard stuff would slow it down
exceptionally expensive
CPUs don't like high latencies.

Imagine 4GB of being a poorfag cuck who can't afford a decent/superior Nvidia product

The 1080 is the best GPU ever made, and it would be a mistake not to buy one

>high latency/high bandwidth ram as cache
do not want.jpg

It'll be good for laptops though. You won't need more than 16GB and the ram slots could be used to squeeze in a bigger cooler for the APU.

Not sure about latency. How did Sony do it?

>high latency
current APUs don't even have an L3$. Even if it has higher latency, it'd still be faster to keep decoded uops in a DRAM-like memory than to discard them and decode them all over again.

>current APUs don't have L3 cache
LOL
damn I like AMD but this is too much.
A DRAM cache still won't work like normal caches do, though.

Y tho

Yes, they'll probably move that direction for their mobile APUs. They'd be wasting system power by having both HBM on package and DDR4 soldered onto the board.

HBM is far, far too slow to act as a CPU cache. That isn't happening.

Literally everything you posted there is wrong. How long it takes to reissue an instruction can be measured in clock cycles, the same is true of how long it takes to access DRAM through the memory controller.
HBM as a last level cache would be so slow that it would have a negative impact on performance. DRAM and SRAM are in entirely different leagues.
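Quick napkin math to put those clock cycles side by side. Minimal sketch in C; every latency in it is a ballpark guess, not a vendor spec:

/* back-of-envelope only: all latencies below are rough guesses, not specs */
#include <stdio.h>

int main(void) {
    double ghz      = 4.0;    /* assumed core clock                         */
    double ns_l3    = 10.0;   /* typical-ish on-die SRAM L3 hit             */
    double ns_edram = 32.0;   /* roughly what Intel's L4 eDRAM gets quoted  */
    double ns_dram  = 80.0;   /* system DRAM through the memory controller  */

    printf("L3 hit   : ~%.0f cycles\n", ns_l3    * ghz);
    printf("eDRAM L4 : ~%.0f cycles\n", ns_edram * ghz);
    printf("DRAM     : ~%.0f cycles\n", ns_dram  * ghz);
    return 0;
}

Re-decoding something that's already sitting in the L1i costs a handful of cycles; anything that has to come back through a DRAM interface costs tens to hundreds.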

Except people do it all the time, what do you think 1T-SRAM is?
Hint: It's not SRAM.

We're not talking about replacing L1/L2 with DRAM, but rather adding an extra level of DRAM cache (which btw has far lower latency than DDR4 system RAM).

>>current APUs don't have L3 cache
>LOL
They actually don't though...

Wow you got me, I'm buying 4 1080s right now.

Look at the performance difference between Intel i7s and similarly clocked/cored/threaded Xeons.
More cache == better performance.

The HBM on Fiji didn't do a whole lot for gaming performance because that's limited by the core count/speed above a certain memory throughput. The compute performance of Fiji is insaaaaaaane.

>1T-SRAM
basically dram ^:)

>They actually don't though...
no i believe you (i hope you're right) and that is pretty shocking. no wonder APUs are kinda shitty.

Just looked it up, oh boy.

Stupid question though: if cache is so important, why not add more? Why not L4 and L5 cache?

Is compute performance usually more bandwidth-limited than gaming?

I've done a little GPGPU programming and I know that memory access is the major bottleneck, but I've never done any game programming.

Adding more caches means copying data between them.
The less you have to copy between caches, the lower the latency for memory requests, and the less time the CPU spends waiting for the data to arrive. Ideally, you would have a single pool of memory that the CPU accesses directly (which is how simple microcontrollers work).

I'm not sure on the specifics, but gaming is more dependent on ROP and shader speed, and is less concerned with throwing data around than with doing really fast calculations over small pieces of data.

Then why not one larger L1 cache instead of several smaller L1, L2, and L3 caches?

L1 cache is hella expensive and hard to manufacture without defects. It has to be right next to the core on the die so that you can keep clock speeds high, as you're limited by the speed of light in how far you can send data across the chip at blazing speeds.

The larger a cache is, the slower it is. This is why the fastest SRAM arrangements are always the smallest and closest to logic. L1 is about an order of magnitude faster than L2. This isn't by accident.
The L1 is hit more often, so it needs to be faster. L2 is still hit often, but larger working sets land there, so you need more of it, even if it's private. L3 is larger still, and an order of magnitude slower than L2, but it tends to be shared between cores.
How associative the cache is also has a huge impact on performance.

The more stages in a cache hierarchy, the less often they're hit. It's diminishing returns.
You spend more and more transistors on something that offers you no performance in the end, and it still draws power.
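If anyone wants to see that latency ladder first-hand, here's a crude pointer-chase sketch in C. The sizes, hop count, and shuffle are arbitrary choices for illustration, not a proper benchmark; build with -O2 and expect the ns/load figure to step up each time the working set outgrows a cache level:

/* crude cache-latency ladder: chase random pointers through working sets
   of increasing size; results are machine-dependent, this is just a sketch */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    for (size_t kib = 16; kib <= 65536; kib *= 4) {   /* 16 KiB .. 64 MiB  */
        size_t n = kib * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof *next);
        if (!next) return 1;

        /* Sattolo's shuffle: one big cycle, so the prefetcher can't help   */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        size_t p = 0, hops = 20 * 1000 * 1000;
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (size_t i = 0; i < hops; i++) p = next[p];   /* the chase       */
        clock_gettime(CLOCK_MONOTONIC, &b);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("%6zu KiB: %5.1f ns/load (sink %zu)\n", kib, ns / hops, p);
        free(next);
    }
    return 0;
}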

You cannot compare how a GPU uses VRAM memory bandwidth to how a CPU uses its caches.
Fiji's biggest bottleneck is pixel throughput from the same limited ROP configuration featured in Hawaii.


As always this board is filled with tech illiterate retards who still feel the need to talk out of their asses.

>You cannot compare how a GPU uses VRAM memory bandwidth to how a CPU uses its caches
You can compare it almost directly to how something like L3 cache works, which is very similar to how AMD APU CPU cores utilize DRAM.

No, you can't. VRAM swaps geometry and texture data primarily, not ops.
GPUs have their own L1 and L2 caches. DRAM and SRAM are not used for the same things. Stop trying to participate in a conversation you have no business being in.

It depends. It's just as conceivable that a graphics workload can eat as much bandwidth; it's just that in gaming most data is forwarded from the command processors into caches and fed into the queues, so most of the bottlenecking occurs in texture mapping or rasterization. Those stages take much longer, require signaling to be offloaded, and can be more or less serial rather than parallelizable, unlike most GPGPU operations intended to be offloaded to a coprocessor. If you write the code right, cards can usually saturate the compute queue and burn through the entire compute stack faster than you can feed it work. The Fiji cards are in fact ideal for compute-only parallel scenarios because they don't have that many ROPs but do have a huge number of shaders, which wouldn't be bottlenecked by the command processor if it were fed only a long compute queue.
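Napkin roofline math on that, using Fiji's paper specs; treat the kernel figure as illustrative, nothing here was measured:

/* roofline napkin math: Fury X paper specs, example kernel intensity       */
#include <stdio.h>

int main(void) {
    double flops_peak = 8.6e12;               /* ~8.6 TFLOPS FP32           */
    double bw_peak    = 512e9;                /* 512 GB/s of HBM            */
    double balance    = flops_peak / bw_peak; /* FLOP per byte to stay busy */

    double saxpy = 2.0 / 12.0;  /* y = a*x + y: 2 FLOPs per 12 bytes moved  */

    printf("machine balance : %4.1f FLOP/byte\n", balance);
    printf("SAXPY intensity : %4.2f FLOP/byte -> hopelessly bandwidth bound\n",
           saxpy);
    printf("above ~%.0f FLOP/byte the shaders, not the HBM, are the limit\n",
           balance);
    return 0;
}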

Data is Data.
Can you explain to me why copying ops is different to something like texture data??

INSTRUCTIONS ARE EXTREMELY TIMING SENSITIVE YOU TECH ILLITERATE RETARD

Memory controllers can spend over 500 clock cycles processing some blocks based on their size. The data moved through VRAM is not timing sensitive. If a CPU core were to wait 100 clock cycles on an instruction every time it was fetched it would cripple performance. You'd have Bulldozer performance.
Stop talking out of your ass.
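Rough CPI math on why that matters; the rates here are made up purely to illustrate the point:

/* back-of-envelope CPI impact of long fetch stalls; all rates are made up  */
#include <stdio.h>

int main(void) {
    double cpi_base  = 1.0;    /* assume ~1 instruction per clock otherwise */
    double miss_rate = 0.01;   /* say 1 in 100 fetches goes out to DRAM     */
    double stall     = 200.0;  /* cycles lost on each such fetch (a guess)  */

    double cpi = cpi_base + miss_rate * stall;
    printf("CPI goes from %.1f to %.1f -- roughly %.0fx slower\n",
           cpi_base, cpi, cpi / cpi_base);
    return 0;
}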

What card is that OP?

>data moved through VRAM is not timing sensitive
That'll be why people don't OC VRAM then.

File name

>HBM die
>2x8-pin connectors
>short board
I wonder what?

Are you kidding?

The amount of bandwidth a GPU needs is directly proportional to the throughput of its ALUs. You overclock VRAM along with the core clock so performance scales as linearly as possible without incurring any bottleneck.

This has absolutely nothing to do with how timing sensitive the workload is. You're still trying to equate DRAM and SRAM. They are not comparable. The data read and written to them is entirely different.

It's about time you stopped posting.

Bump

>Nvidia makes x86 CPUs

They tried once.

Really?

So, you're saying that slow DRAM makes the GPU wait for data, and that slow SRAM makes the CPU wait for data, and that there are no similarities?
Say I have a cache miss in L2, and I have to go fetch that data from DRAM. You're telling me that DRAM on a motherboard, behind a bunch of controllers, is no different from putting that DRAM right next to the CPU?

The interface that logic interacts with DRAM and SRAM through is entirely different.
If you were regularly pulling instructions out of system RAM then you would have a CPU core stalling, doing nothing, for hundreds of clock cycles. A really bad pipeline stall is maybe 20 clocks.

Stop posting. You are too fucking stupid for words. Go to your local shit tier community college and take intro to CS instead of continuing this tirade of ignorance.

What happened? Intel fuckery?

I understand that. What I was saying is that larger low-level caches improve performance because they lower the number of cache misses that occur. When a miss occurs in the last level of cache on the CPU, time is wasted fetching that data from DRAM, which is where HBM would give an enormous advantage over traditional SODIMMs, in the same way that it gives an advantage over GDDR.
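The way I'd frame it is average memory access time: AMAT = hit time + miss rate x miss penalty. Quick sketch below; every figure is a placeholder, and whether HBM actually lowers that penalty compared to a SODIMM is the part you're disputing:

/* AMAT = hit + miss_rate * penalty; all numbers here are placeholders      */
#include <stdio.h>

static double amat(double hit, double miss_rate, double penalty) {
    return hit + miss_rate * penalty;
}

int main(void) {
    double llc_hit = 40.0;  /* assumed last-level-cache hit cost, in cycles */

    printf("300 cyc penalty, 5%% misses: %.0f cycles on average\n",
           amat(llc_hit, 0.05, 300.0));
    printf("200 cyc penalty, 5%% misses: %.0f cycles on average\n",
           amat(llc_hit, 0.05, 200.0));
    printf("300 cyc penalty, 2%% misses: %.0f cycles on average\n",
           amat(llc_hit, 0.02, 300.0));
    return 0;
}

Cutting either the miss rate (bigger caches) or the penalty (faster memory behind them) moves the average.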

HBM IS DRAM
The stack is made of DRAM dies on top of a base layer with a little control logic.

It is only marginally faster than GDDR5 in a few metrics because of how it handles signaling; it is still DRAM. It cannot be compared to SRAM, and you would not use it as a CPU cache. The fastest DRAM is still horrendously slow compared to SRAM.
If swapping an instruction out of your last level cache takes longer than reissuing it then you are directly causing a performance regression.

>and you would not use it as a CPU cache
Intel uses DRAM as L4 cache on Iris Pro CPUs

Do you just hate AMD?

>It is only marginally faster than GDDR5
4 stacks are as fast as 1 stack
This means it's 4 times faster than any other 2d counterpart.
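
Napkin math with the published figures (bus width x per-pin rate / 8, assuming I'm remembering the specs right), which puts Fiji's HBM at roughly 1.3-1.5x a fat GDDR5 setup rather than 4x:

/* peak bandwidth = bus width (bits) * per-pin rate (Gbps) / 8              */
/* figures below are the published specs for Fury X, 390X and 980 Ti        */
#include <stdio.h>

int main(void) {
    double fiji_hbm = 4 * 1024 * 1.0 / 8;  /* 4 stacks, 1024-bit @ 1 Gbps   */
    double r390x    = 512 * 6.0 / 8;       /* 512-bit GDDR5 @ 6 Gbps        */
    double gtx980ti = 384 * 7.0 / 8;       /* 384-bit GDDR5 @ 7 Gbps        */

    printf("Fiji HBM : %.0f GB/s\n", fiji_hbm);
    printf("390X     : %.0f GB/s (Fiji is %.2fx)\n", r390x, fiji_hbm / r390x);
    printf("980 Ti   : %.0f GB/s (Fiji is %.2fx)\n", gtx980ti,
           fiji_hbm / gtx980ti);
    return 0;
}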

No, Intel uses bidirectional eDRAM. It is considerably faster than HBM; it just doesn't offer the same bandwidth. It also uses a ton of power. Unsurprisingly, there are trade-offs to everything.

HBM's trade off is that it wouldn't ever be suitable for a last level cache.
anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3

Bandwidth is not access time and latency, dipshit.

>it is faster
>but it has less fastness
Nice.

>ITT: something I thought about when Fury X launched

>No, intel uses bidirectional eDRAM
That's just signaling; it's still slower than HBM.
It's just on-die DRAM, and stacked DRAM is still faster.
>Bandwidth is not access time and latency, dipshit.
Latency and access time are about the same as HBM's.

Stop mincing words like a fucking moron.
HBM offers more bandwidth; it is not lower latency. eDRAM is not *just* DRAM: it has shorter wire lengths and a lower latency interface with the logic it's feeding than anything else. The article I linked explicitly states the access time for the L4.
HBM is a full 10ns slower.

What if the HBM cache is used differently? Used as an in-between for the CPU and the RAM to make some things faster?
Not that guy, I'm just a tech illiterate retard who made his way onto Sup Forums, so please no hate.

Isn't small size one reason why caches are fast, the other one being that they are made of SRAM instead of DRAM?

not the same guy but KYS

kys

you guys are misspelling kiss, what's the deal??

kYs

Bump