So with the ps4 using gddr5 as system memory and vram, it got me thinking.

Would it be possible for an APU to have HBM as system ram and vram too?

What would the limitations be? What would the advantages be?

It could be partitioned
>4GB VRAM
>12GB system RAM

16GB total memory.

Maybe even for gaming laptops.

IIRC AMD is going to do HBM on some APUs eventually as a replacement for L3$/L4$

Imagine 4GB of L3$...

Holy shit

>Imagine 4GB of L3$

Yes
limitations:
you're capped at whatever's there
adding any onboard stuff would slow it down
exceptionally expensive
CPUs don't like high latencies.

Imagine 4GB of being a poorfag cuck who can't afford a decent/superior Nvidia product

The 1080 is the best GPU ever made, and it would be a mistake not to buy one

>high latency/high bandwidth ram as cache
do not want.jpg

It'll be good for laptops though. You won't need more than 16GB and the ram slots could be used to squeeze in a bigger cooler for the APU.

Not sure about latency. How did Sony do it?

>high latency
current APUs don't even have an L3$. Even if it has higher latency, it'd still be faster to keep decoded uops in a DRAM-like memory than to discard them and decode them all over again.

>current APUs don't have L3 cache
LOL
damn I like AMD but this is too much.
A DRAM cache still won't work like normal caches do, though.

Y tho

Yes, they'll probably move that direction for their mobile APUs. They'd be wasting system power by having both HBM on package and DDR4 soldered onto the board.

HBM is far, far too slow to act as a CPU cache. That isn't happening.

Literally everything you posted there is wrong. How long it takes to reissue an instruction can be measured in clock cycles, the same is true of how long it takes to access DRAM through the memory controller.
HBM as a last level cache would be so slow that it would have a negative impact on performance. DRAM and SRAM are in entirely different leagues.
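Quick napkin math to put those clock cycles side by side. Minimal sketch in C; every latency in it is a ballpark guess, not a vendor spec:

/* back-of-envelope only: all latencies below are rough guesses, not specs */
#include <stdio.h>

int main(void) {
    double ghz      = 4.0;    /* assumed core clock                         */
    double ns_l3    = 10.0;   /* typical-ish on-die SRAM L3 hit             */
    double ns_edram = 32.0;   /* roughly what Intel's L4 eDRAM gets quoted  */
    double ns_dram  = 80.0;   /* system DRAM through the memory controller  */

    printf("L3 hit   : ~%.0f cycles\n", ns_l3    * ghz);
    printf("eDRAM L4 : ~%.0f cycles\n", ns_edram * ghz);
    printf("DRAM     : ~%.0f cycles\n", ns_dram  * ghz);
    return 0;
}

Re-decoding something that's already sitting in the L1i costs a handful of cycles; anything that has to come back through a DRAM interface costs tens to hundreds.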

Except people do it all the time, what do you think 1T-SRAM is?
Hint: It's not SRAM.

We're not talking about replacing L1/L2 with DRAM, but rather adding an extra level of DRAM cache (which btw has far lower latency than DDR4 system RAM).

>>current APUs don't have L3 cache
>LOL
They actually don't though...

Wow you got me, I'm buying 4 1080s right now.

Look at the performance difference between Intel i7s and similarly clocked/cored/threaded Xeons.
More cache == better performance.

The HBM on Fiji didn't do a whole lot for gaming performance because that's limited by the core count/speed above a certain memory throughput. The compute performance of Fiji is insaaaaaaane.

>1T-SRAM
basically dram ^:)

>They actually don't though...
no i believe you (i hope you're right) and that is pretty shocking. no wonder APUs are kinda shitty.

Just looked it up, oh boy.

Stupid question though: if cache is so important, why not add more? Why not L4 and L5 cache?

Is compute performance usually more bandwidth-limited than gaming?

I've done a little GPGPU programming and I know that memory access is the major bottleneck, but I've never done any game programming.

Adding more caches means copying data between them.
The less you have to copy between caches, the lower the latency for memory requests, and the less time the CPU spends waiting for the data to arrive. Ideally, you would have a single pool of memory that the CPU accesses directly (which is how simple microcontrollers work).

I'm not sure on the specifics, but gaming is more dependent on ROP and shader speed, and is less concerned with throwing data around than with doing really fast calculations over small pieces of data.

Then why not one larger L1 cache instead of several smaller L1, L2, and L3 caches?

L1 cache is hella expensive and hard to manufacture without defects. It has to be right next to the core on the die so that you can keep clock speeds high, as you're limited by the speed of light in how far you can send data across the chip at blazing speeds.

The larger a cache is, the slower it is. This is why the fastest SRAM arrangements are always the smallest and closest to logic. L1 is about an order of magnitude faster than L2. This isn't by accident.
The L1 is hit more often, so it needs to be faster. L2 is still hit often, but larger working sets land there, so you need more of it, even if it's private. L3 is larger still, and an order of magnitude slower than L2, but it tends to be shared between cores.
How associative the cache is also has a huge impact on performance.

The more stages in a cache hierarchy, the less often they're hit. It's diminishing returns.
You spend more and more transistors on something that offers you no performance in the end, and it still draws power.
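If anyone wants to see that latency ladder first-hand, here's a crude pointer-chase sketch in C. The sizes, hop count, and shuffle are arbitrary choices for illustration, not a proper benchmark; build with -O2 and expect the ns/load figure to step up each time the working set outgrows a cache level:

/* crude cache-latency ladder: chase random pointers through working sets
   of increasing size; results are machine-dependent, this is just a sketch */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    for (size_t kib = 16; kib <= 65536; kib *= 4) {   /* 16 KiB .. 64 MiB  */
        size_t n = kib * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof *next);
        if (!next) return 1;

        /* Sattolo's shuffle: one big cycle, so the prefetcher can't help   */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        size_t p = 0, hops = 20 * 1000 * 1000;
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (size_t i = 0; i < hops; i++) p = next[p];   /* the chase       */
        clock_gettime(CLOCK_MONOTONIC, &b);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("%6zu KiB: %5.1f ns/load (sink %zu)\n", kib, ns / hops, p);
        free(next);
    }
    return 0;
}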

You cannot compare how a GPU uses VRAM memory bandwidth to how a CPU uses its caches.
Fiji's biggest bottleneck is pixel throughput from the same limited ROP configuration featured in Hawaii.


As always this board is filled with tech illiterate retards who still feel the need to talk out of their asses.

>You cannot compare how a GPU uses VRAM memory bandwidth to how a CPU uses its caches
You can compare it almost directly to how something like L3 cache works, which is very similar to how AMD APU CPU cores utilize DRAM.

No, you can't. VRAM swaps geometry and texture data primarily, not ops.
GPUs have their own L1 and L2 caches. DRAM and SRAM are not used for the same things. Stop trying to participate in a conversation you have no business being in.

It depends. It's just as conceivable that a graphics workload can eat as much bandwidth; it's just that in gaming most data is forwarded from the command processors into caches and fed into the queues, so most of the bottlenecking occurs in texture mapping or rasterization. Those stages take much longer, require signaling to be offloaded, and can be more or less serial rather than parallelizable, unlike most GPGPU operations intended to be offloaded to a coprocessor. If you write the code right, cards can usually saturate the compute queue and burn through the entire compute stack faster than you can feed it work. The Fiji cards are in fact ideal for compute-only parallel scenarios because they don't have that many ROPs but do have a huge number of shaders, which wouldn't be bottlenecked by the command processor if it were fed only a long compute queue.
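Napkin roofline math on that, using Fiji's paper specs; treat the kernel figure as illustrative, nothing here was measured:

/* roofline napkin math: Fury X paper specs, example kernel intensity       */
#include <stdio.h>

int main(void) {
    double flops_peak = 8.6e12;               /* ~8.6 TFLOPS FP32           */
    double bw_peak    = 512e9;                /* 512 GB/s of HBM            */
    double balance    = flops_peak / bw_peak; /* FLOP per byte to stay busy */

    double saxpy = 2.0 / 12.0;  /* y = a*x + y: 2 FLOPs per 12 bytes moved  */

    printf("machine balance : %4.1f FLOP/byte\n", balance);
    printf("SAXPY intensity : %4.2f FLOP/byte -> hopelessly bandwidth bound\n",
           saxpy);
    printf("above ~%.0f FLOP/byte the shaders, not the HBM, are the limit\n",
           balance);
    return 0;
}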

Data is Data.
Can you explain to me why copying ops is different to something like texture data??

INSTRUCTIONS ARE EXTREMELY TIMING SENSITIVE YOU TECH ILLITERATE RETARD

Memory controllers can spend over 500 clock cycles processing some blocks based on their size. The data moved through VRAM is not timing sensitive. If a CPU core were to wait 100 clock cycles on an instruction every time it was fetched it would cripple performance. You'd have Bulldozer performance.
Stop talking out of your ass.
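Rough CPI math on why that matters; the rates here are made up purely to illustrate the point:

/* back-of-envelope CPI impact of long fetch stalls; all rates are made up  */
#include <stdio.h>

int main(void) {
    double cpi_base  = 1.0;    /* assume ~1 instruction per clock otherwise */
    double miss_rate = 0.01;   /* say 1 in 100 fetches goes out to DRAM     */
    double stall     = 200.0;  /* cycles lost on each such fetch (a guess)  */

    double cpi = cpi_base + miss_rate * stall;
    printf("CPI goes from %.1f to %.1f -- roughly %.0fx slower\n",
           cpi_base, cpi, cpi / cpi_base);
    return 0;
}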

What card is that OP?

>data moved through VRAM is not timing sensitive
That'll be why people don't OC VRAM then.

File name

>HBM die
>2x8-pin connectors
>short board
I wonder what?

Are you kidding?

The amount of bandwidth a GPU needs is directly proportional to the throughput of its ALUs. You overclock VRAM along with the core clock so performance scales as linearly as possible without incurring any bottleneck.

This has absolutely nothing to do with how timing sensitive the workload is. You're still trying to equate DRAM and SRAM. They are not comparable. The data read and written to them is entirely different.

It's about time you stopped posting.

Bump

>Nvidia makes x86 CPUs

They tried once.

Really?

So, you're saying that slow DRAM makes the GPU wait for data, and that slow SRAM makes the CPU wait for data, and that there are no similarities?
Say I have a cache miss in L2, and I have to go fetch that data from DRAM. You're telling me that DRAM on a motherboard, behind a bunch of controllers, is no different from putting that DRAM right next to the CPU?

The interface that logic interacts with DRAM and SRAM through is entirely different.
If you were regularly pulling instructions out of system RAM then you would have a CPU core stalling, doing nothing, for hundreds of clock cycles. A really bad pipeline stall is maybe 20 clocks.

Stop posting. You are too fucking stupid for words. Go to your local shit tier community college and take intro to CS instead of continuing this tirade of ignorance.

What happened? Intel fuckery?

I understand that. What I was saying is that larger low-level caches improve performance because they lower the number of cache misses that occur. When a miss occurs in the last level of cache on the CPU, time is wasted fetching that data from DRAM, which is where HBM would give an enormous advantage over traditional SODIMMs, in the same way that it gives an advantage over GDDR.
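The way I'd frame it is average memory access time: AMAT = hit time + miss rate x miss penalty. Quick sketch below; every figure is a placeholder, and whether HBM actually lowers that penalty compared to a SODIMM is the part you're disputing:

/* AMAT = hit + miss_rate * penalty; all numbers here are placeholders      */
#include <stdio.h>

static double amat(double hit, double miss_rate, double penalty) {
    return hit + miss_rate * penalty;
}

int main(void) {
    double llc_hit = 40.0;  /* assumed last-level-cache hit cost, in cycles */

    printf("300 cyc penalty, 5%% misses: %.0f cycles on average\n",
           amat(llc_hit, 0.05, 300.0));
    printf("200 cyc penalty, 5%% misses: %.0f cycles on average\n",
           amat(llc_hit, 0.05, 200.0));
    printf("300 cyc penalty, 2%% misses: %.0f cycles on average\n",
           amat(llc_hit, 0.02, 300.0));
    return 0;
}

Cutting either the miss rate (bigger caches) or the penalty (faster memory behind them) moves the average.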

HBM IS DRAM
The stack is made of DRAM dies on top of a base layer with a little control logic.

It is only marginally faster than GDDR5 in a few metrics because of how it handles signaling; it is still DRAM. It cannot be compared to SRAM, and you would not use it as a CPU cache. The fastest DRAM is still horrendously slow compared to SRAM.
If swapping an instruction out of your last level cache takes longer than reissuing it then you are directly causing a performance regression.

>and you would not use it as a CPU cache
Intel uses DRAM as L4 cache on Iris Pro CPUs

Do you just hate AMD?

>It is only marginally faster than GDDR5
4 stacks are as fast as 1 stack
This means it's 4 times faster than any other 2d counterpart.
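
Napkin math with the published figures (bus width x per-pin rate / 8, assuming I'm remembering the specs right), which puts Fiji's HBM at roughly 1.3-1.5x a fat GDDR5 setup rather than 4x:

/* peak bandwidth = bus width (bits) * per-pin rate (Gbps) / 8              */
/* figures below are the published specs for Fury X, 390X and 980 Ti        */
#include <stdio.h>

int main(void) {
    double fiji_hbm = 4 * 1024 * 1.0 / 8;  /* 4 stacks, 1024-bit @ 1 Gbps   */
    double r390x    = 512 * 6.0 / 8;       /* 512-bit GDDR5 @ 6 Gbps        */
    double gtx980ti = 384 * 7.0 / 8;       /* 384-bit GDDR5 @ 7 Gbps        */

    printf("Fiji HBM : %.0f GB/s\n", fiji_hbm);
    printf("390X     : %.0f GB/s (Fiji is %.2fx)\n", r390x, fiji_hbm / r390x);
    printf("980 Ti   : %.0f GB/s (Fiji is %.2fx)\n", gtx980ti,
           fiji_hbm / gtx980ti);
    return 0;
}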

No, Intel uses bidirectional eDRAM. It is considerably faster than HBM; it just doesn't offer the same bandwidth. It also uses a ton of power. Unsurprisingly, there are trade-offs to everything.

HBM's trade off is that it wouldn't ever be suitable for a last level cache.
anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3

Bandwidth is not access time and latency, dipshit.

>it is faster
>but it has less fastness
Nice.

>ITT: something I thought about when Fury X launched

>No, intel uses bidirectional eDRAM
That's just signaling; it's still slower than HBM.
It's just on-die DRAM, and stacked DRAM is still faster.
>Bandwidth is not access time and latency, dipshit.
Latency and access time are about the same as HBM's.

Stop mincing words like a fucking moron.
HBM offers more bandwidth; it is not lower latency. eDRAM is not *just* DRAM: it has shorter wire lengths and a lower latency interface with the logic it's feeding than anything else. The article I linked explicitly states the access time for the L4.
HBM is a full 10ns slower.

What if the HBM cache is used differently? Used as an in-between for the CPU and the RAM to make some things faster?
Not that guy, I'm just a tech illiterate retard who made his way onto Sup Forums, so please no hate.

Isn't small size one reason why caches are fast, the other one being that they are made of SRAM instead of DRAM?

not the same guy but KYS

kys

you guys are misspelling kiss, what's the deal??

kYs

Bump