27 Comments
Kamen Rider Blade - Sunday, August 22, 2021 - link
For this 512 GiB Memory Module:
Assuming a Double-Sided Memory Module with a total of 32 Memory Packages (not counting the extra ECC Packages for parity on each row in each half of the Memory Module):
512 GiB / 32 Memory Packages = 16 GiB per Package
16 GiB per Package / 8 Layers per Package = 2 GiB per RAM Die Layer
Does my math sound about right?
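A quick way to check that arithmetic, as a minimal sketch: the 512 GiB capacity, 32 packages, and 8 layers per package are the figures assumed in the comment above, ECC packages are ignored, and the variable names are made up for illustration.

```python
# Figures taken from the comment above; ECC packages are not counted.
MODULE_GIB = 512          # total module capacity in GiB
PACKAGES = 32             # data packages on the double-sided module
LAYERS_PER_PACKAGE = 8    # stacked dies per package

gib_per_package = MODULE_GIB / PACKAGES
gib_per_die = gib_per_package / LAYERS_PER_PACKAGE
gibit_per_die = gib_per_die * 8  # 8 bits per byte

print(f"{gib_per_package:g} GiB per package")
print(f"{gib_per_die:g} GiB ({gibit_per_die:g} Gibit) per die layer")
```

Under those assumptions each die layer comes out at 2 GiB, i.e. a 16 Gibit die.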
Kamen Rider Blade - Sunday, August 22, 2021 - link
Do you think Enterprise Customers would ever want to go with Double-Height Memory Modules for more RAM?
https://www.anandtech.com/show/13694/double-height...
Given the height of the PCB, I can see future DDR5 Memory Modules with more than 2 rows.
Potentially 3, 4, or 5 rows of Memory Packages on each side of the Memory Module, doubled across both sides for obviously a lot more memory.
1 Double-Sided Rows = _256 GiB
2 Double-Sided Rows = _512 GiB
3 Double-Sided Rows = _768 GiB
4 Double-Sided Rows = 1024 GiB
5 Double-Sided Rows = 1280 GiB
Imagine one "Double-Height" Memory Module in an Enterprise Server Rack that contains 1280 GiB, or 1.25 TiB, of RAM.
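The row scaling can be tabulated with a short sketch. It assumes, as the comment does, that every double-sided row adds 256 GiB; the function and constant names are hypothetical, and whether a PCB could physically carry that many rows is a separate question.

```python
GIB_PER_DOUBLE_SIDED_ROW = 256  # assumption taken from the comment above


def module_capacity_gib(rows: int) -> int:
    """Capacity of a module with `rows` double-sided rows of packages."""
    return rows * GIB_PER_DOUBLE_SIDED_ROW


for rows in range(1, 6):
    gib = module_capacity_gib(rows)
    print(f"{rows} double-sided rows = {gib:4d} GiB = {gib / 1024:.2f} TiB")
```

Three rows work out to 768 GiB, and five rows to 1280 GiB (1.25 TiB).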
Wrs - Sunday, August 22, 2021 - link
Bandwidth is a primary issue, such that they’ll be forced to remain wide instead of tall. At some point all that extra RAM is for concurrent tasks on a many-core CPU, not a single user operating on an enormous data set.
Kamen Rider Blade - Sunday, August 22, 2021 - link
True, but wouldn't a Double-Height RAM module using DDR5 help facilitate that?
More RAM on one PCB. Given that there is a finite number of DIMM slots on any MoBo, wouldn't more RAM be better?
Wrs - Sunday, August 22, 2021 - link
The primary factor that limits both capacity and bandwidth is the memory controller (usually on the CPU these days): both its width (number of channels) and the technologies employed (DDR generation, buffering). Most workstation/desktop boards already have 2 DIMMs per channel, and some servers have up to 6 DIMMs per channel; most users don't hit the maximum DIMM capacity per channel. This means commodity DIMM sticks don't really need to get taller; only the motherboard & CPU need to be designed to accept more sticks.
schujj07 - Monday, August 23, 2021 - link
Ice Lake Xeon and Epyc have 8 DIMMs/Channel.
Another issue actually is Z height. 1U servers are 44mm tall and a standard RAM module is 30mm tall. This would mean that a module like this is only useful for 2U or taller servers. That means you won't be able to use it in blades, 2U4Ns, etc., which are popular for compute nodes.
Freeb!rd - Thursday, April 7, 2022 - link
The "Milan" I/O die integrates eight DDR4 memory controllers (UMCs), two per I/O die quadrant, which achieve data rates from 1333 to 3200 MT/s. Up to 2 DIMMs per channel are supported.
May 31, 2021
Kamen Rider Blade - Monday, August 23, 2021 - link
Now if Intel & AMD can go OMI for their memory Interface, they can join IBM in revolutionizing the Memory Connection by going for a Serial connection. Get JEDEC to help standardize that sucker, because having RAM still be a direct parallel connection seems like an outdated concept in 2021.
By going OMI, you'd get more Memory Bandwidth to the CPU thanks to the serial nature of the OMI connection, with the Memory Controller simplified and moved somewhat local to the Memory Module itself, leaving a simple OMI receiver link on the CPU end.
coburn_c - Sunday, August 22, 2021 - link
CL50 huh. So we double the clock speed from 3600 to 7200, then we sit 'round doing nothing for 50 cycles to deal with it. How is this better than the clock at 3600 but only waiting around 16 cycles between operations?
lightningz71 - Sunday, August 22, 2021 - link
Larger caches combined with more aggressive prefetchers have been working together for years now to reduce the impact of DRAM access latency. I suspect that these CL numbers won't be that impactful in practice.
Small Bison - Sunday, August 22, 2021 - link
Because the operations take place over many cycles. A typical 128-bit interface is only transferring 16 bytes of data at a time. Waiting 50 cycles isn’t that big of a deal when it takes 65 *thousand* cycles to transfer just one MiB of data.
We’ve had this latency discussion with every new version of DDR. CAS latency has always been in the 5-8 ns range, and people repeatedly trip up, comparing low latency, overclocked RAM from the previous generation against more conservative server RAM & JEDEC specifications for the new generation.
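The transfer count in that argument can be reproduced with a rough sketch. The 128-bit width and the 50-cycle wait come from the comments above; the two-transfers-per-clock figure is an assumption based on how DDR signalling works, and the names are made up for illustration. It models only one long sequential stream, which is the most favourable case for the argument.

```python
BUS_WIDTH_BITS = 128       # interface width from the comment above
TRANSFERS_PER_CLOCK = 2    # assumption: DDR moves data on both clock edges
MIB = 1024 * 1024          # bytes in one MiB

bytes_per_transfer = BUS_WIDTH_BITS // 8
transfers_per_mib = MIB // bytes_per_transfer
clocks_per_mib = transfers_per_mib // TRANSFERS_PER_CLOCK


def cas_share(cas_cycles: int) -> float:
    """One CAS wait as a fraction of the clocks needed to stream 1 MiB."""
    return cas_cycles / clocks_per_mib


print(f"{bytes_per_transfer} bytes per transfer")
print(f"{transfers_per_mib} transfers ({clocks_per_mib} clocks) per MiB")
print(f"one CL50 wait is {cas_share(50):.2%} of that")
```

The 65 thousand figure corresponds to 65,536 transfers; counted in clock cycles it is half that under the double-data-rate assumption.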
willis936 - Sunday, August 22, 2021 - link
Ceteris paribus, the CAS should be around 32 at this speed to have parity with current mainstream DDR4 latency. What happens every DDR version is that it takes 1-2 years for SKUs to match the latency performance of the previous DDR version. Is it ever a big deal? Not enough for anyone to choose to make a new system on an old platform. Thankfully we're seeing a focus on large LL cache just before DDR5.
Brane2 - Monday, August 23, 2021 - link
Except that:
1. With each generation this is further away from being RANDOM Access Memory.
2. Access time latency IS a bottleneck. Most of the other accesses are being taken care of by the cache and prefetchers. It is whatever misses them that will cause massive stalls.
3. Your example is mostly irrelevant. Programs access data in far smaller chunks than 1MB. And when they do gulp it in Megabytes, they use non-cached access for one-time use. So, such linear, "drag-strip" use is relatively rare.
cp0x - Sunday, August 29, 2021 - link
If you only had one core, that might be important.
Today's machines bottleneck on the bus, and cores are quickly bandwidth starved to DRAM. High level languages make it even worse (slab allocators mean that nothing is ever in cache).
So if we want more than 4-8 cores (8-16 threads) to be running at speed, we need a lot more bandwidth.
Kamen Rider Blade - Sunday, August 22, 2021 - link
The amount of time in each cycle isn't consistent as you go up in Clock Speed.
For DDR4, you should use a table to help figure out the true Effective Latency:
https://www.reddit.com/r/pcmasterrace/comments/cd2...
You'll need a new Table for DDR5 RAM effective Latency as well
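Such a table can be generated rather than looked up. The sketch below uses the standard conversion from CAS cycles to nanoseconds, assuming the I/O clock runs at half the data rate; the function name is hypothetical and the speed/CL pairs are simply the ones mentioned in this thread, not measured kits.

```python
def cas_latency_ns(data_rate_mts: float, cl: int) -> float:
    """First-word CAS latency in nanoseconds.

    The I/O clock runs at half the data rate (two transfers per clock),
    so one cycle lasts 2000 / data_rate_mts nanoseconds.
    """
    return cl * 2000.0 / data_rate_mts


# Speed/CL pairs mentioned in this thread.
kits = [
    ("DDR4-3600 CL16", 3600, 16),
    ("DDR4-4000 CL15", 4000, 15),
    ("DDR5-7200 CL32", 7200, 32),
    ("DDR5-7200 CL50", 7200, 50),
    ("DDR5-7200 CL56", 7200, 56),
]
for name, rate, cl in kits:
    print(f"{name}: {cas_latency_ns(rate, cl):.2f} ns")
```

By this formula CL32 at 7200 MT/s gives the same latency as CL16 at 3600 MT/s (about 8.89 ns), while CL50 at 7200 MT/s is about 13.89 ns.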
Silver5urfer - Sunday, August 22, 2021 - link
We'll have to wait 3-6 years minimum to get solid-spec DDR5; look how long it took for DDR4 to reach its maximum potential. Now in 2021 we have DDR4 4000MHz C15 Dual Rank B-Die kits, and even that at a high 1.5 V.
Kamen Rider Blade - Sunday, August 22, 2021 - link
That's why I'm not in a rush to DDR5.
hansmuff - Sunday, August 22, 2021 - link
True enough. I'm not looking at DDR5 before 2024.
Wereweeb - Monday, August 23, 2021 - link
Having enough bandwidth for your application, and a large enough DRAM cache so that you don't need to tap into NAND, are much more important than having the lowest latency.
Brane2 - Monday, August 23, 2021 - link
Meaningless marketing cr*p.
Who cares about inter-die distance etc.
What I want to know is what do they intend to do about FRIGGIN CL=56.
56 cycles to start reading a row.
Yes, it's at 7200MHz, but still.
Speeding up transfers doesn't mean that much if real latency is the same or worse...
Same for the rest of the cr*p. ODECC isn't there because they just needed to do some cool stuff.
It's there because with higher density cell reliability sucks, so they had to do something about it.
And they did the minimum that they can get away with.
So they shouldn't be selling it as something "extra".
TheinsanegamerN - Tuesday, August 24, 2021 - link
Everyone said the same thing about DDR4, and DDR3, and DDR2, and even the original DDR. Just chill bud. It'll get fixed within a year or so.
Spunjji - Tuesday, August 24, 2021 - link
"And they did the minimum that they can get away with"
10^-6 isn't "the minimum", it's a massive improvement.
calc76 - Monday, August 23, 2021 - link
DDR5 UDIMMs are supposed to eventually scale up to 128 GB, not just 64 GB, as noted in AnandTech's article from Jul 14, 2020.
back2future - Monday, August 23, 2021 - link
What's missing is interaction between the kernel and memory to sort memory pages (by an (AI) algorithm) by low and high demand into low- and lowest-latency circuits (combining DDR4 and DDR5? Worth the effort?), and maybe freeing the oldest memory pages if OOM would otherwise be triggered?
Oxford Guy - Monday, August 23, 2021 - link
Typo?
‘Samsung states that the introduction of Same-Bank refresh (SBR) into its DDR5 will increase the efficiency of the DRAM bus connectivity by almost 10%, with DDR4-4800 showing the best efficiency in terms of energy from bit.’
I assume you meant DDR5-4800.
Athlex - Monday, September 13, 2021 - link
That's pretty much exactly what we had back in the bad old days of PC133 and DDR1 TSOP DIMMs. Memory sockets mounted at 45 degree angles to deal with the Z-height.
pogsnet - Wednesday, September 22, 2021 - link
Meanwhile, VPS hosting services still offer 1 GB plans or lower, which are no longer adequate for today's use.