14 Comments
tipoo - Monday, November 30, 2015 - link
All very interesting, particularly as I thought that not-8MB would have been helping with the large GPU gains. /None/ on A9X? Very, very interesting.
xype - Monday, November 30, 2015 - link
Could that be due to the bigger bandwidth on the A9X? (sorry if the question is stupid, I’m not a CPU guy, just graphics designer who reads AT :P)
xype - Monday, November 30, 2015 - link
Never mind, just read the same speculation in the other article…
asendra - Monday, November 30, 2015 - link
This is super interesting, and better explains why there isn't one in the A9X.
Any rough estimate on the full A9X review (with SPEC2006 results :P)?
jasonelmore - Monday, November 30, 2015 - link
I knew it: people were giving Apple too much credit for matching L3 cache sizes with Core i7 CPUs while keeping the die so small.
Everyone guessed wrong, and this teaches us that even Geekbench reports are to be taken with a grain of salt.
I am surprised they gimped the A9X with no L3 at all. Something more has to be going on here, or did Apple seriously just not have enough die space with all that GPU?
Regarding the metal pitch being 20nm even on a 16nm process: this has me worried about future Nvidia and AMD GPUs being built at TSMC. Both GPU vendors tried 20nm internally on prototypes and saw negligible benefits, which is why they skipped it entirely.
extide - Monday, November 30, 2015 - link
The problem with the 20nm node was that it was not FinFET -- and thus the leakage was super high, making that process unsuitable for large chips running at high clock speeds. It was purely a problem with the FEOL (transistors), not the BEOL (metal layers). With the 16FF node -- this will not be an issue.
name99 - Monday, November 30, 2015 - link
Way to miss the point along multiple dimensions.
(a) What Apple is doing here is likely more sophisticated than what Intel is doing. The tricky point is not having an exclusive or victim cache; it is maintaining coherence between the GPUs and CPUs. An inclusive cache is a simple (but inefficient) solution to this problem. A more efficient solution is to use a directory (which you can imagine as something like the L3 holding a whole lot of additional tags without lines attached, tags describing the contents of the CPU and GPU L2 caches).
It is somewhat unclear just how much coherency both the Apple/ARM/Imagination world and the Intel world offer between GPUs and CPUs today. The impression I get is that such coherency as exists in TODAY's products is limited; but all parties want this to move to full coherence ASAP.
Given how fast Apple is moving, I suspect there is a lot happening on the successive A-series chips that's not 100% ready for shipping, and which is not exposed to users/developers, but which was implemented for testing and experimentation, to inform the future; and I suspect that the A9 L3 in part fits into this category. (Which in part explains why it wasn't on the A9X: it can perform a secondary job of testing Apple's directory implementation and clarifying any holes in the implementation/protocol by being on just one SoC; and clearly the combination of GPU L2 and wider memory bus is good enough for the iPad Pro to perform well.)
(b) WTF does Geekbench have to do with any of this? GB does not report L3 sizes. GB does not claim to be testing L3 performance.
You have the causality exactly backwards. The A9 posted spectacular GB (and every other benchmark) results, and we were immediately told by certain noisy individuals on the internet that this was PURELY because Apple had placed a massive 8MiB L3 on the CPU in order to game GB.
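To make the directory idea in (a) a bit more concrete, here is a minimal toy sketch in Python. It is not Apple's (or anyone's) actual protocol: the `Directory` class, the cache names `cpu_l2` and `gpu_l2`, and the address are all hypothetical, chosen only for illustration. It just shows a structure that stores tags without data and uses them to work out which other cache must be invalidated on a write.

```python
from collections import defaultdict


class Directory:
    """Toy coherence directory: tags only, no cache-line data attached."""

    def __init__(self):
        # Maps a line address (the tag) to the set of caches holding a copy.
        self.sharers = defaultdict(set)

    def read(self, cache, addr):
        """Record that `cache` now holds a shared copy of `addr`."""
        self.sharers[addr].add(cache)

    def write(self, cache, addr):
        """Make `cache` the sole holder; return the caches to invalidate."""
        to_invalidate = self.sharers[addr] - {cache}
        self.sharers[addr] = {cache}
        return to_invalidate

    def evict(self, cache, addr):
        """Forget that `cache` holds `addr`; drop the tag if nobody does."""
        self.sharers[addr].discard(cache)
        if not self.sharers[addr]:
            del self.sharers[addr]


# Hypothetical usage: both L2 caches read a line, then the CPU writes it.
directory = Directory()
directory.read("cpu_l2", 0x1000)
directory.read("gpu_l2", 0x1000)
stale = directory.write("cpu_l2", 0x1000)
print(sorted(stale))  # ['gpu_l2']
```

The point of the sketch is that the directory never needs to hold the line itself, which is where the die-space saving over a fully inclusive cache would come from.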
DarkXale - Monday, November 30, 2015 - link
>and we were immediately told by certain noisy individuals on the internet
Who? Again, GB not testing L3 (or L2) is well known.
hlovatt - Monday, November 30, 2015 - link
There were a number of comments after Linus Torvalds of Linux fame said, referring to Geekbench 3 (GB3), "I suspect most of them have a code footprint that basically fits in a L1I cache." This has led other people to say that the Apple processors only give good GB3 scores because of large caches, L1 through L3. Note that Linus only mentioned the L1 cache and only commented on GB3; others expanded this to all caches and to Apple.
Fmehard - Tuesday, December 1, 2015 - link
Those are wild and rosy guesses based on nothing more than your hunch ("Given how fast Apple is moving..") Fanboi much?
zeeBomb - Tuesday, December 1, 2015 - link
Talk about learning something new every day.
Samus - Tuesday, December 1, 2015 - link
It isn't surprising Apple has the engineering potential to come up with these radical designs, but at the same time, it is surprising, because a lot of these designs are off-the-wall unnecessary. Is 4MB of victim cache worth the die space? A9X says...no. So it seemed like an experiment, if anything.
Constructor - Tuesday, December 8, 2015 - link
A9X has double the main RAM bandwidth of the A9, which plausibly makes an L3 cache less of a priority, if not redundant.
tipoo - Wednesday, December 9, 2015 - link
It's speculated in both articles that it makes more sense in the A9, which is bound by a much much smaller smartphone battery, since it's thrashing the main RAM less with a cache. In the iPad Pro where the SoC becomes a small fraction of the power draw of the display on the larger battery, they can just go ahead and make all those RAM accesses.