quorm - Friday, January 31, 2020 - link
Would this be Turing or Ampere based?
Ryan Smith - Friday, January 31, 2020 - link
Officially, we don't know. However, given the timing, I would imagine it would be based on NVIDIA's next-gen architecture.
SaberKOG91 - Friday, January 31, 2020 - link
I'm going to predict that this will be Ampere, but that Ampere will not see large changes in its design from Turing. It'll be a 7nm die shrink, with PCIe 4.0 and faster RT performance. The number of shaders will probably go up at the highest SKUs. Tensor Cores will be unchanged, but they'll market some new software feature to make them more useful in consumer workloads. And in order to make the launch spicier, Nvidia will announce something related to the display engine, most likely DisplayPort 2.0 or HDMI 2.1.
Kevin G - Friday, January 31, 2020 - link
It's kind of a surprise that PCIe 4.0 isn't active in Turing: Volta supports the PHY speeds for NVLink to POWER9 chips.
More ALUs and tensor cores are pretty much a given at the high end, but I don't see much of a change in the middle. Rather, Nvidia is just going to reap the rewards of smaller die sizes to boost profit margins, while the performance benefits stem mainly from clock speed increases. Recall that the GV100 die is a record holder at 818 mm^2, with the TU102 coming in north of 650 mm^2. Those are stupidly large sizes, and a die shrink is pretty much necessary to return to yield sanity.
One variable is whether the high end still throws a lot of silicon at compute but follows a chiplet philosophy, with the product comprising more than one die plus HBM memory. This would be the more interesting solution in terms of design, as the cost penalty for exotic packaging and design has already been paid, but it scales up very, very well.
I think HDMI 2.1 is a sure thing at this juncture. DP 2.0 is a bit of a wild card and may only appear on the last of the chips introduced in this generation. It'd work out in a weird way, as the low-end chips often get tasked with high-resolution digital signage in the professional space.
SaberKOG91 - Friday, January 31, 2020 - link
Just to be clear, Tensor Cores don't exist as separate logic. They are just groups of 8 shaders being used together to carry out operations more efficiently than if they were doing the same calculations as individual instructions. Same thing goes for RT cores. They are just lots of shaders working together with a little bit of extra logic to carry out BVH efficiently.
We won't see MCM for Ampere, that won't come until Hopper.
p1esk - Friday, January 31, 2020 - link
> Tensor Cores don't exist as separate logic
Do you have a link to back that up? You might be right, but I can't find it in Nvidia docs.
Yojimbo - Saturday, February 1, 2020 - link
NVIDIA isn't going to publish its trade secrets.
Yojimbo - Saturday, February 1, 2020 - link
I think you went too far from one extreme (believing the cores were entirely separate new blocks of transistors) to the other (seeming to believe that the addition of the tensor cores and RT cores is more or less trivial). Most likely the tensor cores are made by giving the shaders alternate, specialized data pathways to enable efficient macro operations. But we don't know that for sure. As for the RT cores, they haven't been around as long and I think even less is known about them. Ideally NVIDIA would want to reuse as much of the compute cores as possible. But, to my understanding, incoherent rays present a big problem for the standard data pathway of SIMD/SIMT hardware. The key thing for both the tensor cores and the RT cores is the data pathways: reducing the I/O bandwidth. But that is the key part of the engineering of the SIMT hardware as well. It's relatively easy to build number-crunching MAC units. It can be done cheaply and quickly with just a few engineers, as seen with all the AI chip startups and Tesla. But building a GPU that is able to efficiently use those units for general-purpose compute in an offload model is a much bigger engineering challenge. The same can be said for the tensor cores and the RT cores. Those data pathways ARE the heart of the cores. They aren't just "a little bit of extra logic".
SaberKOG91 - Saturday, February 1, 2020 - link
I spent a lot of time reading up on it after our back and forth and found various developers talking about their experiences with Turing on Reddit and other places. These are folks who actually spend the time tracing wavefronts as they are scheduled and staring at assembly code. Their experience changed my mind, which happens occasionally :p These folks have been able to confirm that you can't use Tensor Cores and vanilla CUDA Cores simultaneously, which pretty much proves that Nvidia is breaking each SM into 8 groups of 8 shaders, with each group able to be configured as a Tensor Core. As far as RT cores are concerned, they have also been able to confirm that using RT at all means that the entire SM becomes unavailable for other operations. They also insist that there are new instructions for BVH that would require new hardware. BVH is the part of RT that can't efficiently be calculated by FP operations. Nvidia are using optimized hardware to do the BVH traversal and then using a compressed form of the traversed path to determine the sequence of vector transforms that are necessary to get from the source to the camera. The transforms, and any subsequent application of material properties, are all SIMD-optimized tasks. BVH hardware, relative to the rest of the computational resources within an SM, is a tiny fraction of the area that is distributed across groups of shaders. Not some huge chunk like the Nvidia marketing material would have you believe. This is all smoke and mirrors, hiding the fact that Nvidia's approach to ray-tracing is actually quite similar to the one AMD has proposed, where they also use the shaders for the bulk of the computation and use other hardware resources to perform the BVH.
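For readers wondering why BVH traversal gets singled out in comments like the one above, a minimal, generic sketch of a stack-based BVH walk is below. This is an illustrative toy in Python; the node layout and names are invented for the example, and it is not a claim about NVIDIA's (or AMD's) actual implementation. The point it illustrates is that traversal is mostly branching and pointer-chasing with only a few arithmetic operations per node, which is exactly the sort of data-dependent control flow that maps poorly onto wide SIMD/SIMT FP units.

```python
# Toy bounding volume hierarchy (BVH) traversal sketch -- illustrative only,
# not NVIDIA's RT core implementation. Node layout and names are made up.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    bbox_min: Tuple[float, float, float]   # axis-aligned bounding box
    bbox_max: Tuple[float, float, float]
    left: Optional["Node"] = None           # interior nodes have both children
    right: Optional["Node"] = None
    tri_ids: Optional[List[int]] = None     # leaf: triangle indices to test

def hit_aabb(origin, inv_dir, bmin, bmax) -> bool:
    """Standard slab test: does the ray intersect the box?"""
    tmin, tmax = 0.0, float("inf")
    for axis in range(3):
        t1 = (bmin[axis] - origin[axis]) * inv_dir[axis]
        t2 = (bmax[axis] - origin[axis]) * inv_dir[axis]
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

def traverse(root: Node, origin, direction) -> List[int]:
    """Walk the tree with an explicit stack, collecting candidate triangles.
    Each incoherent ray takes its own, data-dependent path through the tree,
    which is the divergence the comment above is talking about."""
    inv_dir = tuple(1.0 / d if d != 0.0 else float("inf") for d in direction)
    stack, candidates = [root], []
    while stack:
        node = stack.pop()
        if not hit_aabb(origin, inv_dir, node.bbox_min, node.bbox_max):
            continue                        # prune this subtree
        if node.tri_ids is not None:        # leaf: hand triangles to shaders
            candidates.extend(node.tri_ids)
        else:                               # interior: keep descending
            stack.append(node.left)
            stack.append(node.right)
    return candidates
```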
Yojimbo - Saturday, February 1, 2020 - link
Informative. The information on the compressed BVHs can be found in a 2017 NVIDIA research paper, BTW. But there is no guarantee that that is what Nvidia are using. I am skeptical that anyone can know because the underlying operations are not exposed. Did those forums explain exactly how they could determine that?
I also think it's inappropriate both to call it smoke and mirrors and to say it is the same as an implementation that we have even less information on (AMD's).
I don't think we know how much space the RT cores take up, and I don't think NVIDIA ever claimed any percentage of die space. People took Nvidia's schematics as representing relative die areas. I have argued with various people about that since Turing was introduced. I never took those schematics as being representative of die areas and I'm not sure why people did. I mean, look: https://images.anandtech.com/doci/13214/Turing_575...
Would the NVLink block look just like the decode block, just like the tensor cores, etc? Then there were also the colored ones which were drawn as 1/3,1/3,1/3 of SMs, tensor cores, and RT cores. They were just meant to highlight features in an appealing way. They were never meant to give insight on the workings or layout of the chip or to make any claims of die area.
Various parts of the chip likely need to be changed in order to accommodate what are almost certainly different memory access patterns. Perhaps they need a more general solution, or multiple pathways, changes in the cache, etc. Additionally, it will probably be 4 years between when NVIDIA introduced tensor cores and when AMD have an equivalent. If they were smoke and mirrors then AMD wouldn't have let their AI opportunities wallow in the mud for such a long time.
MASSAMKULABOX - Sunday, February 2, 2020 - link
IF tensor cores were just grouped shaders, then surely the performance of those could be put to use in non-RT games; that doesn't seem to be evident in any of the benchmarks, or even Firestrike/Furmark. I mean, you could be right... I thought RT logic was unused in non-RT games. Maybe I'm hopelessly confused with RT, tensor, AI acceleration, etc. etc.
edzieba - Sunday, February 2, 2020 - link
"Just to be clear, Tensor Cores don't exist as separate logic. They are just groups of 8 shaders being used together to carry out operations more efficiently than if they were doing the same calculations as individual instructions. "This is incorrect. Not only can Tensor cores and CUDA cores operate simultaneously, there are not a sufficient number of CUDA cores to perform the number of simultaneous operations that the Tensor cores perform. 64x operations per Tensor core to perform a single 4x4 FMA operation, not 8x operations, because that's not how matrix math works. This has been proven by independant benchmarking (e.g. Citadels microarchitecture analysis: https://arxiv.org/pdf/1804.06826.pdf found Tensor TFLOPS to match the published values for Tensor cores, a peak impossible to achieve with the number of CUDA cores on the die).
The reason a Tensor core can do so many individual operations without taking up too much die area is because they are fixed-function: a Tensor core will do an FMA operation on predetermined-size input and output matrices, and that's all they'll do. If you feed them a scalar (packed into a matrix) then most of the core will be doing nothing. A CUDA core, on the other hand, is a multipurpose engine; it can perform a huge variety of operations at the cost of increased die area per core. You could take 64 CUDA cores and perform the same 4x4 FMA operation that a Tensor core performs, but it would both take longer (due to the need to corral 64 cores' worth of memory operations) and use up a vastly larger die area. You CANNOT take a Tensor core and split it into 64 individually addressable units, however.
Note that at a hardware level, a 'Tensor core' sits within a sub-core within an SM, alongside the general purpose math engines for FP and INT operations. For Volta, the Tensor core and one of the GP math cores could operate concurrently. For Turing, the FP and INT cores can also operate concurrently alongside the Tensor core.
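To make the operation-count argument in the comments above concrete, here is a plain reference version of the 4x4 matrix fused multiply-add (D = A x B + C) in Python. It is purely illustrative, not a model of the hardware; counting the multiply-adds shows where the 64-operations-per-matrix figure comes from: 16 output elements times 4 MACs each.

```python
# Reference 4x4 matrix fused multiply-add: D = A @ B + C.
# Purely illustrative -- this is the math a Tensor core is said to perform
# per clock, not a model of how the hardware actually does it.

def matrix_fma_4x4(A, B, C):
    """A, B, C are 4x4 lists of lists; returns D and the FMA count."""
    n = 4
    fma_count = 0
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]                 # start from the accumulator input
            for k in range(n):
                acc += A[i][k] * B[k][j]  # one fused multiply-add
                fma_count += 1
            D[i][j] = acc
    return D, fma_count

if __name__ == "__main__":
    I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    Z = [[0.0] * 4 for _ in range(4)]
    _, fmas = matrix_fma_4x4(I, I, Z)
    print(fmas)  # 64 -- i.e. 16 output elements x 4 multiply-adds each
```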
Yojimbo - Sunday, February 2, 2020 - link
Let's look at the RTX 2060. It has 1920 32-bit stream processors and 240 tensor cores. That enables 3840 FP16 FMA operations per clock. According to NVIDIA's blog on tensor cores, "Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock". 240x16 = 3840. A coincidence? Maybe, but there are certainly a sufficient number of CUDA cores to handle the simultaneous operations.
The caveat I see here is that the Tensor cores include a 32-bit accumulate. Perhaps each FP32 ALU that is used to initially hold two FP16 inputs can successfully hold the FP32 result with the correct circuitry. There then needs to be an FP32 accumulator to add all the correct products together. I really don't know anything about electronic engineering. I guess they can cascade the results somehow to use the accumulation circuitry of the cores to take their own product results and add them to a cascading sum of other cores' product results. The first core in the line would add the already-stored result from the neural network to the first product, then the second in line would take that combined value and add it to its own product result, etc. This all needs to be done in one clock cycle. It would demand some sort of sub-cycle timing so that it can all be performed. But maybe the core clock of the GPU is in reference to the scheduler and warp operations, and the internal core operations can have more finely grained timings. (??)
SaberKOG91 - Monday, February 3, 2020 - link
Yeah, I don't know where they are getting their math from either. Each tensor core is able to do 64 FMA ops per cycle with FP16 inputs. This works out to exactly 8 FMA operations per CUDA core, with a row-column dot product requiring 4 multiplies and 3 additions, plus 1 more addition for the final accumulate (4x FMA). This allows 8 shaders to compute the full 4x4 tensor FMA in 1 cycle. It also holds that the FP16 Tensor TFLOPS are almost exactly 8x the FP32 TFLOPS, but only 4x the FP16 TFLOPS. With some clever use of data locality in the hardware, I can absolutely see them getting these kinds of results.
Yojimbo - Sunday, February 2, 2020 - link
I also scanned the introduction (summary and chapter 1), conclusion, and tensor core sections of the paper you posted and didn't see them make your claimed conclusions: 1) tensor cores and CUDA cores can operate simultaneously, or 2) there are not a sufficient number of CUDA cores to perform the simultaneous operations, including showing a peak impossible to achieve with the number of CUDA cores on the die. What they do show is that 16-bit-result tensor core operations only reach about 90% of their peak theoretical result and that 32-bit-result operations only seem to achieve about 70%. That means that the circuitry is not able to achieve its peak theoretical result like clockwork, unlike the basic FMA operations as shown in the two graphs (the ones with red or blue lines) below the tensor core graph. That is interesting, because it shows that the tensor core operation is some complex operation that doesn't always go through as intended. I'd argue that if NVIDIA created a special block of execution units to run the operation, that would be less likely to happen. Incidentally, I have a vague notion from reading that the Turing tensor cores achieve closer to their peak theoretical performance when performing their intended operation. Additionally, the fact that the 16-bit-result gets closer to its peak performance than the 32-bit-result is also interesting. It suggests that the 32-bit-result operation is trying to squeeze even more out of re-purposed circuitry than the 16-bit-result, which fits with the complications of 32-bit accumulation I proposed in my previous post. And again, if NVIDIA had free rein to purpose-build a new fixed-function block, it would be expected that they would achieve full peak throughput under ideal conditions for its intended operation (the matrix multiply and accumulate). It really suggests a clever re-purposing to me.
In any case, if you can be more specific as to where and how in the paper they prove your conclusions, it would be helpful, because I was unable to see it on my own.
trinibwoy - Tuesday, February 4, 2020 - link
Peak fp16 tensor flops are 4x the peak regular fp16 flops on Turing.
It is not possible to achieve the tensor peak using regular CUDA cores. Therefore the tensor cores must be separate hardware.
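A quick back-of-the-envelope check of that ratio, using the RTX 2060 figures quoted earlier in this thread (1920 CUDA cores, 240 tensor cores) together with the commonly cited per-clock rates (2 FP16 FMA per CUDA core, 64 FP16 FMA per tensor core). Those per-clock rates are assumptions here; if they hold, the tensor path works out to 4x the regular FP16 path and 8x the FP32 path, which is the point being made above.

```python
# Back-of-the-envelope peak-throughput comparison for the RTX 2060 figures
# cited in this thread. The per-clock rates are assumptions, not measurements.

CUDA_CORES = 1920
TENSOR_CORES = 240

FP32_FMA_PER_CUDA_CORE = 1      # one FP32 FMA per CUDA core per clock
FP16_FMA_PER_CUDA_CORE = 2      # FP16 at twice the FP32 rate on the shaders
FP16_FMA_PER_TENSOR_CORE = 64   # per NVIDIA's description of a tensor core

# One FMA counts as 2 floating-point operations (multiply + add).
fp32_flops_per_clk = CUDA_CORES * FP32_FMA_PER_CUDA_CORE * 2        # 3,840
fp16_flops_per_clk = CUDA_CORES * FP16_FMA_PER_CUDA_CORE * 2        # 7,680
tensor_flops_per_clk = TENSOR_CORES * FP16_FMA_PER_TENSOR_CORE * 2  # 30,720

print(tensor_flops_per_clk / fp16_flops_per_clk)  # 4.0 -> 4x regular FP16
print(tensor_flops_per_clk / fp32_flops_per_clk)  # 8.0 -> 8x FP32
```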
SaberKOG91 - Thursday, February 6, 2020 - link
Not necessarily. It just means that the math being done in the hardware uses some trick to perform multiple equivalent operations in a single cycle. This isn't really any different than an FMA instruction resulting in two operations per cycle. It's far more likely that each CUDA core has a little bit of extra area allocated to additional optimized instructions which, because of datapath limitations, aren't usable in normal FP16 operation, but can be used in optimal circumstances to perform 8 multiplies and 8 additions in a single cycle. This is easily explained by pipelining. Turing takes 4 cycles to complete a single warp. Each CUDA core can do 1 FP32 FMA in this time, but with pipelining the average cost goes down to 1 cycle. If Nvidia were able to cleverly use each pipeline stage to do an FP16 FMA, each CUDA core could execute 16 FMA operations per cycle and a total of 64 FMA operations by the end of the pipeline. This would mean an average of 64 FMA operations per cycle, which is exactly as advertised. The same throughput is not achievable with normal FP16 ops because the bandwidth requirements are too high. This technique relies on operands moving through the pipeline and being reused each cycle.
p1esk - Friday, January 31, 2020 - link
A few thoughts:
1. Nvidia needs to make a specialized DL chip (get rid of FP64 cores) to compete with TPUs.
2. We are talking about Tesla cards, which don't have output ports, so talking about HDMI 2.1 or DP 2.0 does not make sense.
3. PCIe 4.0 is good but it's not good enough. What we need is an ability to link more than 2 cards with NVlink using bridges.
4. Put more memory on these cards. TPUs let you use an insane amount of memory to train large models (like GPT-2). Right now I'm planning to build a quad Quadro 8000 setup (instead of Titan RTX, because I need the memory).
Yojimbo - Saturday, February 1, 2020 - link
NVIDIA have specialized DL chips. They don't feel they need to commercialize one at the moment. Bill Dally, NVIDIA's chief scientist, claims they can come out with one or add it to one of their products whenever they want to.
Santoval - Saturday, February 1, 2020 - link
Internal testing of unreleased products and empty claims of the "We can do xx as well!" kind are irrelevant at best, meaningless at worst. All companies have R&D labs, the point is what they choose to do with that research, what to release. As long as Item X remains in the R&D wing of a company, it effectively does not exist outside of it.
Yojimbo - Saturday, February 1, 2020 - link
Who said it's untested? Why make such an assumption? It's been detailed in Research Chip 18.
Yojimbo - Saturday, February 1, 2020 - link
I should have said: who said it's internal? And I see no reason for Dally to claim it can be used as a drop-in replacement for the NVDLA, which he did in an interview with The Next Platform in 2019, when it isn't true. He's not in the part of the company that goes around proselytizing a good deal. If he were to brag about something technical and it weren't true, I think it would be looked upon poorly in his circles.
Yojimbo - Saturday, February 1, 2020 - link
Ampere will see large changes in its design from Turing. Probably not as large as GCN4 to RDNA, but larger than Volta to Turing. Something along the lines of Pascal to Turing, to just throw something out there. NVIDIA doesn't in general do die shrinks of old architectures, and every time NVIDIA comes out with an architecture on a new node people seem to claim it is going to be a die shrink of the old one. As far as RT cores go, it's unlikely the part mentioned in this story, i.e. the one going in the supercomputer, will have any of them. Somehow NVIDIA will get 70-75% faster performance over Volta, according to the Indiana University supercomputer guy, and it's not going to come entirely from a die shrink from 12nm to 7nm.
yeeeeman - Saturday, February 1, 2020 - link
I agree. The gap between Turing and Ampere is a lot longer than would be required for a die shrink. They most certainly have changed stuff internally, upgraded RT cores, etc.
jabbadap - Monday, February 3, 2020 - link
Yeah, it will need more cores. A 70% uplift of FP64 crunch from even the lowest PCIe V100 is 11.9 TFLOPS. A full V100 chip would need to clock at 11,900 GFLOPS / (64 x 84) ≈ 2.21 GHz. While such clocks aren't out of the question for a Tesla card, it's just unlikely.
You can't really compare transistor densities between IHVs. But if we take the Vega 20 chip, which is 331 mm² and has 13,230 million transistors, that would put the V100's size at 7nm around 331 mm² x 21,100/13,230 ≈ 528 mm². So there could be a bit more room for more CUDA cores, as I don't think they will make chips much over 600 mm² on the 7nm EUV process.
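For anyone who wants to follow the arithmetic, the same estimate is sketched out below. The inputs (7 TFLOPS FP64 for the lowest PCIe V100, 84 SMs at 64 FP64 FLOPS per SM per clock, and the Vega 20 / GV100 transistor counts) are the figures used in the comment above, so the outputs are only as good as those assumptions.

```python
# Re-running the back-of-the-envelope numbers from the comment above.
# Inputs are the figures cited there, so treat the outputs as rough estimates.

# Clock needed for a full GV100 to deliver a 70% FP64 uplift.
base_fp64_tflops = 7.0                        # lowest PCIe V100, FP64
target_gflops = base_fp64_tflops * 1.7 * 1000 # 11,900 GFLOPS
sms = 84
fp64_flops_per_sm_per_clk = 64                # 32 FP64 FMA units x 2 ops
required_clock_ghz = target_gflops / (sms * fp64_flops_per_sm_per_clk)
print(round(required_clock_ghz, 2))           # ~2.21 GHz

# Naive 7nm die-size scaling using Vega 20's transistor density.
vega20_area_mm2 = 331.0
vega20_transistors_m = 13_230                 # millions
gv100_transistors_m = 21_100                  # millions
scaled_area_mm2 = vega20_area_mm2 * gv100_transistors_m / vega20_transistors_m
print(round(scaled_area_mm2))                 # ~528 mm²
```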
Spunjji - Monday, February 3, 2020 - link
I thought that a die shrink of an older architecture was exactly how Nvidia went about transitioning to new nodes after the disaster that was NV30? They only seem to have stopped that approach fairly recently, with Pascal, though my understanding is that was still less of a change from Maxwell than Turing is from Pascal.
Unless I'm misreading you and you're simply pointing out that they never release an entirely new range that is *just* a die shrink, which is fair.
Agreed that RT cores are unlikely on a supercomputer chip, though. Perhaps this chip will bear a similar relationship to the Ampere consumer cards that Volta does to consumer Turing.
Yojimbo - Monday, February 3, 2020 - link
The change NVIDIA made was to not aggressively pursue new process nodes. They generally wait longer than AMD to introduce products on a new node. But they do not take existing architectures and shrink them to a new node. Pascal was a significant change from Maxwell, Kepler was a significant change from Fermi. Before that we are getting into some ancient history.
I'm not sure what you mean by distinguishing "just" a die shrink. That's what people necessarily mean when they say "the new architecture is not really new, it is a die shrink of the old one". In any case, I don't think there's any reason to believe that NVIDIA holds its big changes back from the times when they change nodes. Kepler was a massive change from Fermi and was on a new node. Judging by the odd release cadence of the architectures and the rejiggering NVIDIA did at the time (Maxwell 1 and Maxwell 2, adding Pascal into the roadmap and shifting features around), I think Maxwell was originally planned to be on a new node, but that node ended up not working out because planar FETs hit a wall.
jabbadap - Tuesday, February 4, 2020 - link
Well, yeah, Maxwell should have been on TSMC 20nm, but that node failed miserably. So Maxwell was stripped of FP64 compute and released on the old 28nm node. Pascal GP100 was then what big Maxwell should have been, so it was more like an evolution of Maxwell rather than a whole new arch. So I would not call it very significant; i.e. Kepler to Maxwell is a much larger change, as is Pascal to Turing.
Well now, as Turing is more like an evolution of Volta, is it really time for a whole new architecture yet? Or will Nvidia do another evolution with Ampere and die shrink the Volta/Turing shaders to 7nm, amplify RT (maybe smaller SMs, i.e. 32 CC instead of 64 CC, thus more RT cores), modify the Tensor cores (BFloat16), and call it a day?
Smell This - Friday, January 31, 2020 - link
It says 'Ampere' at the link but it's a bit fuzzy.
**The original plan was to outfit the system with Nvidia V100 GPUs, which would have brought its peak performance to around 5.9 petaflops.**
It goes on to state that the original 672 dual-socket nodes -- “Rome” Epyc 7742 processors from AMD -- will bring "... additional nodes online" this summer, and are expected to deliver close to 8 petaflops.
**The newer silicon is expected to deliver 70 percent to 75 percent more performance than that of the current generation** but **ended up buying a smaller number of GPUs**
??
Yojimbo - Saturday, February 1, 2020 - link
What's fuzzy? Can you provide some commentary to let people know what you are talking about?
skavi - Friday, January 31, 2020 - link
This one's gotta be a node shrink, right?
dwade123 - Friday, January 31, 2020 - link
Big Dead Navi will compete against RTX 3060 LOOOL
Spunjji - Monday, February 3, 2020 - link
If the 3060 performs like a 2070 Super then no, it won't - Big Navi will probably compete with either the 3070 or 3080 (if it ever comes out).
zodiacfml - Saturday, February 1, 2020 - link
Meh. This sounds like eight months or more for consumer 7nm Nvidia cards.
UltraWide - Saturday, February 1, 2020 - link
There is no real rush from Nvidia's point of view. There is 0 competition at the high end.
p1esk - Saturday, February 1, 2020 - link
TPUs
obi210 - Saturday, February 1, 2020 - link
I wonder how much longer Kepler will be receiving driver updates; I am still running a GTX 650. The K40 is still in a few TOP500 supercomputers, and the first-generation Titan is GK110.
The metaphor you want is "buttoned up", which means tight-lipped.
"Buttoned down" is a confusion of battened down, which is what is done to the hatches on a ship that is entering rough waters so that nothing comes loose and water does not enter into compartments.
Then there are button-up shirts, which are shirts with buttons up the front for closing it.
Button-down shirts are specifically button-up shirts with collars that button down onto the shirt.
</the more you know, grammar> :)