GF100’s Gaming Architecture
Although it feels like ages ago, NVIDIA announced the Fermi architecture back in September of 2009, focusing on the compute abilities of the GPU that would be GF100. Today's announcement is about filling in the blanks – where does the graphics hardware fit into the design that NVIDIA revealed earlier?
As a quick refresher, GF100 is composed of 512 SPs, which NVIDIA now calls CUDA cores. The CUDA cores are organized into blocks of 32, each of which is a Streaming Multiprocessor (SM); 4 SMs in turn make up a Graphics Processing Cluster (GPC), and there are 4 GPCs in a single GF100 chip. 32 x 4 x 4 = 512 CUDA cores.
| | GF100 | GT200x2 (GTX 295) | GT200 (GTX 285) | G92 (9800 GTX+) |
| --- | --- | --- | --- | --- |
| Stream Processors | 512 | 2 x 240 | 240 | 128 |
| Texture Address / Filtering | 64 / 256 | 2 x 80 / 80 | 80 / 80 | 64 / 64 |
| ROPs | 48 | 2 x 28 | 32 | 16 |
| Core Clock | ? | 576MHz | 648MHz | 738MHz |
| Shader Clock | ? | 1242MHz | 1476MHz | 1836MHz |
| Memory Clock | ? GDDR5 | 999MHz (1998MHz data rate) GDDR3 | 1242MHz (2484MHz data rate) GDDR3 | 1100MHz (2200MHz data rate) GDDR3 |
| Memory Bus Width | 384-bit | 2 x 448-bit | 512-bit | 256-bit |
| Frame Buffer | ? | 2 x 896MB | 1GB | 512MB |
| Transistor Count | 3B | 2 x 1.4B | 1.4B | 754M |
| Manufacturing Process | TSMC 40nm | TSMC 55nm | TSMC 55nm | TSMC 55nm |
| Price Point | $? | $500 | $400 | $150 - 200 |
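If you want to see how those unit counts fall out of the hierarchy, here is a quick back-of-the-envelope sketch of our own (illustrative Python, not anything from NVIDIA) that reproduces the totals above:

```python
# Illustrative arithmetic only - not NVIDIA code. Unit counts are taken from
# the GF100 description above; no clocks are assumed since none are announced.
CORES_PER_SM = 32        # CUDA cores per Streaming Multiprocessor
SMS_PER_GPC = 4          # SMs per Graphics Processing Cluster
GPCS_PER_CHIP = 4        # GPCs in a full GF100

sms_total = SMS_PER_GPC * GPCS_PER_CHIP          # 16 SMs
cuda_cores_total = CORES_PER_SM * sms_total      # 32 x 4 x 4 = 512

print(f"SMs: {sms_total}, CUDA cores: {cuda_cores_total}")
```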
When NVIDIA first unveiled GF100, we did not know anything about the ROPs, texture units, or any of the other fixed-function graphics units that are customary in a GPU. Today we have that information, and we can discuss where it all goes.
While GF100 resembles GT200 in a number of ways as a compute GPU, as a gaming GPU it's very close to being a complete departure from GT200. The big change here is that the single block of fixed-function hardware as we know it is gone. Virtually every piece of that hardware has been split up and moved down a level, and can now be found as part of a GPC or an SM. NVIDIA's GF100 block diagram does a good job of showing this.
What’s left of the fixed function hardware is the GigaThread Engine, which is what NVIDIA is calling their scheduler. To understand where our fixed function hardware went, we must meet the newest execution units: The PolyMorph Engine and the Raster Engine.
We'll start with the Raster Engine. The Raster Engine is the combination of all the raster-related hardware that we would have previously found in the fixed-function pipeline. It does edge/triangle setup, rasterization, and z-culling in a pipelined manner, and each GPC has its own Raster Engine. NVIDIA has very little to say about the Raster Engine, and functionally little has changed from GT200 besides the fact that there are now 4 of them (one for each GPC) as opposed to 1 larger block. Each rasterizer can do 8 pixels per clock, for a total of 32 pixels per clock over the entirety of GF100.
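To put a rough number on that, here is a simple sketch of the rasterization math; the clock speed is purely a placeholder of our own, since NVIDIA hasn't announced final GF100 clocks:

```python
# Illustrative sketch of GF100's rasterization throughput from the figures
# above. The clock is a hypothetical placeholder, not an announced spec.
RASTER_ENGINES = 4                  # one per GPC
PIXELS_PER_CLOCK_PER_ENGINE = 8

pixels_per_clock = RASTER_ENGINES * PIXELS_PER_CLOCK_PER_ENGINE   # 32

assumed_raster_clock_mhz = 650      # placeholder only
gpixels_per_sec = pixels_per_clock * assumed_raster_clock_mhz * 1e6 / 1e9
print(f"{pixels_per_clock} pixels/clock -> {gpixels_per_sec:.1f} Gpixels/s "
      f"at an assumed {assumed_raster_clock_mhz}MHz raster clock")
```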
The much more important of the two new execution units is the PolyMorph Engine, which is what NVIDIA is calling the execution unit that handles geometry for GF100. The PolyMorph Engine is responsible for handling Vertex Fetch, Tessellation, Viewport Transform, Attribute Setup, and Stream Output. It’s here where the Tessellator is located, one of the biggest changes that DX11 is bringing to GPU design. Unlike the Raster Engine, each SM has a PolyMorph Engine, giving the GF100 16 PolyMorph Engines altogether.
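As a way of visualizing what each PolyMorph Engine is responsible for, here is a small illustrative sketch (ours, not NVIDIA's) laying out the five stages per engine and the engine count:

```python
# Illustrative sketch (ours, not NVIDIA's) of the geometry stages each
# PolyMorph Engine handles, and how many engines a full GF100 carries.
POLYMORPH_STAGES = [
    "Vertex Fetch",
    "Tessellation",
    "Viewport Transform",
    "Attribute Setup",
    "Stream Output",
]
SMS = 16                          # one PolyMorph Engine per SM
POLYMORPH_ENGINES = SMS

for stage in POLYMORPH_STAGES:
    print(f"{POLYMORPH_ENGINES} engines each run the {stage} stage")
```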
While the PolyMorph Engine may sound simple in its description, don't let that fool you. NVIDIA didn't just move their geometry hardware to a different place, clone it 15 times, and call it a day. This was previously fixed-function hardware where a single unit sat in a pipeline and did its share of the work. By splitting up the fixed-function pipeline like this, NVIDIA in actuality created a lot of work for themselves. Why? Out-of-order execution.
Out-of-order (OoO) execution is something we usually reserve for CPUs, where high-end designs execute instructions out of program order to extract more performance through instruction-level parallelism. OoO is very hard to accomplish, because you can only execute certain instructions ahead of others while maintaining the correct result for your data. Execute an add instruction that relies on a previous operation before that operation is done, and you have problems. GF100 isn't a full OoO design, so we're not going to cover OoO in depth here, but if you'd like to know more please see this article.
At any rate, internally each PolyMorph Engine is still a simple in-order design. NVIDIA hasn't gone so far as to make a PolyMorph Engine an OoO design – but because there are 16 of them where there used to be just 1, ordering hazards can occur just as they would in an OoO-executing CPU. NVIDIA now has to keep track of what each PolyMorph Engine is doing with respect to the other 15, and put the brakes on any of them that get too far ahead in order to maintain the integrity of results.
To resolve these hazards, GF100 has a private communication channel just for the PolyMorph Engines that allows them to stay together on a task in spite of being spread apart. Making a design like this work correctly is an immense amount of effort, and NVIDIA's engineers are quite proud of it. They have taken the monolithic design of prior GPUs' geometry units and made it parallel, and we can't overstate how much of an engineering challenge that is.
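NVIDIA hasn't disclosed how that synchronization actually works, but as a loose analogy, the sketch below shows the kind of in-order commit discipline 16 parallel units would need: engines may finish their batches at different times, but results have to be released in the original submission order. This is an illustration of the ordering problem, not NVIDIA's mechanism.

```python
import heapq
import random

# Loose analogy only, not NVIDIA's design: parallel geometry engines finish
# their batches in an unpredictable order, but downstream stages must see
# results in the original submission order - the hazard described above.
batches = list(range(32))          # submission order: 0, 1, 2, ...
finish_order = batches[:]
random.shuffle(finish_order)       # engines complete out of order

pending = []                       # min-heap of finished-but-uncommitted batches
next_to_commit = 0
committed = []

for batch_id in finish_order:
    heapq.heappush(pending, batch_id)
    # Release every batch that is now contiguous with the last committed one;
    # any engine that ran too far ahead simply waits in 'pending'.
    while pending and pending[0] == next_to_commit:
        committed.append(heapq.heappop(pending))
        next_to_commit += 1

assert committed == batches        # output order matches submission order
```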
However all of this work came at a cost, and not just the significant engineering resources NVIDIA threw at GF100. The other cost was time – we believe that the PolyMorph Engine is the single biggest reason that GF100 didn’t make it out last year. It’s the single biggest redesign of any component in GF100, and is something that NVIDIA had to start virtually from scratch on. When NVIDIA told us that designing a big GPU is hard, this is what they had in mind.
Now why did NVIDIA put themselves through all of this? Because in their eyes, they had to: a fixed-function pipeline was a poor choice given the geometric complexity that a tessellator would create, and hence the entire pipeline needed to be rebalanced. By moving to the parallel design of the PolyMorph Engine, NVIDIA's geometry hardware is no longer bound by the limits of a pipelined fixed-function design (such as a bottleneck in one stage of the pipeline), and for better or for worse, they can scale their geometry and raster abilities with the size of the chip. A smaller GF100 derivative will not have as many PolyMorph or Raster Engines as GF100, and as a result won't offer the same level of geometry performance; G92 derivatives and AMD's designs, by contrast, maintain the same fixed-function pipeline across all chips, always offering the same level of geometry performance.
Speaking of performance, we'll get to that in a bit, but for now we'll finish our discussion of GF100's graphics units. Each SM has 4 texture units, which have changed somewhat from GT200. In GT200, the texture units were at the TPC level; here they are per SM. The texture units are also affected by the cache reorganization of GF100: for L1 they still have the same 12KB of texture cache per set of 4 texture units, while they now share GF100's 768KB L2 cache, which is used by all execution units and not just the texture units. Each unit can compute 1 texture address and fetch 4 texture samples per clock, a significantly higher texture fetch rate than on GT200. Furthermore, in conjunction with DX11, NVIDIA's texture units now support DX11's new compressed texture formats. Finally, the texture units are now tied to the shader clock and not the core clock; they run at 1/2 the shader clock speed of GF100.
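To see what those figures add up to, here is a rough texture-throughput calculation; the shader clock used below is a placeholder of our own, since final clocks are still unannounced:

```python
# Rough texture-throughput arithmetic from the figures above. GF100's final
# clocks are unannounced, so the shader clock here is a placeholder only.
SMS = 16
TEX_UNITS_PER_SM = 4
ADDRESSES_PER_UNIT_PER_CLOCK = 1
SAMPLES_PER_UNIT_PER_CLOCK = 4

tex_units = SMS * TEX_UNITS_PER_SM                   # 64 texture units
assumed_shader_clock_mhz = 1400                      # placeholder, not a spec
tex_clock_hz = (assumed_shader_clock_mhz / 2) * 1e6  # texture units run at 1/2 shader clock

addresses_per_sec = tex_units * ADDRESSES_PER_UNIT_PER_CLOCK * tex_clock_hz
samples_per_sec = tex_units * SAMPLES_PER_UNIT_PER_CLOCK * tex_clock_hz
print(f"{addresses_per_sec / 1e9:.1f} Gtexels/s addressed, "
      f"{samples_per_sec / 1e9:.1f} Gsamples/s fetched")
```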
Last but not least, this brings us to the ROPs. The ROPs have been reorganized: there are now 48 of them in 6 partitions of 8, with a 64-bit memory channel serving each partition. The ROPs now share the L2 cache with the rest of GF100, whereas under GT200 they had their own L2 cache. Each ROP can do 1 regular 32-bit pixel per clock, 1 FP16 pixel over 2 clocks, or 1 FP32 pixel over 4 clocks, giving GF100 the ability to retire 48 regular pixels per clock. The ROPs are clocked together with the L2 cache.
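Here is the same back-of-the-envelope treatment for the ROPs, based solely on the figures above (the ROP/L2 clock is still unknown, so we stick to per-clock rates):

```python
# ROP arithmetic from the figures above; the ROP/L2 clock is unannounced,
# so only per-clock rates are shown.
ROPS = 48
PARTITIONS = 6
ROPS_PER_PARTITION = 8
MEMORY_CHANNEL_BITS = 64

assert ROPS == PARTITIONS * ROPS_PER_PARTITION
bus_width_bits = PARTITIONS * MEMORY_CHANNEL_BITS     # 384-bit, matching the spec table

clocks_per_pixel = {"32-bit": 1, "FP16": 2, "FP32": 4}
for fmt, clocks in clocks_per_pixel.items():
    print(f"{fmt}: {ROPS // clocks} pixels retired per ROP clock")

print(f"Memory bus: {bus_width_bits}-bit")
```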
That leaves us on a final note: clocks. The core clock has been virtually done away with on GF100, as almost every unit now operates at the shader clock or a fraction of it. Only the ROPs and L2 cache operate on a different clock, which is best described as what's left of the core clock. The shader clock now drives the majority of the chip, including the shaders, the texture units, and the new PolyMorph and Raster Engines. Specifically, the texture units, PolyMorph Engines, and Raster Engines all run at 1/2 the shader clock (which NVIDIA is tentatively calling the "GPC Clock"), while the L1 cache and the shaders themselves run at the full shader clock. Don't be surprised if GF100 overclocking is different from GT200 overclocking as a result.
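To summarize the clock domains as we understand them, here is a small sketch; both base clocks below are hypothetical placeholders, as NVIDIA has not announced final frequencies:

```python
# Sketch of GF100's clock domains as described above. The ratios come from
# the article; both base clocks below are hypothetical placeholders.
assumed_shader_clock_mhz = 1400     # hypothetical
assumed_rop_l2_clock_mhz = 600      # hypothetical "what's left of the core clock"

clock_domains_mhz = {
    "CUDA cores (shaders)": assumed_shader_clock_mhz,       # full shader clock
    "L1 cache":             assumed_shader_clock_mhz,       # full shader clock
    "Texture units":        assumed_shader_clock_mhz / 2,   # 1/2 shader ("GPC") clock
    "PolyMorph Engines":    assumed_shader_clock_mhz / 2,
    "Raster Engines":       assumed_shader_clock_mhz / 2,
    "ROPs / L2 cache":      assumed_rop_l2_clock_mhz,       # separate clock domain
}

for unit, mhz in clock_domains_mhz.items():
    print(f"{unit:22} {mhz:.0f} MHz")
```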
115 Comments
FlyTexas - Monday, January 18, 2010 - link
I have a feeling that nVidia is taking the long road here... The past 6 months have been painful for nVidia, however I think they are looking way ahead. At its core, the 5000 series from AMD is really just a supersized 4000 series. Not a bad thing, but nothing new either (DX11 is nice, but that'll be awhile, and multiple monitors are still rare).
Games have all looked the same for years now. CPU and GPU power have gone WAY up in the past 5 years, but too much is still developed for DX9 (X360/PS3 partly to blame, as is Vista's poor adoption), and I suspect that even the 5000 series is really still designed around DX9 and games meant for it with a few "enhancements".
This new chip seems designed for DX11 and much higher detailed graphics. Polygon counts can go up with this, the number of new details can really shine, but only once games are designed from scratch for it. From that point, the 6 month wait isn't a big deal, it'll be another few years before games are really designed from scratch for DX11 ONLY. Otherwise you have DX9 games with a few "enhancements" that don't add to gameplay.
It seems like we are really skipping DX10 here, partly due to Vista's poor adoption, partly due to XP not being able to use DX10. With Windows 7 being a success and DX11 backported to Vista, I think in the next 2-3 years you'll finally see most games come out that really require Vista/7 because they will require DX10/11.
Of course, my 260GTX still runs everything I throw at it, so until games get more complex or something else changes, I see no reason to upgrade. I thought about a 5870 as an upgrade, but why? Everything already runs fast enough, what does it get me other than some headroom? If I was still on a 8800GT, it would make sense, but I'd rather wait for nVidia to launch so the prices come down.
PorscheRacer - Tuesday, January 19, 2010 - link
Well then there's the fact ATI designed their 2000 series (and 3000 and 4000 series) to comply with the full DirectX 10 specification. NVIDIA didn't have the chips required for this spec, and talked Microsoft into castrating DX10 by only adding in a few things. Tessellation was notably left out. ATI was hung out to dry on performance, with features wasted on die. They finally got DX10.1 later on, but the damage was done. Sure people complained about Vista, mostly gamers as games ran slower, but I wonder how those games would have been if DX10 was run at the full spec (which was only marginally below the DX11 of today)?
Scali - Wednesday, January 27, 2010 - link
I think you need to read this, and reconsider your statement: http://scalibq.spaces.live.com/blog/cns!663AD9A4F9CB0661!194.entry
jimhsu - Monday, January 18, 2010 - link
I made this post in another forum, but I think it's relevant here:
---
Yes, I'm beginning to see this [games becoming less GPU limited and more CPU limited] with more mainstream games (to repeat, Crysis is NOT a mainstream game). FLOP-wise, a high-end video card (i.e. a 5970 at 5 TFLOPS) is something like 100 TIMES the performance of a high-end CPU (an i7 at 50 GFLOPS).
In comparison, during the 2004 days, we had GPUs like the 6800 Ultra (54 GFLOPS) and P4's (6 GFLOPS) (historical data here: http://forum.beyond3d.com/showthread.php?t=51677). That's 9X the performance. We've gone from 9X to 100X the performance in a matter of 5 years. No wonder few modern games are actually pushing modern GPUs (requiring people who want to "get the most" out of their high-powered GPUs to go for multiple screens, insane AA/AF, insane detail settings, complex shaders, etc.).
I know this is a horrible comparison, but still - it gives you an idea of the imbalance in performance. This kind of reminds me of the whole hard drive capacity vs. transfer rate argument. Today's 2 TB monsters are actually not much faster than the few GB drives at the turn of the millennium (and even less so latency wise).
Personally, I think the days of being GPU bound (for mainstream discrete GPU computing) closed when Nvidia's 8 series launched (the 8800GTX is perhaps the longest-lived video card ever made), and in general when the industry adopted programmable compute units (aka DirectX 10).
AznBoi36 - Tuesday, January 19, 2010 - link
Actually the Radeon 9700/9800 Pro had a pretty long life too. The 9700 Pro I bought in 2002/2003 lasted me all the way to early 2007, which was when I bought a 8800GTS 640MB. 4 years is pretty good. It could have lasted longer, but then I was itching for a new platform and needed to get a PCI-Express card (the Radeon was AGP).

RJohnson - Monday, January 18, 2010 - link
Sorry, you lost all credibility when you tried to spin this bullsh*#t: "Today's 2 TB monsters are actually not much faster than the few GB drives at the turn of the millennium." Go try and run your new rig off one of those old drives, come back and post your results in 2 hours when your system finally boots.
jimhsu - Monday, January 18, 2010 - link
A fun chart. Note the performance disparity: http://i65.photobucket.com/albums/h204/killer-ra/V...
jimhsu - Monday, January 18, 2010 - link
Disclosure: I'm still on a 8800 GTS 512, and I am under no pressure to upgrade right now. While a 58xx would be nice to have, on a single monitor I really have no need to upgrade. I may look into going i7 though.

dentatus - Monday, January 18, 2010 - link
If something works well for you, then there is no real reason (or need) to upgrade. I still run an 8800 Ultra, and it still runs many games well on a 22-inch monitor. The GT200 was really only a 50% boost over the 8 series on average. For comparison, I bought a second-hand Ultra for $60 and transplanted both of them into an i7-based system, and this really produced a significant boost over a GTX285 in the games I liked; about 25% more performance, roughly equivalent to an HD5850, albeit not always as smooth.
It would be good to upgrade to a single GPU that is more than double the performance of this kind of setup. But a HD5800 series card is not in that league, and it remains to be seen if the GF100 is.
dentatus - Monday, January 18, 2010 - link
I agree this chip does seem designed around new or upcoming features. Many architectural shortcomings of the GT200 chip seem to have been addressed, with work put into getting usable performance out of new API features (like tessellation). Anyway, to be pragmatic about things, NVIDIA's history leaves much to be desired; performance promised and performance delivered is very variable. HardOCP mentioned the 5800 Ultra launch as a con, but there is also the G80 launch on the flip side.
A GPU's theoretical performance and the expectations hanging around it are nothing to make choices by, wait for the real proof. Anyone recall the launch of the 'monstrous' 2900XT? A toothless beast that one.