Intel's 8-core Skulltrail Platform: Close to Perfecting the Niche
by Anand Lal Shimpi on February 4, 2008 5:00 AM EST - Posted in CPUs
Brace Yourself, High Latency Roads Ahead
We tested Skulltrail with only two FB-DIMMs installed, but even in this configuration memory latency was hardly optimal:
CPU | CPU-Z Latency in ns (8192KB, 256-byte stride) |
Intel Core 2 Extreme QX9775 (FBD-DDR2/800) | 79.1 ns |
Intel Core 2 Extreme QX9770 (DDR2/800) | 55.9 ns |
Memory accesses on Skulltrail take almost 42% longer to complete than on our quad-core X38 system. In applications that can't take advantage of 8 cores, this will negatively impact performance. While you shouldn't expect a huge real-world deficit, there are definitely going to be situations where this 8-core behemoth is slower than its quad-core desktop counterpart.
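As a rough sketch of what that per-access penalty means in practice, the extra latency only hurts during the fraction of runtime a workload actually spends stalled on memory. The stall fractions below are illustrative assumptions, not measurements; only the two CPU-Z latencies come from the table above:

```python
fbd_ns, ddr2_ns = 79.1, 55.9  # CPU-Z latencies from the table above
penalty = fbd_ns / ddr2_ns - 1
print(f"per-access latency penalty: {penalty:.0%}")  # ~42%, matching the article

# Back-of-envelope model: if a workload spends some fraction of its
# time stalled on memory, only that fraction pays the extra latency.
def slowdown(stall_fraction):
    """Projected runtime multiplier for a given memory-stalled fraction."""
    return 1 + stall_fraction * penalty

for f in (0.1, 0.25, 0.5):
    print(f"{f:.0%} memory-stalled -> {slowdown(f):.2f}x runtime")
```

Even under these simplified assumptions, a heavily memory-bound task could run over 20% slower, which lines up with what we see in CPU-limited games below.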
Scaling to 8 Cores: Most Benchmarks are Unaffected
Trying to benchmark an 8 core machine, even today, is much like testing some of the first dual-core CPUs: most applications and benchmarks are simply unaffected. We've called Skulltrail a niche platform but what truly makes it one is the fact that most applications, even those that are multithreaded, can't take advantage of 8 cores.
While games today benefit from two cores and to a much lesser degree benefit from four, you can count the number that can even begin to use 8 cores on one hand...if you lived in Springfield and had yellow skin.
The Lost Planet demo is the only game benchmark we found that actually showed a consistent increase in performance when going from 4 to 8 cores. The cave benchmark results speak for themselves:
CPU | Lost Planet Cave Benchmark (FPS) |
Intel Core 2 Extreme QX9775 (4 cores) | 82 |
Dual Intel Core 2 Extreme QX9775 (8 cores) | 113 |
Dual Intel Core 2 Extreme QX9775 @ 4.0GHz (8 cores) | 124 |
At 1600 x 1200 we're looking at a 30% increase in performance when going from 4 to 8 cores; unfortunately, Lost Planet isn't representative of most other games available today. Other titles like Flight Simulator X can actually take advantage of 8 cores, but not all the time and not consistently enough to offer a real-world performance advantage over a quad-core system.
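Treating the cave-benchmark frame rates above as inversely proportional to frame time, you can invert Amdahl's law to estimate how parallel the engine's CPU work must be. This is a back-of-envelope sketch under that assumption, not a measurement:

```python
def parallel_fraction(speedup, n1, n2):
    """Solve Amdahl's law for the parallel fraction p, given the
    observed speedup going from n1 to n2 cores:
        speedup = (1 - p + p/n1) / (1 - p + p/n2)
    """
    s = speedup
    # Rearranging: 1 - p + p/n1 = s * (1 - p + p/n2)
    #              p * (s - s/n2 - 1 + 1/n1) = s - 1
    return (s - 1) / (s - s / n2 - 1 + 1 / n1)

# Lost Planet cave benchmark: 82 fps on 4 cores -> 113 fps on 8 cores
p = parallel_fraction(113 / 82, 4, 8)
print(f"implied parallel fraction: {p:.0%}")
```

The implied figure (roughly 80%+ of the CPU work running in parallel) is exceptional for a 2008 game engine, which is exactly why Lost Planet stands alone here.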
The problem is that because most games can't use the extra cores, the added latency of Skulltrail's FB-DIMMs actually makes the platform slower than a regular quad-core desktop. To show just how bad it can get, take a look at our Supreme Commander benchmark.
At the suggestion of Gas Powered Games, we don't rely on Supreme Commander's built-in performance test. Instead we play back a recording of our own gameplay with game speed set to maximum and record the total simulation time, which makes for a great CPU benchmark. We ran the game at maximum image quality settings but left resolution at 1024 x 768 to focus on CPU performance; the results were a bit startling:
Thanks to the high latency FBD memory subsystem, it takes a 4.0GHz Skulltrail system to offer performance better than a single QX9770 on a standard desktop motherboard. We can't stress enough how much more attractive Skulltrail would have been were it able to use standard DDR2 or DDR3 memory.
Gamers shouldn't be too worried, however: Skulltrail's memory latency issues luckily don't impact GPU-limited scenarios. Take a look at our Oblivion results from earlier for confirmation:
In more CPU-bound scenarios like Supreme Commander you will see a performance penalty, but in GPU-bound scenarios like Oblivion (or Crysis, for example), Skulltrail will perform like a regular quad-core system.
The bottom line? Skulltrail is a system made for game developers, not gamers.
Other benchmarks, even our system-level suite tests like SYSMark 2007, hardly show any performance improvement when going from 4 to 8 cores. We're talking about a less than 5% performance improvement, most of which is erased when you compare to a quad-core desktop platform with standard DDR2 or DDR3 memory.
That being said, there are definitely situations where Skulltrail performance simply can't be matched.
30 Comments
moiettoi - Friday, June 27, 2008 - link
Hi all. This sounds like a great board, and for someone like me who uses 4x22" monitors and does heaps of multitasking it sounds perfect; I would gladly pay the price asked.
BUT why is such a great board slowed right down by not having DDR3 memory sticks? Because from what I've read, at the moment there is not that much difference between running this and what I have now, which is a quad core with DDR3 that runs great, though I do overwork it. So bigger would be better.
You would think (and I'm sure they already know) that it would be common sense to make this board with DDR3; as far as I can see that is its only fault.
We will probably see that board come out soon, or next in line, once they have sold enough of these to satisfy their egos.
Great board, but just not yet. I will be waiting for the next one out, which will have to carry DDR3 if they want to go forward in their technology.
VooDooAddict - Thursday, February 7, 2008 - link
For testers of large distributed systems this is an awesome thing to have sitting on your desk. You can have a small server room running on one of these.
The biggest shortfall I see is cramming enough RAM on it.
iSOBigD - Tuesday, February 5, 2008 - link
I'm actually very disappointed with 3D rendering speed. Going from 1 core to 4 cores takes my rendering performance up by close to 400% (16 seconds down to 4-something seconds, etc.) in Max with any renderer (I've tried Scanline, MentalRay and VRay)... so I'm surprised that going from 4 to 8 gives you only 40-60% more speed. That's pretty pathetic, so I suspect the board is to blame, not the software.
martin4wn - Tuesday, February 5, 2008 - link
Actually 40-60% is not disappointing at all, it's quite impressive. You are running into Amdahl's law: only the parallel part of the app scales. Here's a simple walkthrough. Say the application is 94% parallel code and 6% serial, and the parallel part scales perfectly, doubling in speed with every doubling in core count. Now say the runtime on one core is 16 seconds (your example). Of that, 1 second is serial code and the other 15 seconds is parallel code running serially.
Now running on a 4-core machine, you still have the 1s of serial code, but the parallel part drops to 15/4 = 3.75 seconds. Total runtime is 4.75s, so overall scaling is 3.4x. Now go to 8 cores: total runtime = 1 + 15/8 = 2.88s. That's a speedup of about 65% going from 4 cores to 8 cores, and overall scaling of 5.6x.
So the numbers are actually consistent with what you are seeing. It's a great illustration of Amdahl's law: even an app that is 94% parallel only gains about 65% going from 4 to 8 cores with perfect scaling, and it's really hard to get good scaling at even moderate core counts. Once you get to 16 or more cores, expect scaling to fall off even more dramatically.
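The walkthrough above fits in a few lines of Python; the 16-second, 94%-parallel workload is the commenter's own example, and perfect scaling of the parallel part is the stated assumption:

```python
def runtime(n_cores, total=16.0, parallel_fraction=15.0 / 16.0):
    """Amdahl's law: projected runtime on n_cores, assuming the
    parallel portion of the work scales perfectly with core count."""
    serial = total * (1 - parallel_fraction)      # 1 second in this example
    parallel = total * parallel_fraction          # 15 seconds on one core
    return serial + parallel / n_cores

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores: {runtime(n):5.2f} s  "
          f"(overall speedup {runtime(1) / runtime(n):.2f}x)")
```

Running the loop makes the diminishing returns obvious: the serial second never shrinks, so it dominates as core count grows.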
ChronoReverse - Tuesday, February 5, 2008 - link
This is why I'm quite happy with my quad core. The useful limit on the desktop would probably be a quad core with SMT. After that, faster individual cores will be needed regardless of how parallel our code gets (face it, you're not getting 90%-parallelizable software most of the time, and even then 8 cores over 4 isn't getting more than about a 50% boost in the best case for 90% parallel code).
FullHiSpeed - Tuesday, February 5, 2008 - link
Why the heck does this D5400XS MB support only the QX9775 CPU??? If you need 8 cores you can get a lot more bang for the buck with the quad-core Xeon 5400 series, with only 80 watts TDP each, at up to 3GHz. For a TOTAL of $508 ($254 each quad) you can have 8 cores @ 2GHz. Last month I built a system with a Supermicro X7DWA-N MB ($500), 4GB of DDR2-667 ($220) and a single 2.83GHz Xeon E5440 ($773), which I use to test Gen 2 PCIe dual-channel 8Gb/s Fibre Channel boards, two boards at once.
Starglider - Tuesday, February 5, 2008 - link
Damnit. AMD could've destroyed this if they'd gotten their act together. Tyan makes a 4-socket Opteron board that fits into an E-ATX form factor: http://www.tyan.com/product_board_detail.aspx?pid=...
I was strongly tempted to get one before the whole Barcelona launch farce. If AMD hadn't made such horrible execution blunders and could have devoted the kind of resources Intel had to a project like this, we could have four Barcelonas running at 3 to 3.6 GHz with eight DDR2 slots all on a dedicated channel. Ah well. Guess I'll be waiting for Nehalem.
enigma1997 - Tuesday, February 5, 2008 - link
Note what Francois said in his Feb 04 reply re memory timing: http://blogs.intel.com/technology/2008/01/skulltra... Do you think it would help the latency and make it closer to the DDR2/DDR3 ones? Thanks.
enigma1997 - Tuesday, February 5, 2008 - link
CL3 FB-DIMMs from Kingston would be "insanely fast"?! Have a read of this article: http://www.tgdaily.com/content/view/34636/135/
Visual - Tuesday, February 5, 2008 - link
I must say, I am very disappointed. Not with performance - everything is as expected on that front... I didn't even need to see benchmarks for it.
But prices and availability are hell. AMD giving up on QuadFX is hell. Intel not letting us use DDR2 is hell.
I was really hoping I could get a dual-socket board with a couple (or quad) PCI-express x16 slots and standard ram, coupled with a pair of relatively inexpensive quadcore CPUs. Why is that too much to ask?
The ASUS L1N64-SLI WS board has been available for an eon now, costs less than $300, and has quite a good feature set. Quad-core Opterons for the same socket have also been available for more than a quarter now, some models as cheap as $200-$250.
Unfortunately, for some god-damned reason neither ASUS nor AMD is willing to make this board work with these CPUs. The board works just fine with dual-core Opterons, all the while using standard unbuffered, unregistered DDR2 modules - but not with quad cores? WTF.
And that board is ancient by now. I am quite certain AMD could, if they wanted, have a refresh already - using the newest and coolest chipsets with PCIe 2.0, HT 3.0, independent power planes for each CPU, etc.
Intel could also certainly have a dual-socket board that works with cheap DDR2, has plenty of PCI Express slots, and takes the cheap $300 quad-core Xeons that are already out instead of the $1500 "Extremes".
I feel like the industry is purposely slowing, throttling technological progress. It's like AMD and Intel just don't want to give us the maximum of their real capabilities, because that would devalue their existing products too quickly. They are just standing around idly most of the time, trying to sell out their old tech.
Same as NVIDIA not letting us have SLI on all boards, or ATI not allowing CrossFire on nForce for that matter.
Same as a whole lot of other manufacturers too...
I feel like there is some huge anti-progress conspiracy going on.