I would not say these cpus are for high end market. High end market are huge servers, with as many as 32 sockets, some monster servers even have 64 sockets! These expensive Unix RISC servers or IBM Mainframes, have extremely good RAS. For instance, some Mainframes do every calculation in three cpus, and if one fails it will automatically shut down. Some SPARC cpus can replay instructions if something went wrong. Hotswap cpus, and hotswap RAM. etc etc. These low end Xeon cpus have nothing of that.
PS. Remember that I distinguish between a SMP server (which is a single huge server) which might have 32/64 sockets and are hugely expensive. For instance, the IBM P595 32-socket POWER6 server used for the old TPC-C record, costed $35 million. No typo. One single huge 32 socket server, costed $35 million. Other examples are IBM P795, Oracle M5-32 - both have 32 sockets. Oracle have 32TB RAM which is the largest RAM server on the market. IBM Mainframes also belong to this category.
In contrast to this, every server larger than 32/64 sockets, is a cluster. For instance the SGI Altix or UV2000 servers, which sports up to 262.000 cores and 100s of TB. These are the characteristica of supercomputer clusters. These huge clusters are dirt cheap, and you pay essentially the hardware cost. Buy 100 nodes, and you pay 100 x $one node. Many small computers in a cluster.
Clusters are only used for HPC number crunching. SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems, as SGI explains in this link: http://www.realworldtech.com/sgi-interview/6/ "The success of Altix systems in the high performance computing market are a very positive sign for both Linux and Itanium. Clearly, the popularity of large processor count Altix systems dispels any notions of whether Linux is a scalable OS for scientific applications. Linux is quite popular for HPC and will continue to remain so in the future,...However, scientific applications (HPC) have very different operating characteristics from commercial applications (SMP). Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a SMP workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time."
This is actually accurate as the E5 is Intel’s midrange Xeon series. Intel has the E7 line for those who want more RAS or scalability to 8 sockets. Features like memory hot swap or lock-step mirroring can be found in select high end Xeon systems. If you want ultra high end RAS, you can find it if you need it, and pay the premium price for it.
“In contrast to this, every server larger than 32/64 sockets, is a cluster. For instance the SGI Altix or UV2000 servers, which sports up to 262.000 cores and 100s of TB. These are the characteristica of supercomputer clusters. These huge clusters are dirt cheap, and you pay essentially the hardware cost. Buy 100 nodes, and you pay 100 x $one node.”
Incorrect on several points but they’ve already been pointed out to you. The UV2000 is fully cache coherent (with up to 64 TB of memory), with a global address space that operates as one uniform, logical system, so only a single OS/hypervisor is necessary to boot and run it.
Secondly, the price of the UV2000 does not scale linearly. There are NUMALink switches that bridge the coherency domains that have to be purchased to scale to higher node counts. This is expected of how the architecture scales and is similar to other large scale systems from IBM and Oracle.
“Clusters are only used for HPC number crunching.”
Incorrect. Clustering is standard in what you define as SMP applications (big business ERP). It is utilized to increase RAS and prevent downtime. This is standard procedure in this market.
“SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems,”
Why? As long as the underlying architecture is the same, they can run. You may not get the same RAS or scale as high in a single logical system but they’ll work. Performance is where you’d expect it on these boxes: a dual socket HPC system will perform at roughly one quarter the speed of the same chips occupying an 8 socket system.
As pointed out numerous times before, that link you cite is a decade old. SGI has moved into the SMP space with the Altix UV series. Continuing to use this link as relevant is plain disingenuous and deceptive.
As for an example of a big ERP application running on such an architecture, the US Post Office runs Oracle Data Warehousing software on a UV1000. ( https://www.fbo.gov/index?s=opportunity&mode=f... )
Do you really think that UV (which is the successor to Altix) is that different? Windows is Windows, and it will not magically challenge Unix or OpenVMS, in some iterations later. Windows will not be superior to Unix after some development. You think that HPC- Altix will after some development, be superior to Oracle and IBM's huge investments of decades research in billions of USD? Do you think Oracle and IBM has stopped developing their largest servers?
Altix is only for HPC number crunching, says SGI in my link. Today the UV line of servers, has up to 262.000 cores and 100s of TB of RAM. Whereas the largest Unix and IBM Mainframes have 64 sockets and couple of TB RAM, after decades of research.
In a SMP server, all cpus will have to be connected to each other, for this SGI UV2000 with 32.768 cpus, you would need (n²) 540 million (half a billion) threads connecting each cpu. Do you really think that is feasible? Does it sound reasonable to you? IBM and Oracle and HP has had great problems connecting 32 sockets to each other, just look at the connections on the last picture at the bottom, do you see all connections? Now imagine half a billion of them in a server! http://www.theregister.co.uk/2013/08/28/oracle_spa...
But on the other hand, if you keep the number of connections down by grouping cpus into islands, and then connect the islands to each other, you don't need half a billion. This solution would be feasible. And then you are not in SMP territory anymore: SGI say like this on page 4 about the UV2000 cluster: www.sgi.com/pdfs/4395.pdf "...SMP is based on intra-node communication using memory shared by all cores. A cluster is made up of SMP compute nodes but each node cannot communicate with each other so scaling is limited to a single compute node...."
Dont you think that a 262.000 core server and 100s of TB of RAM sounds more like a cluster, than a single fat SMP server? And why do the UV line of servers focus on OpenMPI accerators? OpenMPI is never used in SMP workloads, only in HPC.
Do you have any benchmarks where one 32.768 cpu SGI UV2000 demolishes 50-100 of the largest Oracle SPARC M6-32 in business systems? And why is the UV2000 much much cheaper than a 16/32 socket Unix server? Why does a single 32 socket Unix server cost $35 million, whereas a very large SGI cluster with 1000 of sockets is very very cheap?
Wow, I think the script you're copy/pasting from needs better revision.
"Do you really think that UV (which is the successor to Altix) is that different?"
Yes. SGI changed the core architecture to add cache coherent links across the entire system. Clusters tend to have an API on top of a networking software stack to abstract the independent systems so they may act as one. The UV line does not need to do this. For one processor to use memory and perform calculations on data residing off-CPU at the other end of the system, a memory read operation is all that is needed on the UV. It is really that simple.
"Windows is Windows, and it will not magically challenge Unix or OpenVMS, in some iterations later."
The UV can run any OS that runs on modern x86 hardware today. Windows, Linux, Solaris (Unix) and perhaps at some point NonStop (HP's mainframe OS http://h17007.www1.hp.com/us/en/enterprise/servers... ). The x86 platform has plenty of choices to choose from.
"You think that HPC- Altix will after some development, be superior to Oracle and IBM's huge investments of decades research in billions of USD? Do you think Oracle and IBM has stopped developing their largest servers?"
What I see SGI offering is another tool alongside IBM and Oracle systems. Also you mention decades of research, then it is also fair to put SGI into that category as that link you love to spam IS A DECADE OLD. Clearly SGI didn't have this technology back in 2004 when that interview was written.
"Today the UV line of servers, has up to 262.000 cores and 100s of TB of RAM. Whereas the largest Unix and IBM Mainframes have 64 sockets and couple of TB RAM, after decades of research."
Actually this is a bit incorrect. IBM can scale to 131,072 cores on POWER7 if the coherency requirement is forgiven. Oh, and this system can run either AIX or Linux when maxed out. Source: http://www.theregister.co.uk/Print/2009/11/27/ibm_...
"In a SMP server, all cpus will have to be connected to each other, for this SGI UV2000 with 32.768 cpus, you would need (n²) 540 million (half a billion) threads connecting each cpu." http://www.theregister.co.uk/2013/08/28/oracle_spa...
Wow, do you not read your own sources? Not only is your math horribly horribly wrong but the correct methodology for calculating the number of links as things scale is found in the link you provided. To quote that link: "The Bixby interconnect does not establish everything-to-everything links at a socket level, so as you build progressively larger machines, it can take multiple hops to get from one node to another in the system. (This is no different than the NUMAlink 6 interconnect from Silicon Graphics, which implements a shared memory space using Xeon E5 chips...)"
The full implication here is that if the UV 2000 is not a socket machine, then neither is Oracle's soon-to-be-released 96 socket device. The topology to scale is the same in both cases per your very own source.
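For anyone who wants to check the link-count arithmetic being argued over, here is a minimal sketch. It assumes a purely hypothetical everything-to-everything (full mesh) topology, which per the quoted article is not what Bixby or NUMAlink 6 actually build; the function name is made up for illustration.

```python
def full_mesh_links(sockets: int) -> int:
    """Point-to-point links needed if every socket had a direct link to every other socket."""
    return sockets * (sockets - 1) // 2


# Socket counts mentioned in this thread; 32,768 is the figure from the post being answered.
for n in (32, 96, 256, 32_768):
    print(f"{n:>6} sockets: {full_mesh_links(n):>12,} direct links")
```

For 32,768 endpoints this gives roughly 537 million, which is n(n-1)/2 rather than n². The point of the quoted article is that neither vendor wires things up this way once machines get large.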
"SGI say like this on page 4 about the UV2000 cluster: www.sgi.com/pdfs/4395.pdf"
Fundamentally false. If you were to actually *read* the source material for that quote, it is not describing the UV2000. Rather, it is speaking generically about the differences between a cluster and a large SMP box on page 4. If you go to page 19, it further describes the UV 2000 as a single system image, unlike a cluster as defined on page 4.
"Dont you think that a 262.000 core server and 100s of TB of RAM sounds more like a cluster, than a single fat SMP server? And why do the UV line of servers focus on OpenMPI accerators? OpenMPI is never used in SMP workloads, only in HPC."
All I'd say about a 262,000 core server is that it wouldn't fit into a single box. Then again IBM, Oracle and HP are spreading their large servers across multiple chassis so this doesn't bother me at all. The important part is how all these boxes are connected. SGI uses NUMAlink6 which provides cache coherency and a global address space for a single system image. OpenMPI can be used inside of a cache coherent NUMA system as it provides a means to guarantee memory locality when data is used for execution. It is a means of increasing efficiency for applications that use it. However, OpenMPI libraries do not need to be installed for software to scale across all 256 sockets on the UV2000. It is purely an option for programmers to take advantage of.
"And why is the UV2000 much much cheaper than a 16/32 socket Unix server? Why does a single 32 socket Unix server cost $35 million, whereas a very large SGI cluster with 1000 of sockets is very very cheap?"
First, to maintain coherency, the UV2000 only scales to 256 sockets/64 TB of memory. Second, the cost of a decked out P795 from IBM in terms of processors (8 sockets, 256 cores) and memory (2 TB) but only basic storage to boot the system is only $6.7 million wholesale. Still expensive but far less than what you're quoting. It'll require some math and reading comprehension to get to that figure but here is the source: http://www-01.ibm.com/common/ssi/ShowDoc.wss?docUR...
I couldn't find pricing for the UV2000 as a complete system but purchasing the Intel processors and memory separately to get to a 256 socket/64 TB system would be just under $2 million. Note that that figure is just processor + memory, no blade chassis, racks or interconnect to glue everything together. That would also be several million. So yes, the UV2000 does come out to be cheaper but not drastically. That IBM pricing document does highlight why their high end systems cost so much, mainly capacity on demand. The p795 is getting a mainframe-like pricing structure where you purchase the hardware and then you have to activate it at an additional cost. Not so on the UV2000.
"High end market are huge servers, with as many as 32 sockets, some monster servers even have 64 sockets!"
Partially true. The entire cabinet might have that many sockets/processors, but on a per-system, per-"box" level, most max out between two and four. You get a few oddballs here and there that would have a daughter board for a true 8-socket system, but those are EXTREMELY rare in actuality. (Tyan, I think, had one for the AMD Opterons, and they said that less than 5% of the orders were for the full-fledged 8-socket systems).
"PS. Remember that I distinguish between a SMP server (which is a single huge server) which might have 32/64 sockets and are hugely expensive. For instance, the IBM P595 32-socket POWER6 server used for the old TPC-C record, costed $35 million. No typo. One single huge 32 socket server, costed $35 million. Other examples are IBM P795, Oracle M5-32 - both have 32 sockets. Oracle have 32TB RAM which is the largest RAM server on the market. IBM Mainframes also belong to this category." Again, only partially true. The costs and stuff are correct, but the assumptions that you're writing about are incorrect. SMP is symmetric multiprocessing. BY DEFINITION, that "involves a multiprocessor computer hardware and software architecture where two or more identical processors connect to a single, shared main memory, have full access to all I/O devices, and are controlled by a single OS instance that treats all processors equally, reserving none for special purposes." (source: wiki) That means that it is a monolithic system, again, of which, few are TRULY such systems. If you've ever ACTUALLY witnessed the startup/bootup sequence of an ACTUAL IBM mainframe, the rest of the "nodes" are actually booted up typically by PXE or something very similar to that, and then the "node" is enumerated into the resource pool. But, for all other intents and purposes, they are semi-independent, standalone systems, because SMP systems do NOT have the capability to pass messages and/or memory calls (reads/writes/requests) without some kind of a transport layer (for example MPI).
Furthermore, the old TPC-C that you mention, they do NOT process as one monolithic sequential series of events in parallel (so think of like how PATA works...), but rather more like a JBOD SATA (i.e. the processing of the next transaction does NOT depend on ALL of the current block of transactions to be completed, UNLESS there is an inherent dependency issue, which I don't think would be very common in TPC-C). Like bank accounts, they're all treated as discrete and separate, independent entities, which means you can send all 150,000 accounts against the 32-socket or 64-socket system and it'll just pick up the next account when the current one is done, regardless.
The other failure in your statement or assumption is that there's something called HA - high availability. That means they can dynamically hotswap an entire node if there's a CPU failure, so that the node can be downed and yanked out for service/repair while another one is hotswapped in. So it will fail over to a spare hotswap node, work on it, and then either fail back to the newly replaced node or rotate the new one into the hotswap failover pool. (There are MANY different ways of doing that and MANY different topologies).
The statement you made about having 32TB of RAM is again, partially true. But NONE of the single OS instances EVER have full control of all 32TB at once, which again, by DEFINITION, means that it is NOT truly SMP. (Course, if you ever get a screenshot which shows that, I'd LOVE to see it. I'd LOVE to get corrected on that.)
"In contrast to this, every server larger than 32/64 sockets, is a cluster." Again, not entirely true. You can actually get 4 socket systems that are comprised of two dual-socket nodes and THAT is enough to meet the requirements of a cluster. Heck, if you pair up two single-socket consumer-grade systems, that TOO is a cluster. That's kinda how Beowulf clusters got started - cuz it was an inexpensive way (compared to the aforementioned RISC UNIX based systems) to gain computing power without having to spend a lot of money.
"These huge clusters are dirt cheap" Sure...if you consider IBM's $100 million contract award "cheap".
"Clusters are only used for HPC number crunching. SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems, as SGI explains in this link:" So there's the two problems with this - 1) it's SGI - so of course they're going to promote what they ARE capable of vs. what they don't WANT to be capable of. 2) Given the SGI-biased statements, this, again, isn't EXACTLY ENTIRELY true either.
But that also depends on the specific implementation of the ERP system given that SAP is NOT the ONLY ERP system that's available out there, but it's probably one of the most popular ones, if not THE most popular one. (There's a whole thing about distributed relational databases so that the database can reside in smaller chunks across multiple nodes, in-memory, which are then accessed via a high speed interconnect like Myrinet or Infiniband or something along those lines.)
Furthermore, the fact that ERP runs across large mainframes (it grows as the needs grow), is an indication of HPC's place in ERP. Alternatively, perhaps rather than using it for the backend, HPC can be used on the front end by supporting many, many, many virtualized front-end clients.
Like I said, most of the numbers that you wrote are true, but the assumptions behind them aren't exactly all entirely true.
"That means that it is a monolithic system, again, of which, few are TRULY such systems. If you've ever ACTUALLY witnessed the startup/bootup sequence of an ACTUAL IBM mainframe, the rest of the "nodes" are actually booted up typically by PXE or something very similar to that, and then the "node" is enumerated into the resource pool. But, for all other intents and purposes, they are semi-independent, standalone systems, because SMP systems do NOT have the capability to pass messages and/or memory calls (reads/writes/requests) without some kind of a transport layer (for example MPI)."
Not exactly. IBM's recent boxes don't boot themselves. Each box has a service processor that initializes the main CPU's and determines if there are any additional boxes connected via external GX links. If it finds external boxes, some negotiation is done to join them into one large coherent system before an attempt to load an OS is made. This is all done in hardware/firmware. Adding/removing these boxes can be done but there are rules to follow to prevent data loss.
It'll be interesting to see what IBM does with their next generation of hardware as the GX bus is disappearing.
"The statement you made about having 32TB of RAM is again, partially true. But NONE of the single OS instances EVER have full control of all 32TB at once, which again, by DEFINITION, means that it is NOT truly SMP. (Course, if you ever get a screenshot which shows that, I'd LOVE to see it. I'd LOVE to get corrected on that.)"
Actually on some of these larger systems, a single OS can see the entire memory pool and span across all sockets. The SGI UV2000 and SPARC M6 are fully cache coherent across a global memory address space.
As for a screenshot, I didn't find one. I did find a video going over some of the UV 2000 features displaying all of this though. It is only a 64 socket, 512 core, 1024 thread, 2 TB of RAM configuration running a single instance of Linux. :) https://www.youtube.com/watch?v=YUmBu6A2ykY
IBM's topology is weird in that while a global memory address space is shared across nodes, it is not cache coherent. IBM's POWER7 and their recent BlueGene systems can be configured like this. I wouldn't call these setups clusters as there is no software overhead to read/write to remote memory addresses but it isn't fully SMP either due to multiple coherency domains.
Ian, the Xeon E3-1220v3 and E3-1225v3 do not have hyperthreading. They're incorrectly listed in the table as 4c/8t. At those prices, if they did have 8t, more people would be buying them! Also, I think that "c3" by the E3-1230 is a typo.
"If the E5-2697 v2 was put in this position, we would have 12 cores at 3.5 GHz, ready to blast through the workload."
I do not think this is possible. I have tried to force all cores to turbo mode with ThrottleStop on the 2697v2 Xeon (ThrottleStop bypasses the BIOS/UEFI and codes the limits directly using MSR registers), but the CPU will just refuse to grant this.
I suppose the desktop CPUs simply have this option unlocked, while 2S/4S Xeons have much stricter operating point limits.
The best I can do with 2697 v2 is to set the BCLK to 105 MHz with Z9PE-D8 WS and get 3.15 GHz maximum all-core turbo. This is as much overclock as the system can take.
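The relationship behind that figure is just base clock times multiplier. A minimal sketch, assuming a stock BCLK of 100 MHz and an all-core turbo multiplier of 30 (the multiplier is inferred here from 3.15 GHz / 105 MHz, it is not stated in the post), with a made-up helper name:

```python
def core_clock_ghz(bclk_mhz: float, multiplier: int) -> float:
    """Core clock in GHz for a given base clock (MHz) and CPU multiplier."""
    return bclk_mhz * multiplier / 1000.0


ALL_CORE_MULTIPLIER = 30  # assumption: inferred from 3.15 GHz at 105 MHz BCLK

for bclk in (100, 105):  # 100 MHz stock is an assumption; 105 MHz is the setting described above
    print(f"BCLK {bclk} MHz x {ALL_CORE_MULTIPLIER} = {core_clock_ghz(bclk, ALL_CORE_MULTIPLIER):.2f} GHz")
```

With the multiplier locked, BCLK is the only lever left, which is why a 5% bump only buys about 150 MHz.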
I've managed to get 110 BCLK on both processors relatively stable (112 BCLK needs a push), but this boosts up from the lower multiplier rather than the high one, and there is still a deficit on the high end. Enthusiasts will always want more, and I'd love the chance to run all the cores at the top turbo mode. Given how this is on the consumer line, it makes me wonder why Intel doesn't allow it here. The downside on the consumer line of allowing this behaviour is every so often there is a motherboard that fails to implement any Turbo Core, which has happened in my testing already.
I suppose some Intel marketing people just wanted to stop a possibility of lower-end Xeon cannibalizing higher-end Xeons by cheap overclocking. There are markets even in server business that are OK with overclocking (low latency trading, for example).
It is interesting that you got it to 110 MHz BCLK. Did you use a server/workstation board or a HEDT (X79) board?
This was in the MSI X79A-GD45 Plus, an X79 board. There are some server market areas that do sell pre-overclocked systems in this way, while still using Xeons, if I remember correctly.
I think HEDT boards are better when it comes to overclocking, probably due to higher component tolerances.
Z9PE-D8 WS is not that good, but then again, it is a 2S board and apart from some Supermicro products (the "hyperspeed" series), the only 2S board that allows at least some overclocking of DDR3 and CPU.
I am perplexed by the gaming benchmarks... Any particular reason why the 4770K and A10-7850K don't show up on all of the single and double GPU benchmarks? Especially considering that you have some tri-fire benches of the a10-7850K...
The AMD does not allow dual NVIDIA cards because the platform does not allow SLI. I need to re-run the 4770K in a PLX8747 enabled motherboard to get 3x SLI results across the board (you cannot get 3x SLI without a PLX chip), and I have not had a chance to run either CPU on my BF4 benchmark which has just been finalised for this review. The A10-7850K and i7-4770K numbers were taken from the Kaveri review and some internal testing - now my 2014 benchmarks are finalised I can run it on more platforms as the year goes on.
Why aren't Xeon E3s the recommended CPUs for enthusiast desktops? They make more sense than the Core i5 and i7 which come with integrated graphics that never gets used.
Yeah, the Xeon E3s are great processors for enthusiasts/consumers.
I was tossing up between an E3-1230 v3 (2MB more cache) or an i5 4670K (IGP, unlocked multiplier) in my latest build, and ended up with the 4670K because it had a $30 instant rebate, and I wanted quicksync (I know Ganesh says it has inferior transcode quality to x264, but I just need something quick and dirty for my phone, and honestly, I'm seeing more artefacts in the x264 samples he posted than in the QS samples). But it wasn't an easy decision.
"I'd take a CPU with IGP any day, unless it costs significantly more. Reason: increased resale value, even if I don't use the IGP myself."
I would not.
Reason is even when you use discrete, it uses additional power. IIRC for Sandy Bridge, a 3.6 GHz Sandy without the IGP switched on uses as much as a 3.4 GHz with the IGP on. Seeing that the graphics have become a bigger emphasis with Haswell, I expect that the IGP will be using a bigger percentage of the total power consumption with time.
okay...please tell me im not the only one who thinks this article is hilarious! look...if you are going to review two very expensive cpu's, don't test these against normal cpu's. It is sheer idiocy to compare these to "desktop cpu's" because they are not....at all. Why can't anandtech do reviews on other affordable xeons and compare them all together? Also funny how they included AMD. That's cute
you're the only one who thinks this article is hilarious.
I thought/wished about this dataset previously. What if we simply go to XEONs and trade off clock for cores. i.e. the next logical progression from i7-4960X price bracket. obviously with no control over turbo bins & OC capability, the single thread is hosed. But wondered how many cores near-pro programs can really utilize.
I am just glad we get to see some comparisons. getting more XEON dataset would be interesting but I think it's beyond most of Anandtech reader usage pattern.
You do indeed have a point about consumer cpu's and how they can overclock and are nifty to have for changing multiple options. but...the xeon market for average consumers needs to either be booming or a well kept secret. the 1230v3 is a fantastic cpu and can (for the most part) equal a 4770 for 240 dollars. I know it can not be overclocked, but people (about 70% of the market) do not give a care. But anandtech viewers are overly intelligent aren't we?
Had they not included these regular CPUs people would (rightfully) complain "what do these numbers mean without familiar comparison points?" Comparing 2 products to work out the real-world differences for some specific application is completely different from claiming they'd be equal.
this bench also shows that the haswell had almost no CPU related performance benefits over IVB (if not slowed down performance) looking at 3770k vs 4770k and that haswell ups the gpu performance only.
its a shame they didn't do a UHD x264 encode here as that would have shown a haswell AVX2 improvement (something like 90% over AVX), and why people will have to wait for the xeons to catch up to at least AVX2 if not AVX3.1
There is no "90% speedup over AVX" between HSW and IVB architectures.
AVX (v1) is floating point only and thus was useless for x264. For floating point workloads you would be very lucky to get 10% improvement by jumping to AVX2. The only difference between AVX and AVX2 for floating point is the FMA instruction and gather, but gather is done in microcode for Haswell, so it is not actually much faster than manually gathering data.
Now, x264 AVX2 is a big improvement because it is an integer workload, and with AVX (v1) you could not do that. So x264 is jumping from SSE4.x to AVX2, which is a huge jump and it allows much more efficient processing.
For integer workloads that can be optimized so that you load and process eight 32-bit values at once, AVX2 Xeon EPs/EXs will be a big thing. Unfortunately, this is not so easy to do for general-purpose algorithms. The x264 team did a great job, but I doubt you will be using a 14 core single Haswell EP (or 28 core dual CPU) for H.264 transcoding. This job can probably be done much more efficiently with dedicated accelerators.
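To put numbers on the "eight 32-bit values at once" part, here is a small sketch of the lane arithmetic. It only divides register width by element width (128-bit SSE registers, 256-bit AVX2 registers); the helper name is made up, and it says nothing about real-world throughput.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one SIMD register."""
    return register_bits // element_bits


REGISTER_WIDTHS = {"SSE4.x (128-bit)": 128, "AVX2 (256-bit)": 256}

for name, width in REGISTER_WIDTHS.items():
    print(f"{name}: {simd_lanes(width, 32)} x 32-bit integers per register")
```

So the jump from SSE4.x to AVX2 doubles the integer lanes per instruction, which is the upper bound on what an encoder like x264 can gain from the wider registers alone.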
As for the scientific applications, they already benefit from AVX v1 for floating point workloads. AVX2 in Haswell is just a stop-gap as the gather is microcoded, but getting code ready for hardware gather in the future uArch is definitely a good way to go.
Finally, when Skylake arrives with AVX 3.1, this will be the next big jump after AVX (v1) for scientific / floating point use cases.
Shouldn't the Xeon E5-2687W v2 support 384 GB of memory? 4 channels * 3 slots per channel * 32 GB DIMM per slot? (Presumably it could be twice that using eight rank 64 GB DIMMs but I'm not sure if Intel has validated them on the 6 and 10 core dies.) Registered memory has to be used for the E5-2687W v2 to get to 256 GB, so is the chip just not capable of running a third slot per channel? Seems like a weird handicap. I can only imagine this being more of a design guideline than anything explicit. The 150W CPU's are workstation focused, and workstations tend to only have 8 slots maximum.
Also a bit weird is the inclusion of the E5-2400 series on the first page's table. While they use the same die, they use a different socket (LGA 1356) with triple-channel memory support and only 24 PCI-e lanes. With the smaller physical area and generally lower TDP's, they're aimed squarely at the blade server market. Socket LGA 2011 is far more popular in workstations and 1U and up servers.
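The capacity arithmetic in that question is easy to tabulate. A minimal sketch using only the channel, slot and DIMM sizes mentioned above (the function name is made up, and whether a given configuration is validated by Intel is a separate matter):

```python
def max_memory_gb(channels: int, dimms_per_channel: int, dimm_gb: int) -> int:
    """Maximum memory per socket for a given channel/slot/DIMM-size combination."""
    return channels * dimms_per_channel * dimm_gb


configs = [
    ("2 slots/channel, 32 GB DIMMs", 4, 2, 32),
    ("3 slots/channel, 32 GB DIMMs", 4, 3, 32),
    ("3 slots/channel, 64 GB DIMMs", 4, 3, 64),
]
for label, channels, slots, size in configs:
    print(f"{label}: {max_memory_gb(channels, slots, size)} GB")
```

The listed 256 GB matches the two-slots-per-channel case, which fits the guess that the limit reflects 8-slot workstation boards.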
a mass of others might argue a 12 core/24 thread chip or better is a potential "real-time" UHD x264 encoding machine , its just out of most encoders budgets, so NO SALE....
Corrected :) The test setup for the A10-7850K is the same as the Kaveri review. ASRock FM2A88X Extreme6+ with extra cooling, 2x8GB DDR3-2133 (i.e. rated processor speed).
I have two 2697v2 I was gifted, and while I'm only running one on an x79 MB, I have two questions I can't find answers to elsewhere on the 2697 v2: 1. Is the memory limited to 1866, even on motherboards supporting higher overclocks? I have tried to run memory above that speed (1866 memory that usually over clocks well) and the computer refuses to boot past the bios at that speed. 2. What would the performance gains be with both installed, in reference to multithreaded activities, like rendering, or even more rudimentary, like x264 or handbrake conversion? I would guess with single threaded activity, there would be no difference in performance, than one CPU.
I have tested two generations of 2697 v2 (C0 and C1 stepping) and both refuse to accept anything above 1866 MHz. Practically, my old workhorse (dual 2687W) was much better in that regard and could run DDR3 @2133 MHz without any trickery.
Although the CPU platforms (JakeTown and IvyTown) are pin-compatible for the EP series, high-core-count (HCC) EP IvyTowns have two separate memory controllers and I suppose this introduces regressions when it comes to "overclockability" of the RAM.
As for the #2, if you are running NUMA-aware multithreaded software that can spawn 24 or 48 threads, you can expect almost linear performance scaling with dual-CPU setups.
If the software is not NUMA aware, then there are performance drops that can be 30-50% (so you get, maybe, 1.5x speedup). If the software cannot get more than, say, 8 threads, then there would be no speedups (but even in this case you can start two separate processes and do two encoding sessions at once, and regain the 2x speedup)
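The "two separate processes" trick could look something like the sketch below. It assumes a Linux box with the `numactl` utility installed and uses its `--cpunodebind` and `--membind` options to keep each session on one socket's cores and memory; the encoder command and file names are placeholders, not a real tool's syntax. The sketch only builds and prints the commands rather than launching anything.

```python
import shlex


def pinned_command(node: int, command: list[str]) -> list[str]:
    """Prefix a command so numactl binds its threads and memory to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *command]


# Hypothetical encoder invocations, one per socket; substitute your real encoder here.
jobs = [
    ["my_encoder", "input_a.mkv", "-o", "output_a.mkv"],
    ["my_encoder", "input_b.mkv", "-o", "output_b.mkv"],
]

for node, job in enumerate(jobs):
    print(shlex.join(pinned_command(node, job)))
```

Each printed line can be started in its own shell (or via `subprocess.Popen`), so that neither session has to reach across the socket interconnect for memory.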
This doesn't really answer any question about the State of Xeon.
So what exactly is the difference between a Xeon E3 v3 and a normal top-end Haswell chip?
We have Broadwell soon (in a few months?). Are we supposed to get new Haswell E5s too? Isn't the Xeon E5 always one year behind the desktop counterpart, or are they slipping even more?
I need to spend some time to organise this with my new 2014 benchmark setup. That and I've never used bench to add data before. But I will be putting some data in there for everyone :)
There is one sad thing - disappearance of 2C/4T high clock speed CPUs, as Oracle Enterprise Edition charges by cores.....and sometimes you need just small installation but with EE features...
Wouldn't L3/thread be a more useful metric than L3/core in the big table? HT will only really work after all, if both threads are in cache, and if you can get a CPU with HT and one without, as is the case with the Xeons, you'd get the one without because you are running more concurrent threads. That means that under optimum conditions, you have 2 threads per core that are active, and thus 2x#cores threads that need to be in the data caches.
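To make the two metrics concrete, here is a small sketch that computes both for one chip. The 30 MB / 12 core figure for the E5-2697 v2 is taken from public spec sheets rather than from the article's table, and the HT-off entry is a hypothetical part included only for comparison.

```python
def l3_per_unit(l3_mb: float, cores: int, threads_per_core: int = 2):
    """Return (MB of L3 per core, MB of L3 per hardware thread)."""
    per_core = l3_mb / cores
    per_thread = l3_mb / (cores * threads_per_core)
    return per_core, per_thread


# (L3 in MB, cores, threads per core); second entry is hypothetical.
chips = {
    "E5-2697 v2, HT on": (30.0, 12, 2),
    "same die, HT off (hypothetical)": (30.0, 12, 1),
}

for name, (l3_mb, cores, tpc) in chips.items():
    per_core, per_thread = l3_per_unit(l3_mb, cores, tpc)
    print(f"{name}: {per_core:.2f} MB/core, {per_thread:.2f} MB/thread")
```

L3 per core is identical in both rows, while L3 per thread halves with HT on, which is the distinction the question is getting at.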
holy shit anandtech you really have gone to the dogs - comparing a £2000 cpu against a £100 apu and saying its better..... and really? wheres the AMD AM3+ cpu`s? 8350 or 9590? seriously
Let's see. I'm not comparing it against a £100 APU, I'm comparing it against the $1000 Core i7-4960X to see the difference. We're using a new set of benchmarks for 2014, which I have already run on the APU so I include them here as a point of reference for AMD's new highest performance line. It is interesting to see where the APU and Xeon line up in the benchmarks to show the difference (if any). AMD's old high end line has stagnated - I have not tested those CPUs in our new 2014 set of benchmarks. There have been no new AM3+ platforms or CPUs this year, or almost all of last year. Testing these two CPUs properly took the best part of three weeks, including all the other work such as news, motherboard reviews, Mobile World Congress coverage, meetings, extra testing, bug fixing, conversing with engineers on how to solve issues. Sure, let's just stop all that and pull out an old system to test. If I had the time I really would, but I was able to get these processors from GIGABYTE, not Intel, for a limited time. I have many other projects (memory scaling, Gaming CPU) that would take priority if I had time.
AKA I think you missed the point of the article. If you have a magical portal to Narnia, I'd happily test until I was blue in the face and go as far back to old Athlon s939 CPUs. But the world moves faster than that.
any chance of updating this article with some x265 and/or Divx265 benchmarks? hevc is much more processor intensive and threading friendly, so these encoders may be perfect for showing a greater separation between the various core configurations.
1. please change the charts' headings on the first page to say 'Cores/Threads' instead of 'Cores'.
2. it wasn't clear on the first page that this is talking about workstation CPUs.
3. "Intel can push core counts, frequency and thus price much higher than in the consumer space" I would have said core counts and cache... Don't the consumer parts have the highest clocks (before overclocking)?
1) I had it that way originally but it broke the table layout due to being too wide. I made a compromise and hoped people would follow the table in good faith. 2) Generally Xeon in the name means anything Workstation and above. People use Xeons for a wide variety of uses - high end for workstations, or low end for servers, or vice versa. 3) Individual core counts maybe, but when looking at 8c or 12c chips in the same power bracket, the frequency is still being pushed to more stringent requirements (thus lower yields/bin counts) vs. voltages. Then again, the E3-1290 does go to 4.0 GHz anyway, so in terms of absolute frequencies you can say (some) Xeons at least match the consumer parts.
I know you were benchmarking these Xeons for home use, thus the selection of rendering and gaming benchmarks. But there are lots of us doing home virtualization (VMWare ESXi all-in-one servers using PCI-passthrough ZFS virtual SAN and multiple VMs). It would be great to see some virtualization benchmarks. For further reference see: http://www.napp-it.org/index_en.html
Hi Ian, Thanks for a great review. Do you think there's any possibility of adding V-Ray to your workstation benchmarks? It's an incredibly popular renderer that is multi-platform and also works in pretty much any decent 3D software (Max, Maya, C4D etc). It also sucks the life out of any computer when it's running, so would be perfect for your tests.
Question. Since you discuss turbo bins at length and the article revolves around them, how does Windows Server handle load balancing in regards to the turbo bins. On a 2P E5-2697 will the OS balance all the threads on a single CPU first? Spread evenly across both CPUs? Max out all physical cores before assigning threads to logical cores?
Is the OS capable of spreading 3 threads to each processor to ensure they both run at the max turbo frequency for as long as possible? Or would it instead max out one processor to attempt to let the other retain a lower power state? For that matter is any of this even configurable under Windows Server?
Interesting article, but I don't agree with your conclusion on the 2667 vs 2687W: you say the 2667 is cheaper, OK, but a $50 difference in list price on a CPU costing $2100 is roughly 2%. You also say the 2687W v2 uses more energy than the 2667 v2; do you have proof of that? For me, the fact that the 2687W v2 has a 150 W TDP only means it can keep its turbo frequency under higher load than the 2667, with situations where the 2667's turbo mode would drop because of power usage while the 2687W v2's would not, ultimately making it a faster CPU than the 2667 under heavy loads. If the two CPUs run the same computation at the same frequency, there is no reason for the 2687W v2 to use more power; it would be like saying that an i5 and an i7 consume the same because they have the same TDP, while everybody knows that is not the case.
I just bought a 2687w v2, it ended up being $3 difference between them, I have i7-3970X so I have 150W chip anyway, so TDP wasn't really much of a factor to me.
It would be interesting to do a head to head of them and see how they perform, in thermal load/power.
Following Ian's logic he'd be super interested in the E5-2673 v2; this is the same as the 2667 but with a 110W TDP.
If the 2690 had a little higher turbo, it would be great, 10/20 with say 3 stock and 3.8 turbo
71 Comments
vLsL2VnDmWjoTByaVLxb - Monday, March 17, 2014 - link
> TrueCrypt is an off the shelf open source encoding tool for files and folders.
Encoding?
Brutalizer - Monday, March 17, 2014 - link
I would not say these cpus are for high end market. High end market are huge servers, with as many as 32 sockets, some monster servers even have 64 sockets! These expensive Unix RISC servers or IBM Mainframes, have extremely good RAS. For instance, some Mainframes do every calculation in three cpus, and if one fails it will automatically shut down. Some SPARC cpus can replay instructions if something went wrong. Hotswap cpus, and hotswap RAM. etc etc. These low end Xeon cpus have nothing of that.
PS. Remember that I distinguish between a SMP server (which is a single huge server) which might have 32/64 sockets and are hugely expensive. For instance, the IBM P595 32-socket POWER6 server used for the old TPC-C record, costed $35 million. No typo. One single huge 32 socket server, costed $35 million. Other examples are IBM P795, Oracle M5-32 - both have 32 sockets. Oracle have 32TB RAM which is the largest RAM server on the market. IBM Mainframes also belong to this category.
In contrast to this, every server larger than 32/64 sockets, is a cluster. For instance the SGI Altix or UV2000 servers, which sports up to 262.000 cores and 100s of TB. These are the characteristica of supercomputer clusters. These huge clusters are dirt cheap, and you pay essentially the hardware cost. Buy 100 nodes, and you pay 100 x $one node. Many small computers in a cluster.
Clusters are only used for HPC number crunching. SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems, as SGI explains in this link:
http://www.realworldtech.com/sgi-interview/6/
"The success of Altix systems in the high performance computing market are a very positive sign for both Linux and Itanium. Clearly, the popularity of large processor count Altix systems dispels any notions of whether Linux is a scalable OS for scientific applications. Linux is quite popular for HPC and will continue to remain so in the future,...However, scientific applications (HPC) have very different operating characteristics from commercial applications (SMP). Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a SMP workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time."
Kevin G - Tuesday, March 18, 2014 - link
@Brutalizer
And here we go again. ( http://anandtech.com/comments/7757/quad-ivy-brigde... )
“These low end Xeon cpus have nothing of that.”
This is actually accurate as the E5 is Intel’s midrange Xeon series. Intel has the E7 line for those who want more RAS or scalability to 8 sockets. Features like memory hot swap or lockstep mirroring can be found in select high end Xeon systems. If you want ultra high end RAS, you can find it if you need it, and pay the premium price for it.
“In contrast to this, every server larger than 32/64 sockets, is a cluster. For instance the SGI Altix or UV2000 servers, which sports up to 262.000 cores and 100s of TB. These are the characteristica of supercomputer clusters. These huge clusters are dirt cheap, and you pay essentially the hardware cost. Buy 100 nodes, and you pay 100 x $one node.”
Incorrect on several points but they’ve already been pointed out to you. The UV2000 is fully cache coherent (with up to 64 TB of memory) with a global address space that operates as one uniform, logical system that only a single OS/Hypervisor is necessary to boot and run.
Secondly, the price of the UV2000 does not scale linearly. There are NUMALink switches that bridge the coherency domains that have to be purchased to scale to higher node counts. This is expected of how the architecture scales and is similar to other large scale systems from IBM and Oracle.
“Clusters are only used for HPC number crunching.”
Incorrect. Clustering is standard in what you define as SMP applications (big business ERP). It is utilized to increase RAS and prevent downtime. This is standard procedure in this market.
“SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems,”
Why? As long as the underlying architecture is the same, they can run. You may not get the same RAS or scale as high in a single logical system but they’ll work. Performance is where you’d expect it on these boxes: a dual socket HPC system will perform at roughly one quarter the speed of the same chips occupying an 8 socket system.
“as SGI explains in this link:
http://www.realworldtech.com/sgi-interview/6/”
As pointed out numerous times before, the link you cite is a decade old. SGI has moved into the SMP space with the Altix UV series. Continuing to use this link as relevant is plain disingenuous and deceptive.
As for an example of a big ERP application running on such an architecture, the US Post Office runs Oracle Data Warehousing software on a UV1000. ( https://www.fbo.gov/index?s=opportunity&mode=f... )
Brutalizer - Tuesday, March 18, 2014 - link
Do you really think that UV (which is the successor to Altix) is that different? Windows is Windows, and it will not magically challenge Unix or OpenVMS, in some iterations later. Windows will not be superior to Unix after some development. You think that HPC- Altix will after some development, be superior to Oracle and IBM's huge investments of decades research in billions of USD? Do you think Oracle and IBM has stopped developing their largest servers?
Altix is only for HPC number crunching, says SGI in my link. Today the UV line of servers, has up to 262.000 cores and 100s of TB of RAM. Whereas the largest Unix and IBM Mainframes have 64 sockets and couple of TB RAM, after decades of research.
In a SMP server, all cpus will have to be connected to each other, for this SGI UV2000 with 32.768 cpus, you would need (n²) 540 million (half a billion) threads connecting each cpu. Do you really think that is feasible? Does it sound reasonable to you? IBM and Oracle and HP has had great problems connecting 32 sockets to each other, just look at the connections on the last picture at the bottom, do you see all connections? Now imagine half a billion of them in a server!
http://www.theregister.co.uk/2013/08/28/oracle_spa...
But on the other hand, if you keep the number of connection downs to islands, and then connect the islands to each other, you dont need half a billion. This solution would be feasible. And then you are not in SMP territory anymore: SGI say like this on page 4 about the UV2000 cluster:
www.sgi.com/pdfs/4395.pdf
"...SMP is based on intra-node communication using memory shared by all cores. A cluster is made up of SMP compute nodes but each node cannot communicate with each other so scaling is limited to a single compute node...."
Dont you think that a 262.000 core server and 100s of TB of RAM sounds more like a cluster, than a single fat SMP server? And why do the UV line of servers focus on OpenMPI accerators? OpenMPI is never used in SMP workloads, only in HPC.
Do you have any benchmarks where one 32.768 cpu SGI UV2000 demolishes 50-100 of the largest Oracle SPARC M6-32 in business systems? And why is the UV2000 much much cheaper than a 16/32 socket Unix server? Why does a single 32 socket Unix server cost $35 million, whereas a very large SGI cluster with 1000 of sockets is very very cheap?
Kevin G - Tuesday, March 18, 2014 - link
Wow, I think the script you're copy/pasting from needs better revision.
"Do you really think that UV (which is the successor to Altix) is that different?"
Yes. SGI changed the core architecture to add cache coherent links across the entire system. Clusters tend to have an API on top of a networking software stack to abstract the independent systems so they may act as one. The UV line does not need to do this. For one processor to use memory and perform calculations on data residing off-CPU at the other end of the system, a memory read operation is all that is needed on the UV. It is really that simple.
"Windows is Windows, and it will not magically challenge Unix or OpenVMS, in some iterations later."
The UV can run any OS that runs on modern x86 hardware today. Windows, Linux, Solaris (Unix) and perhaps at some point NonStop (HP's mainframe OS http://h17007.www1.hp.com/us/en/enterprise/servers... ). The x86 platform has plenty of choices to choose from.
"You think that HPC- Altix will after some development, be superior to Oracle and IBM's huge investments of decades research in billions of USD? Do you think Oracle and IBM has stopped developing their largest servers?"
What I see SGI offering is another tool alongside IBM and Oracle systems. Also you mention decades of research, then it is also fair to put SGI into that category as that link you love to spam IS A DECADE OLD. Clearly SGI didn't have this technology back in 2004 when that interview was written.
"Today the UV line of servers, has up to 262.000 cores and 100s of TB of RAM. Whereas the largest Unix and IBM Mainframes have 64 sockets and couple of TB RAM, after decades of research."
Actually this is a bit incorrect. IBM can scale to 131,072 cores on POWER7 if the coherency requirement is forgiven. Oh, and this system can run either AIX or Linux when maxed out. Source: http://www.theregister.co.uk/Print/2009/11/27/ibm_...
"In a SMP server, all cpus will have to be connected to each other, for this SGI UV2000 with 32.768 cpus, you would need (n²) 540 million (half a billion) threads connecting each cpu.
http://www.theregister.co.uk/2013/08/28/oracle_spa..."
Wow, do you not read your own sources? Not only is your math horribly horribly wrong, but the correct methodology for calculating the number of links as things scale is in the link you provided. To quote that link: "The Bixby interconnect does not establish everything-to-everything links at a socket level, so as you build progressively larger machines, it can take multiple hops to get from one node to another in the system. (This is no different than the NUMAlink 6 interconnect from Silicon Graphics, which implements a shared memory space using Xeon E5 chips...)"
The full implication here is that if the UV 2000 is not an SMP machine, then neither is Oracle's soon-to-be-released 96 socket device. The topology to scale is the same in both cases per your very own source.
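For scale, the everything-to-everything case being argued over can be checked with a few lines. This is a sketch of the generic full-mesh formula only (one direct link per pair of nodes, n(n-1)/2); it does not describe how NUMAlink or Bixby are actually wired, and the function name is hypothetical.

```python
def full_mesh_links(nodes: int) -> int:
    """Direct links needed if every node connects to every other node."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nodes - 1) // 2


# 32 and 96 sockets (the Oracle figures), 256 sockets (the UV2000 coherency
# limit cited in this thread), and the 32,768 figure from the quoted claim.
for n in (32, 96, 256, 32768):
    print(f"{n:>6} nodes -> {full_mesh_links(n):>12,} direct links")
```

A full mesh of 32,768 nodes would indeed need roughly 537 million links, but 256 sockets would need 32,640, and the quoted article's point is that neither vendor builds a full mesh at all: multi-hop topologies keep the link count far below these figures.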
"SGI say like this on page 4 about the UV2000 cluster:
www.sgi.com/pdfs/4395.pdf"
Fundamentally false. If you were to actually *read* the source material for that quote, it is not describing the UV2000. Rather, it is speaking generically about the differences between a cluster and a large SMP box on page 4. If you go to page 19, it further describes the UV 2000 as a single system image, unlike the cluster defined on page 4.
"Dont you think that a 262.000 core server and 100s of TB of RAM sounds more like a cluster, than a single fat SMP server? And why do the UV line of servers focus on OpenMPI accerators? OpenMPI is never used in SMP workloads, only in HPC."
All I'd say about a 262,000 core server is that it wouldn't fit into a single box. Then again IBM, Oracle and HP are spreading their large servers across multiple chassis so this doesn't bother me at all. The important part is how all these boxes are connected. SGI uses NUMAlink6 which provides cache coherency and a global address space for a single system image. OpenMPI can be used inside of a cache coherent NUMA system as it provides a means to guarantee memory locality when data is used for execution. It is a means of increasing efficiency for applications that use it. However, OpenMPI libraries do not need to be installed for software to scale across all 256 sockets on the UV2000. It is purely an option for programmers to take advantage of.
"And why is the UV2000 much much cheaper than a 16/32 socket Unix server? Why does a single 32 socket Unix server cost $35 million, whereas a very large SGI cluster with 1000 of sockets is very very cheap?"
First, to maintain coherency, the UV2000 only scales to 256 sockets/64 TB of memory. Second, the cost of a decked out P795 from IBM in terms of processors (8 sockets, 256 cores) and memory (2 TB) but only basic storage to boot the system is only $6.7 million wholesale. Still expensive but far less than what you're quoting. It'll require some math and reading comprehension to get to that figure but here is the source: http://www-01.ibm.com/common/ssi/ShowDoc.wss?docUR...
I couldn't find pricing for the UV2000 as a complete system but purchasing the Intel processors and memory separately to get to a 256 socket/64 TB system would be just under $2 million. Note that that figure is just processor + memory, no blade chassis, racks or interconnect to glue everything together. That would also be several million. So yes, the UV2000 does come out to be cheaper but not drastically. That IBM pricing document does highlight why their high end systems cost so much, mainly capacity on demand. The p795 is getting a mainframe like pricing structure where you purchase the hardware and then you have to activate it at an additional cost. Not so on the UV2000.
psyq321 - Tuesday, March 18, 2014 - link
Xeon 2697 v2 is not a "low end" Xeon.
It is part of "expandable server" platform (EP), being able to scale up to 24 cores.
That is far from "low end", at least in 2014.
alpha754293 - Wednesday, March 19, 2014 - link
"High end market are huge servers, with as many as 32 sockets, some monster servers even have 64 sockets!"
Partially true. The entire cabinet might have that many sockets/processors, but on a per-system, per-"box" level, most max out between two and four. You get a few oddballs here and there that would have a daughter board for a true 8-socket system, but those are EXTREMELY rare in actuality. (Tyan, I think, had one for the AMD Opterons, and they said that less than 5% of the orders were for the full-fledged 8-socket systems).
"PS. Remember that I distinguish between a SMP server (which is a single huge server) which might have 32/64 sockets and are hugely expensive. For instance, the IBM P595 32-socket POWER6 server used for the old TPC-C record, costed $35 million. No typo. One single huge 32 socket server, costed $35 million. Other examples are IBM P795, Oracle M5-32 - both have 32 sockets. Oracle have 32TB RAM which is the largest RAM server on the market. IBM Mainframes also belong to this category."
Again, only partially true. The costs and stuff are correct, but the assumptions that you're writing about are incorrect. SMP is symmetric multiprocessing. BY DEFINITION, that means that it "involves a multiprocessor computer hardware and software architecture where two or more identical processors connect to a single, shared main memory, have full access to all I/O devices, and are controlled by a single OS instance that treats all processors equally, reserving none for special purposes." (source: wiki) That means that it is a monolithic system, again, of which, few are TRULY such systems. If you've ever ACTUALLY witnessed the startup/bootup sequence of an ACTUAL IBM mainframe, the rest of the "nodes" are actually booted up typically by PXE or something very similar to that, and then the "node" is enumerated into the resource pool. But, for all other intents and purposes, they are semi-independent, standalone systems, because SMP systems do NOT have the capability to pass messages and/or memory calls (reads/writes/requests) without some kind of a transport layer (for example MPI).
Furthermore, the old TPC-C that you mention, they do NOT process as one monolithic sequential series of events in parallel (so think of like how PATA works...), but rather more like a JBOD SATA (i.e. the processing of the next transaction does NOT depend on ALL of the current block of transactions to be completed, UNLESS there is an inherent dependency issue, which I don't think would be very common in TPC-C). Like bank accounts, they're all treated as discrete and separate, independent entities, which means you can send all 150,000 accounts against the 32-socket or 64-socket system and it'll just pick up the next account when the current one is done, regardless.
The other failure in your statement or assumption is that there's something called HA - high availability. Which means that they can dynamically hotswap an entire node if there's a CPU failure, so that the node can be downed and yanked out for service/repair while another one is hotswapped in. So it will fail over to a spare hotswap node, work on it, and then either fail back to the newly replaced node or rotate the new one into the hotswap failover pool. (There are MANY different ways of doing that and MANY different topologies).
The statement you made about having 32TB of RAM is again, partially true. But NONE of the single OS instances EVER have full control of all 32TB at once, which again, by DEFINITION, means that it is NOT truly SMP. (Course, if you ever get a screenshot which shows that, I'd LOVE to see it. I'd LOVE to get corrected on that.)
"In contrast to this, every server larger than 32/64 sockets, is a cluster."
Again, not entirely true. You can actually get 4 socket systems that are comprised of two dual-socket nodes and THAT is enough to meet the requirements of a cluster. Heck, if you pair up two single-socket consumer-grade systems, that TOO is a cluster. That's kinda how Beowulf clusters got started - cuz it was an inexpensive way (compared to the aforementioned RISC UNIX based systems) to gain computing power without having to spend a lot of money.
"These huge clusters are dirt cheap"
Sure...if you consider IBM's $100 million contract award "cheap".
"Clusters are only used for HPC number crunching. SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems, as SGI explains in this link:"
So there's the two problems with this - 1) it's SGI - so of course they're going to promote what they ARE capable of vs. what they don't WANT to be capable of. 2) Given the SGI-biased statements, this, again, isn't EXACTLY ENTIRELY true either.
HPCs CAN run ERP systems.
"HPC vendors are increasingly targeting commercial markets, whereas commercial vendors, such as Oracle, SAP and SAS, are seeing HPC requirements." (Source: http://www.information-age.com/it-management/strat...)
But that also depends on the specific implementation of the ERP system given that SAP is NOT the ONLY ERP system that's available out there, but it's probably one of the most popular ones, if not THE most popular one. (There's a whole thing about distributed relational databases so that the database can reside in smaller chunks across multiple nodes, in-memory, which are then accessed via a high speed interconnect like Myrinet or Infiniband or something along those lines.)
Furthermore, the fact that ERP runs across large mainframes (it grows as the needs grow), is an indication of HPC's place in ERP. Alternatively, perhaps rather than using it for the backend, HPC can be used on the front end by supporting many, many, many virtualized front-end clients.
Like I said, most of the numbers that you wrote are true, but the assumptions behind them aren't exactly all entirely true.
See also: http://csserver.evansville.edu/~mr56/Publications/...
Kevin G - Wednesday, March 19, 2014 - link
"That means that it is a monolithic system, again, of which, few are TRULY such systems. If you've ever ACTUALLY witnessed the startup/bootup sequence of an ACTUAL IBM mainframe, the rest of the "nodes" are actually booted up typically by PXE or something very similar to that, and then the "node" is enumerated into the resource pool. But, for all other intents and purposes, they are semi-independent, standalone systems, because SMP systems do NOT have the capability to pass messages and/or memory calls (reads/writes/requests) without some kind of a transport layer (for example MPI)."
Not exactly. IBM's recent boxes don't boot themselves. Each box has a service processor that initializes the main CPUs and determines if there are any additional boxes connected via external GX links. If it finds external boxes, some negotiation is done to join them into one large coherent system before an attempt to load an OS is made. This is all done in hardware/firmware. Adding/removing these boxes can be done but there are rules to follow to prevent data loss.
It'll be interesting to see what IBM does with their next generation of hardware as the GX bus is disappearing.
"The statement you made about having 32TB of RAM is again, partially true. But NONE of the single OS instances EVER have full control of all 32TB at once, which again, by DEFINITION, means that it is NOT truly SMP. (Course, if you ever get a screenshot which shows that, I'd LOVE to see it. I'd LOVE to get corrected on that.)"
Actually on some of these larger systems, a single OS can see the entire memory pool and span across all sockets. The SGI UV2000 and SPARC M6 are fully cache coherent across a global memory address space.
As for a screenshot, I didn't find one. I did find a video going over some of the UV 2000 features displaying all of this though. It is only a 64 socket, 512 core, 1024 thread, 2 TB of RAM configuration running a single instance of Linux. :)
https://www.youtube.com/watch?v=YUmBu6A2ykY
IBM's topology is weird in that while a global memory address space is shared across nodes, it is not cache coherent. IBM's POWER7 and their recent BlueGene systems can be configured like this. I wouldn't call these setups clusters as there is no software overhead to read/write to remote memory addresses but it isn't fully SMP either due to multiple coherency domains.
silverblue - Monday, March 17, 2014 - link
The A10-7850K is a 2M/4T CPU.
Ian Cutress - Monday, March 17, 2014 - link
Thanks for the correction, small brain fart on my part when generating the graphs.
lever_age - Monday, March 17, 2014 - link
Ian, the Xeon E3-1220v3 and E3-1225v3 do not have hyperthreading. They're incorrectly listed in the table as 4c/8t. At those prices, if they did have 8t, more people would be buying them! Also, I think that "c3" by the E3-1230 is a typo.
Ian Cutress - Monday, March 17, 2014 - link
Correct, I missed that going through the data.
psyq321 - Monday, March 17, 2014 - link
"If the E5-2697 v2 was put in this position, we would have 12 cores at 3.5 GHz, ready to blast through the workload."
I do not think this is possible. I have tried to force all cores to turbo mode with ThrottleStop on the 2697v2 Xeon (ThrottleStop bypasses the BIOS/UEFI and codes the limits directly using MSR registers), but the CPU will just refuse to grant this.
I suppose the desktop CPUs simply have this option unlocked, while 2S/4S Xeons have much stricter operating point limits.
The best I can do with 2697 v2 is to set the BCLK to 105 MHz with Z9PE-D8 WS and get 3.15 GHz maximum all-core turbo. This is as much overclock as the system can take.
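The arithmetic behind that figure is just base clock times multiplier. A minimal sketch, assuming an all-core turbo multiplier of 30x (inferred here from 3.15 GHz at 105 MHz, not read from the CPU); the names are hypothetical.

```python
# Assumption: all-core turbo multiplier inferred from 3.15 GHz / 105 MHz.
ALL_CORE_TURBO_MULTIPLIER = 30


def core_clock_mhz(bclk_mhz: float, multiplier: int = ALL_CORE_TURBO_MULTIPLIER) -> float:
    """Core frequency in MHz for a given base clock and multiplier."""
    return bclk_mhz * multiplier


for bclk in (100, 105, 110):
    print(f"BCLK {bclk} MHz x {ALL_CORE_TURBO_MULTIPLIER} = {core_clock_mhz(bclk) / 1000:.2f} GHz")
```

With the multiplier locked, BCLK is the only lever, so each extra 1 MHz of base clock adds 30 MHz of all-core frequency under this assumption.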
Ian Cutress - Monday, March 17, 2014 - link
I've managed to get 110 BCLK on both processors relatively stable (112 BCLK needs a push), but this boosts up from the lower multiplier rather than the high one, and there is still a deficit on the high end. Enthusiasts will always want more, and I'd love the chance to run all the cores at the top turbo mode. Given how this is on the consumer line, it makes me wonder why Intel doesn't allow it here. The downside on the consumer line of allowing this behaviour is every so often there is a motherboard that fails to implement any Turbo Core, which has happened in my testing already.
psyq321 - Monday, March 17, 2014 - link
I suppose some Intel marketing people just wanted to stop a possibility of lower-end Xeon cannibalizing higher-end Xeons by cheap overclocking. There are markets even in server business that are OK with overclocking (low latency trading, for example).
It is interesting that you got it to 110 MHz BCLK. Did you use a server/workstation board or a HEDT (X79) board?
Ian Cutress - Monday, March 17, 2014 - link
This was in the MSI X79A-GD45 Plus, an X79 board. There are some server market areas that do sell pre-overclocked systems in this way, while still using Xeons, if I remember correctly.
psyq321 - Monday, March 17, 2014 - link
I think HEDT boards are better when it comes to overclocking, probably due to higher component tolerances.
Z9PE-D8 WS is not that good, but then again, it is a 2S board and apart from some Supermicro products (the "hyperspeed" series), the only 2S board that allows at least some overclocking of DDR3 and CPU.
Slomo4shO - Monday, March 17, 2014 - link
I am perplexed by the gaming benchmarks... Any particular reason why the 4770K and A10-7850K don't show up on all of the single and double GPU benchmarks? Especially considering that you have some tri-fire benches of the A10-7850K...
Ian Cutress - Monday, March 17, 2014 - link
The AMD platform does not allow dual NVIDIA cards because it does not support SLI. I need to re-run the 4770K in a PLX8747 enabled motherboard to get 3x SLI results across the board (you cannot get 3x SLI without a PLX chip), and I have not had a chance to run either CPU on my BF4 benchmark, which has just been finalised for this review. The A10-7850K and i7-4770K numbers were taken from the Kaveri review and some internal testing - now my 2014 benchmarks are finalised I can run them on more platforms as the year goes on.
et20 - Monday, March 17, 2014 - link
Why aren't Xeon E3s the recommended CPUs for enthusiast desktops?
They make more sense than the Core i5 and i7 which come with integrated graphics that never gets used.
SirKnobsworth - Monday, March 17, 2014 - link
The LGA 2011 i7 chips don't have integrated graphics.
Voldenuit - Monday, March 17, 2014 - link
Yeah, the Xeon E3s are great processors for enthusiasts/consumers.
I was tossing up between an E3-1230 v3 (2MB more cache) or an i5 4670K (IGP, unlocked multiplier) in my latest build, and ended up with the 4670K because it had a $30 instant rebate, and I wanted quicksync (I know Ganesh says it has inferior transcode quality to x264, but I just need something quick and dirty for my phone, and honestly, I'm seeing more artefacts in the x264 samples he posted than in the QS samples). But it wasn't an easy decision.
Minor nitpick: you can get Xeon E3s with Intel HD graphics. Those are model numbers E3-12x5
http://ark.intel.com/products/family/78581/Intel-X...
mazzy - Monday, March 17, 2014 - link
Even more if you think about the fact that E3 Xeons are i7s at i5 prices... with ECC and VT tech... ok no OC...
MrSpadge - Tuesday, March 18, 2014 - link
I'd take a CPU with IGP any day, unless it costs significantly more. Reason: increased resale value, even if I don't use the IGP myself.
CrazyElf - Tuesday, March 18, 2014 - link
"I'd take a CPU with IGP any day, unless it costs significantly more. Reason: increased resale value, even if I don't use the IGP myself."
I would not.
The reason is that even when you use a discrete card, the IGP draws additional power. IIRC for Sandy Bridge, a 3.6 GHz Sandy without the IGP switched on uses as much as a 3.4 GHz with the IGP on. Seeing that graphics have become a bigger emphasis with Haswell, I expect the IGP to take up a bigger percentage of total power consumption over time.
austinsguitar - Monday, March 17, 2014 - link
okay...please tell me im not the only one who thinks this article is hilarious! look...if you are going to review two very expensive cpu's, don't test these against normal cpu's. It is shear idiocy to compare these to "desktop cpu's" because they are not....at all. Why can't anandtech do reviews on other affordable xeons and compare them all together. Also funny how they included AMD. That's cute
jemima puddle-duck - Monday, March 17, 2014 - link
It's spelt 'sheer'. That aside, your ideas are intriguing to me, and I wish to subscribe to your newsletter.
PEJUman - Monday, March 17, 2014 - link
you're the only one who thinks this article is hilarious.
I thought/wished about this dataset previously. What if we simply go to XEONs and trade off clock for cores, i.e. the next logical progression from the i7-4960X price bracket? Obviously with no control over turbo bins & OC capability, the single thread is hosed. But I wondered how many cores near-pro programs can really utilize.
I am just glad we get to see some comparisons. Getting more XEON datasets would be interesting, but I think it's beyond most Anandtech readers' usage pattern.
austinsguitar - Monday, March 17, 2014 - link
You do indeed have a point about consumer cpu's and how they can overclock and are nifty to have for changing multiple options. but...the xeon market for average consumers needs to either be booming or a well kept secret. the 1230v3 is a fantastic cpu and can (for the most part) equal a 4770 for 240 dollars. I know it can not be overclocked, but people (about 70% of the market) do not give a care.
But anandtech viewers are overly intelligent aren't we?
MrSpadge - Tuesday, March 18, 2014 - link
Had they not included these regular CPUs people would (rightfully) complain "what do these numbers mean without familiar comparison points?" Comparing 2 products to work out the real-world differences for some specific application is completely different from claiming they'd be equal.
XZerg - Monday, March 17, 2014 - link
this bench also shows that the haswell had almost no CPU related performance benefits over IVB (if not slowed down performance) looking at 3770k vs 4770k and that haswell ups the gpu performance only.
i really question intel's skuing of haswell...
Nintendo Maniac 64 - Monday, March 17, 2014 - link
Emulation?
BMNify - Monday, March 17, 2014 - link
it's a shame they didn't do a UHD x264 encode here as that would have shown a haswell AVX2 improvement (something like 90% over AVX), and why people will have to wait for the xeons to catch up to at least AVX2 if not AVX3.1
psyq321 - Wednesday, March 19, 2014 - link
There is no "90% speedup over AVX" between HSW and IVB architectures.
AVX (v1) is floating point only and thus was useless for x264. For floating point workloads you would be very lucky to get 10% improvement by jumping to AVX2. The only difference between AVX and AVX2 for floating point is the FMA instruction and gather, but gather is done in microcode for Haswell, so it is not actually much faster than manually gathering data.
Now, x264 AVX2 is a big improvement because it is an integer workload, and with AVX (v1) you could not do that. So x264 is jumping from SSE4.x to AVX2, which is a huge jump and it allows much more efficient processing.
For integer workloads that can be optimized so that you load and process eight 32-bit values at once, AVX2 Xeon EPs/EXs will be a big thing. Unfortunately, this is not so easy to do for general-purpose algorithms. The x264 team did a great job, but I doubt you will be using a 14-core single Haswell EP (or a 28-core dual-CPU setup) for H.264 transcoding. This job can probably be done much more efficiently with dedicated accelerators.
As for the scientific applications, they already benefit from AVX v1 for floating point workloads. AVX2 in Haswell is just a stop-gap as the gather is microcoded, but getting code ready for hardware gather in the future uArch is definitely a good way to go.
Finally, when Skylake arrives with AVX 3.1, this will be the next big jump after AVX (v1) for scientific / floating point use cases.
Kevin G - Monday, March 17, 2014 - link
Shouldn't the Xeon E5-2687W v2 support 384 GB of memory? 4 channels * 3 slots per channel * 32 GB DIMM per slot? (Presumably it could be twice that using eight-rank 64 GB DIMMs, but I'm not sure if Intel has validated them on the 6 and 10 core dies.) Registered memory has to be used for the E5-2687W v2 to get to 256 GB, so is the chip just not capable of running a third slot per channel? Seems like a weird handicap. I can only imagine this being more of a design guideline than anything explicit. The 150W CPUs are workstation focused, and workstations tend to have only 8 slots maximum.
Also a bit weird is the inclusion of the E5-2400 series on the first page's table. While they use the same die, they use a different socket (LGA 1356) with triple-channel memory support and only 24 PCI-e lanes. With the smaller physical area and generally lower TDPs, they're aimed squarely at the blade server market. Socket LGA 2011 is far more popular in workstations and in 1U and larger servers.
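The capacity arithmetic in this comment is easy to check. A minimal sketch follows; the function name is hypothetical, and the 2-DIMMs-per-channel case is an assumption about where the 256 GB figure comes from (it matches the 8-slot workstation boards mentioned above).

```python
def max_memory_gb(channels: int, dimms_per_channel: int, dimm_size_gb: int) -> int:
    """Maximum memory per socket: channels x DIMMs per channel x DIMM size."""
    return channels * dimms_per_channel * dimm_size_gb


if __name__ == "__main__":
    # Three slots per channel with 32 GB DIMMs, as proposed in the comment.
    print(max_memory_gb(channels=4, dimms_per_channel=3, dimm_size_gb=32))
    # Assumed: two slots per channel (8 slots total) gives the 256 GB limit.
    print(max_memory_gb(channels=4, dimms_per_channel=2, dimm_size_gb=32))
```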
jchernia - Monday, March 17, 2014 - link
A 12 core chip is a server chip - the workstation/PC benchmarks are interesting, but the really interesting benchmarks would be on the server side.
Ian Cutress - Monday, March 17, 2014 - link
Johan covered the server side in his article - I link to it many times in the review:
http://www.anandtech.com/show/7285/intel-xeon-e5-2...
BMNify - Monday, March 17, 2014 - link
a mass of others might argue a 12 core/24 thread chip or better is a potential "real-time" UHD x264 encoding machine, it's just out of most encoders' budgets, so NO SALE....
Nintendo Maniac 64 - Monday, March 17, 2014 - link
Uh, where's the test set up for the 7850K?
Nintendo Maniac 64 - Monday, March 17, 2014 - link
Also I believe I found a typo:
"Haswell provided a significant post to emulator performance"
Shouldn't this say 'boost' rather than 'post'?
Ian Cutress - Monday, March 17, 2014 - link
Corrected :) The test setup for the A10-7850K is the same as the Kaveri review. ASRock FM2A88X Extreme6+ with extra cooling, 2x8GB DDR3-2133 (i.e. rated processor speed).
Nintendo Maniac 64 - Monday, March 17, 2014 - link
So stock clocks with turbo enabled on Win7 64bit SP1 w/ core parking update?
Ian Cutress - Monday, March 17, 2014 - link
Correct.
mattchid - Monday, March 17, 2014 - link
I have two 2697v2 I was gifted, and while I'm only running one on an x79 MB, I have two questions I can't find answers to elsewhere on the 2697 v2:
1. Is the memory limited to 1866, even on motherboards supporting higher overclocks? I have tried to run memory above that speed (1866 memory that usually overclocks well) and the computer refuses to boot past the BIOS at that speed.
2. What would the performance gains be with both installed, in reference to multithreaded activities, like rendering, or even more rudimentary, like x264 or handbrake conversion? I would guess with single threaded activity, there would be no difference in performance, than one CPU.
psyq321 - Tuesday, March 18, 2014 - link
I have tested two generations of 2697 v2 (C0 and C1 stepping) and both refuse to accept anything above 1866 MHz. Practically, my old workhorse (dual 2687W) was much better in that regard and could run DDR3 @2133 MHz without any trickery.
Although the CPU platforms (JakeTown and IvyTown) are pin-compatible for the EP series, high-core-count (HCC) EP IvyTowns have two separate memory controllers and I suppose this introduces regressions when it comes to "overclockability" of the RAM.
As for the #2, if you are running NUMA-aware multithreaded software that can spawn 24 or 48 threads, you can expect almost linear performance scaling with dual-CPU setups.
If the software is not NUMA aware, then there are performance drops that can be 30-50% (so you get, maybe, 1.5x speedup). If the software cannot get more than, say, 8 threads, then there would be no speedups (but even in this case you can start two separate processes and do two encoding sessions at once, and regain the 2x speedup)
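The rules of thumb in this answer can be written down as a toy model. Everything below is a rough sketch: the function and parameter names are hypothetical, the 1.5x figure is just the "maybe 1.5x" quoted above, and the cut-off of one socket's worth of cores (12) is an assumption standing in for "cannot use more than, say, 8 threads". It is not a measurement.

```python
def estimated_dual_cpu_speedup(
    max_threads: int,
    numa_aware: bool,
    processes: int = 1,
    cores_per_socket: int = 12,
) -> float:
    """Rough speedup of a dual-socket box over a single socket, per the rules of thumb above."""
    if processes >= 2:
        # Two independent jobs (e.g. two encodes), one per socket: about 2x throughput.
        return 2.0
    if max_threads <= cores_per_socket:
        # Assumed cut-off: the job cannot fill even one socket, so a second one is idle.
        return 1.0
    # Enough threads for both sockets: near-linear if NUMA-aware, penalised otherwise.
    return 2.0 if numa_aware else 1.5


if __name__ == "__main__":
    print(estimated_dual_cpu_speedup(max_threads=48, numa_aware=True))
    print(estimated_dual_cpu_speedup(max_threads=48, numa_aware=False))
    print(estimated_dual_cpu_speedup(max_threads=8, numa_aware=False))
    print(estimated_dual_cpu_speedup(max_threads=8, numa_aware=False, processes=2))
```

The last case is the workaround described above: a thread-limited encoder gains nothing from the second CPU on its own, but running two sessions at once recovers the 2x.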
CamdogXIII - Monday, March 17, 2014 - link
Typo in the gaming benchmarks. Under BF4, the button selection heading reads company of heroes
Ian Cutress - Tuesday, March 18, 2014 - link
Corrected :)
iwod - Monday, March 17, 2014 - link
This doesn't really answer any question about the State of Xeon.
So what exactly is the difference between a Xeon E3 v3 and a Normal Top End Haswell Chip?
We have Broadwell soon (in a few months?). Are we supposed to get new Haswell E5s too? Isn't the Xeon E5 always one year behind the desktop counterpart, or are they slipping even more?
JlHADJOE - Monday, March 17, 2014 - link
Xeon E3 has ECC support, which is pretty cool.
venk90 - Tuesday, March 18, 2014 - link
Ian, could you post the AMD Kaveri CPU and GPU numbers to the respective bench sections of this website? Makes comparisons a lot easier.
Ian Cutress - Tuesday, March 18, 2014 - link
I need to spend some time to organise this with my new 2014 benchmark setup. That and I've never used bench to add data before. But I will be putting some data in there for everyone :)
Maxal - Tuesday, March 18, 2014 - link
There is one sad thing - disappearance of 2C/4T high clock speed CPUs, as Oracle Enterprise Edition charges by cores.....and sometimes you need just small installation but with EE features...
Rick83 - Tuesday, March 18, 2014 - link
Wouldn't L3/thread be a more useful metric than L3/core in the big table?
HT will only really work, after all, if both threads are in cache, and if you can get a CPU with HT and one without, as is the case with the Xeons, you'd get the one without because you are running more concurrent threads. That means that under optimum conditions, you have 2 threads per core that are active, and thus 2x#cores threads that need to be in the data caches.
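To make the suggested metric concrete, here is a small sketch comparing L3 per core with L3 per thread. The core, thread and cache figures are assumptions quoted from memory of Intel's public spec listings, not taken from the article's table, and the helper names are hypothetical.

```python
# Assumed spec figures: (cores, threads, L3 cache in MB).
CHIPS = {
    "E5-2697 v2": (12, 24, 30.0),
    "E5-2687W v2": (8, 16, 25.0),
}


def l3_per_core(cores: int, l3_mb: float) -> float:
    """L3 cache per physical core, in MB."""
    return l3_mb / cores


def l3_per_thread(threads: int, l3_mb: float) -> float:
    """L3 cache per hardware thread, in MB (half of L3/core when HT is on)."""
    return l3_mb / threads


if __name__ == "__main__":
    for name, (cores, threads, l3_mb) in CHIPS.items():
        print(
            f"{name}: {l3_per_core(cores, l3_mb):.3f} MB/core, "
            f"{l3_per_thread(threads, l3_mb):.3f} MB/thread"
        )
```

For a part without HT, threads equals cores and the two metrics coincide, which is the distinction the comment is asking the table to show.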
HalloweenJack - Tuesday, March 18, 2014 - link
holy shit anandtech you really have gone to the dogs - comparing a £2000 cpu against a £100 apu and saying its better..... and really? wheres the AMD AM3+ cpu`s? 8350 or 9590? seriously
Ian Cutress - Tuesday, March 18, 2014 - link
Let's see. I'm not comparing it against a £100 APU, I'm comparing it against the $1000 Core i7-4960X to see the difference. We're using a new set of benchmarks for 2014, which I have already run on the APU so I include them here as a point of reference for AMD's new highest performance line. It is interesting to see where the APU and Xeon line up in the benchmarks to show the difference (if any). AMD's old high end line has stagnated - I have not tested those CPUs in our new 2014 set of benchmarks. There have been no new AM3+ platforms or CPUs this year, or almost all of last year. Testing these two CPUs properly took the best part of three weeks, including all the other work such as news, motherboard reviews, Mobile World Congress coverage, meetings, extra testing, bug fixing, conversing with engineers on how to solve issues. Sure, let's just stop all that and pull out an old system to test. If I had the time I really would, but I was able to get these processors from GIGABYTE, not Intel, for a limited time. I have many other projects (memory scaling, Gaming CPU) that would take priority if I had time.
AKA I think you missed the point of the article. If you have a magical portal to Narnia, I'd happily test until I was blue in the face and go as far back to old Athlon s939 CPUs. But the world moves faster than that.
deadrats - Tuesday, March 18, 2014 - link
any chance of updating this article with some x265 and/or Divx265 benchmarks? hevc is much more processor intensive and threading friendly, so these encoders may be perfect for showing a greater separation between the various core configurations.
Ian Cutress - Tuesday, March 18, 2014 - link
If you have an encoder in mind drop me an email. Click my name at the top of the article.
bobbozzo - Tuesday, March 18, 2014 - link
Hi,
1. please change the charts' headings on the first page to say 'Cores/Threads' instead of 'Cores'.
2. it wasn't clear on the first page that this is talking about workstation CPUs.
3. "Intel can push core counts, frequency and thus price much higher than in the consumer space"
I would have said core counts and cache...
Don't the consumer parts have the highest clocks (before overclocking)?
Thanks!
bobbozzo - Tuesday, March 18, 2014 - link
"it wasn't clear on the first page that this is talking about workstation CPUs."
As opposed to servers.
Ian Cutress - Tuesday, March 18, 2014 - link
1) I had it that way originally but it broke the table layout due to being too wide. I made a compromise and hoped people would follow the table in good faith.
2) Generally Xeon in the name means anything Workstation and above. People use Xeons for a wide variety of uses - high end for workstations, or low end for servers, or vice versa.
3) Individual core counts maybe, but when looking at 8c or 12c chips in the same power bracket, the frequency is still being pushed to more stringent requirements (thus lower yields/bin counts) vs. voltages. Then again, the E3-1290 does go to 4.0 GHz anyway, so in terms of absolute frequencies you can say (some) Xeons at least match the consumer parts.
mrnuxi - Tuesday, March 18, 2014 - link
I know you were benchmarking these Xeons for home use, thus the selection of rendering and gaming benchmarks. But there are lots of us doing home virtualization (VMWare ESXi all-in-one servers using PCI-passthrough ZFS virtual SAN and multiple VMs). It would be great to see some virtualization benchmarks. For further reference see: http://www.napp-it.org/index_en.html
Ian Cutress - Tuesday, March 18, 2014 - link
Johan covered the server side in his article -
http://www.anandtech.com/show/7285/intel-xeon-e5-2...
alpha754293 - Wednesday, March 19, 2014 - link
No LS-DYNA or other HPC benchmark results??? Talk to Johan.
colonelclaw - Wednesday, March 19, 2014 - link
Hi Ian, Thanks for a great review.
Do you think there's any possibility of adding V-Ray to your workstation benchmarks? It's an incredibly popular renderer that is multi-platform and also works in pretty much any decent 3D software (Max, Maya, C4D etc). It also sucks the life out of any computer when it's running, so would be perfect for your tests.
Kougar - Wednesday, March 19, 2014 - link
Question. Since you discuss turbo bins at length and the article revolves around them, how does Windows Server handle load balancing in regards to the turbo bins? On a 2P E5-2697 will the OS balance all the threads on a single CPU first? Spread evenly across both CPUs? Max out all physical cores before assigning threads to logical cores?
Is the OS capable of spreading 3 threads to each processor to ensure they both run at the max turbo frequency for as long as possible? Or would it instead max out one processor to attempt to let the other retain a lower power state? For that matter, is any of this even configurable under Windows Server?
Ytterbium - Saturday, May 3, 2014 - link
Ian, does MCE work with Xeon?
Ytterbium - Tuesday, May 6, 2014 - link
MCE doesn't seem to work with Xeon 2687w.
RadamanthysBe - Sunday, May 4, 2014 - link
Interesting article, but I don't agree with your conclusion on the 2667 vs 2687W:
You say the 2667 is cheaper, OK, but a $50 difference in list price on a CPU costing $2100 is only about 2.4%.
You also say the 2687W v2 uses more energy than the 2667 v2; do you have proof of that? To me, the 2687W v2's 150 W TDP only means it can hold its turbo frequency under heavier load than the 2667 v2: there are situations where the 2667 v2 would drop out of turbo because of power draw while the 2687W v2 would not, which makes it, in the end, the faster CPU under heavy loads. If the two CPUs run the same computation at the same frequency, there is no reason for the 2687W v2 to use more power. It would be like saying that an i5 and an i7 consume the same because they have the same TDP, when everybody knows that is not the case.
Ytterbium - Monday, May 5, 2014 - link
I just bought a 2687W v2; it ended up being a $3 difference between them. I have an i7-3970X, so I have a 150W chip anyway, and TDP wasn't really much of a factor to me.
It would be interesting to do a head to head of them and see how they perform in thermal load/power.
Following Ian's logic he'd be super interested in the E5-2673 v2, which is the same as the 2667 but with a 110W TDP.
If the 2690 had a little higher turbo, it would be great: 10/20 with, say, 3.0 GHz stock and 3.8 GHz turbo.
Ytterbium - Tuesday, May 6, 2014 - link
The 2687 I got seems to run a bit cooler than my 3970X, even though they're rated for the same TDP.
Venoms - Sunday, November 2, 2014 - link
I wonder which would be the more preferable all-around CPU?
I do some gaming and a lot of Maya (VRay, etc.)
I am wondering if I should go dual E5-2697v2 (for more cores), or E5-2687W v2 (for higher clock speed)?