NVIDIA GK110 vs. Dual Intel Xeon E5-2687W - Fight!

DigiCortex Engine v0.95 finally brings early support for GPU acceleration on CUDA-capable hardware. Although the code is still at a very early stage, it already shows great potential. In this article, we take a look at the performance achieved by the CUDA plug-in compared to the CPU code.

One thing worth noting: many online tests dealing with GPU vs. CPU performance typically use single-threaded or simply unoptimized CPU code and pit it against GPU code written to take full advantage of the GPU architecture. Overlooking this fact can lead to premature conclusions, as the resulting numbers are often unusually high (speed-up claims of several hundred times are not extraordinary). While there certainly are algorithms that fit the GPU architecture and memory model very well (e.g. "embarrassingly parallel" algorithms), most code is not that trivial to optimize for the GPU in order to extract these kinds of speed-ups.

In DigiCortex Engine we are actually comparing an early version of the CUDA code (not yet fully optimized) to fully optimized CPU code, written to take maximum advantage of modern Intel CPUs by using the AVX instruction set and an optimized memory layout. We even use a rather unorthodox hardware setup, with DDR3 RAM running at 2133 MHz - outside the Xeon E5 specification - but such memory is widely available on the market with little price difference (albeit in non-ECC form, as of March 2013). This way, we believe the comparison is "fair" in the sense that we are not comparing optimized GPU code against unoptimized CPU code.
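
To illustrate what "optimized CPU code" means in this context, here is a minimal sketch of an AVX-vectorized neuron update over a structure-of-arrays state layout. The function name, the variable names and the toy leaky-integrator update are illustrative assumptions, not DigiCortex's actual neuron model or internal code:

```cpp
#include <immintrin.h>
#include <cstddef>

// Neuron state kept in structure-of-arrays form, so eight single-precision
// membrane potentials can be loaded, updated and stored per AVX instruction.
// (Hypothetical example; the real engine's model and layout may differ.)
void integrate_membrane(float* v, const float* inputCurrent,
                        std::size_t n, float leak, float dt)
{
    const __m256 vLeak = _mm256_set1_ps(leak);
    const __m256 vDt   = _mm256_set1_ps(dt);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vm = _mm256_loadu_ps(v + i);
        __m256 vi = _mm256_loadu_ps(inputCurrent + i);
        // dv = (input - leak * v) * dt;  v += dv
        __m256 dv = _mm256_mul_ps(_mm256_sub_ps(vi, _mm256_mul_ps(vLeak, vm)), vDt);
        vm = _mm256_add_ps(vm, dv);
        _mm256_storeu_ps(v + i, vm);
    }
    for (; i < n; ++i)  // scalar tail for the remaining neurons
        v[i] += (inputCurrent[i] - leak * v[i]) * dt;
}
```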

The system used in the test is a workstation-grade PC with dual Intel Xeon E5-2687W CPUs - the top-of-the-range Intel Sandy Bridge EP parts, running at 3.1 GHz - on an ASUS Z9PE-D8 WS motherboard, with 128 GB of DDR3 RAM running at 2133 MHz. Although Intel already has a newer "tick" architecture refresh (Ivy Bridge) on the market, and the new "tock" architecture (Haswell) is soon to be released, neither is yet available in workstation/server-grade SKUs with 8 cores. We would expect Ivy Bridge EP to bring typical IPC improvements of ~15-20% at the same TDP, as seen in previous architecture jumps, which can be used to estimate the performance of a future Ivy Bridge EP system.

The GPU used in the tests is the NVIDIA Titan with the GK110 GPU, which represents the top of the Kepler-generation range. There are more expensive NVIDIA GK110 cards available (Tesla K20), but the GPU inside is the same, and the features enabled in the Tesla K20 SKU are currently not used by DigiCortex.

So, how do these two architectures stack up against each other?

We chose to simulate several different configurations of the thalamocortical system, ranging from 32768 neurons with 2 million synapses up to 512K neurons with ~86 million synapses, in order to see how the CPU and GPU code scale with different workloads.

32768 Neurons - 2.0 Million Synapses

| Platform | Num. Cores | Num. Threads | Core Speed | RAM | Max. Memory Bandwidth | Avg. Simulation Speed | Rel. Speedup |
|----------|------------|--------------|------------|-----|-----------------------|-----------------------|--------------|
| Dual Intel Xeon E5-2687W ("Sandy Bridge EP") | 16 | 32 | 3100 MHz | DDR3 2133 MHz | 68.2 GiB/s | 1.170x | 1.0x |
| NVIDIA GK110 "Titan" | 14 | 2688 | 837 MHz | GDDR5 6008 MHz | 288 GiB/s | 1.8x | 1.6x |

As can be seen from the table above, for small simulations the GPU code path is only ~1.6x faster than the CPU code path. This is primarily because the simulation cannot extract significant gains from the massive parallelism - there is simply not enough work to feed the GPU with.
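
A rough back-of-the-envelope calculation makes this concrete. Assuming roughly one GPU thread per neuron (an illustrative assumption about the mapping, not a statement about DigiCortex internals), the smallest configuration barely fills the GPU once:

```cpp
#include <cstdio>

// Estimate how well each workload fills the GPU, assuming one thread per
// neuron. The Titan's GK110 exposes 14 SMX units, each able to keep up to
// 2048 threads resident.
int main() {
    const int residentThreads = 14 * 2048;  // ~28.7K concurrently resident threads
    const int neurons[] = { 32768, 262144, 524288 };
    for (int n : neurons)
        std::printf("%6d neurons -> %4.1f waves of resident threads\n",
                    n, static_cast<double>(n) / residentThreads);
    // 32768 neurons fill the GPU barely once, leaving little parallel slack
    // to hide memory latency; the larger runs supply many waves of work.
    return 0;
}
```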

However, as we move up to larger simulations, the situation begins to change...

262144 Neurons - 31 Million Synapses

| Platform | Num. Cores | Num. Threads | Core Speed | RAM | Max. Memory Bandwidth | Avg. Simulation Speed | Rel. Speedup |
|----------|------------|--------------|------------|-----|-----------------------|-----------------------|--------------|
| Dual Intel Xeon E5-2687W ("Sandy Bridge EP") | 16 | 32 | 3100 MHz | DDR3 2133 MHz | 68.2 GiB/s | 0.085x | 1.0x |
| NVIDIA GK110 "Titan" | 14 | 2688 | 837 MHz | GDDR5 6008 MHz | 288 GiB/s | 0.450x | 5.3x |

Here we see that as the number of neurons and synapses grows, the speedups achieved with the GPU code path become more substantial. With 262144 neurons and 31 million synapses, the GPU code is already ~5.3 times faster. While this is without doubt better than the CPU code, there is still room to grow...

524288 Neurons - 86 Million Synapses

| Platform | Num. Cores | Num. Threads | Core Speed | RAM | Max. Memory Bandwidth | Avg. Simulation Speed | Rel. Speedup |
|----------|------------|--------------|------------|-----|-----------------------|-----------------------|--------------|
| Dual Intel Xeon E5-2687W ("Sandy Bridge EP") | 16 | 32 | 3100 MHz | DDR3 2133 MHz | 68.2 GiB/s | 0.035x | 1.0x |
| NVIDIA GK110 "Titan" | 14 | 2688 | 837 MHz | GDDR5 6008 MHz | 288 GiB/s | 0.250x | 7.1x |

With the number of synapses approaching the 100 million mark, the GPU code path is 7.1 times faster than the CPU path. Considering that the GPU code is still not fully optimized, this speedup is remarkable. With additional optimizations we expect it to exceed a factor of 10 over the highly optimized CPU code.

Performance per Dollar

Another interesting aspect of the GPU speedup is the "performance per dollar" (or Euro, here in Germany) question. While the NVIDIA GK110 Titan is far from being the cheapest GPU, it is still considerably cheaper than an Intel Xeon E5-2687W. However, to be fair, we shall compare the Tesla SKU with the Intel Xeon E5-2687W, as Tesla supports ECC memory - useful for large-scale scientific applications.

As of today (March 15th, 2013), one NVIDIA Tesla K20 and two Intel Xeon E5-2687W CPUs cost about the same in Germany (of course, you still need the rest of the system, but let's assume its price would be comparable). So, when doing large-scale neural network simulations with DigiCortex, for a similar price it is possible to get roughly 7 times more performance with the Tesla GPU. Of course, this comparison is valid only for the DigiCortex engine; other applications and other algorithms may see different results.

Now, of course, the Intel Xeon E5 is obviously not the optimal architecture for DigiCortex's large-scale thalamocortical simulation. This is primarily because the key steps of the spiking neural network simulation are "embarrassingly parallel" - a kind of task that is extremely well suited to GPU architectures. Intel is also working on an architecture better suited for these kinds of tasks with its "Knights Corner" or MIC (Many Integrated Core) architecture. If we could get hold of one or two of those, DigiCortex would also get a MIC port, and we could then see how Intel's MIC approach compares against NVIDIA Kepler for large-scale neural network simulations.
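
To make the "embarrassingly parallel" point concrete, here is a minimal sketch of a per-neuron update step mapped to one CUDA thread per neuron. The kernel name, parameters and the toy leaky-integrator update are assumptions for illustration only, not DigiCortex's actual kernels:

```cuda
// Toy per-neuron update: every neuron's state is advanced independently,
// so the work maps naturally onto thousands of CUDA threads.
__global__ void updateNeurons(float* v, const float* inputCurrent,
                              float leak, float dt, int numNeurons)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numNeurons)
        v[i] += (inputCurrent[i] - leak * v[i]) * dt;
}

// Host-side launch: the grid grows with the neuron count, which is why the
// larger simulations keep the GPU busy while the smallest one does not.
void stepSimulation(float* d_v, const float* d_input,
                    float leak, float dt, int numNeurons)
{
    const int blockSize = 256;
    const int gridSize  = (numNeurons + blockSize - 1) / blockSize;
    updateNeurons<<<gridSize, blockSize>>>(d_v, d_input, leak, dt, numNeurons);
}
```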

NOTE: All trademarks belong to their respective owners.