DigiCortex 0.98 - Support for Intel Haswell AVX2 Instructions

DigiCortex Engine v0.98 adds support for Advanced Vector Extensions 2 (AVX2) instructions, available on the just-launched Intel Haswell CPUs. In addition to AVX2, the DigiCortex engine now also uses Fused Multiply-Add (FMA) and Gather instructions on supported CPUs. FMA instructions are quite useful for updating synaptic receptor state when it is modeled with the Markram-Tsodyks model, since the updates of both the U and W variables consist of multiplications and additions. Gather instructions also suit the DigiCortex architecture, since several synaptic parameters are not stored consecutively in RAM.

Our understanding of the Haswell architecture, based on public data, is that Gather instructions are in fact microcoded and internally expanded into a series of load micro-ops (uOPs). This means that the gains from the VGATHER AVX2 instructions, as implemented in the Haswell microarchitecture, are not as high as theoretically possible. Likewise, FMA instructions sometimes bring no latency improvement in the current Haswell implementation. Further latency improvements are likely in future Intel microarchitectures.
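
To make the point about gathers concrete, here is a minimal conceptual sketch in Python of what an 8-lane gather computes, next to the same operation written as one load per lane. This is not DigiCortex code and not real SIMD: the function names, the `synaptic_params` array and the index values are all hypothetical, chosen only to illustrate why a gather that is expanded into individual loads cannot be much faster than issuing those loads directly.

```python
# Conceptual model of an 8-lane gather; plain Python, not real SIMD.
LANES = 8  # a 256-bit AVX2 register holds 8 single-precision floats


def gather(memory, indices):
    """Build one 8-element 'vector' from non-consecutive elements of memory."""
    assert len(indices) == LANES
    return [memory[i] for i in indices]


def gather_as_scalar_loads(memory, indices):
    """Same result, written as one explicit load per lane.

    This mirrors the description of a microcoded gather being expanded
    into a series of individual load operations.
    """
    result = [0.0] * LANES
    for lane in range(LANES):
        result[lane] = memory[indices[lane]]  # one load per lane
    return result


# Hypothetical data: a flat parameter array and scattered indices into it.
synaptic_params = [i * 0.5 for i in range(64)]
scattered = [3, 17, 42, 8, 63, 0, 25, 31]

vec = gather(synaptic_params, scattered)
```

Both functions return the same eight values. The single-instruction form mainly saves instruction count and register bookkeeping, not memory accesses.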

Test #1 - Overall Platform Performance (not normalized)

The purpose of this test is to show how successive Intel platforms compare to each other overall. However, since each platform available for testing is clocked differently and has different DDR3 memory bandwidth, the results are not suitable for anything other than a high-level performance outlook. We will dig deeper in the following tests and try to "equalize" the platforms so that CPU microarchitecture benefits can be compared directly. Please note that we will not show the performance of the Sandy Bridge EP (Xeon E5 2687W) system here, as it would not be directly comparable.

Test Description: 32768 Multi-Compartment Neurons, 1.8 million Synapses (2 receptors each)

| Platform Name | Num. CPUs | Num. Cores | CPU Speed (Turbo) | RAM Speed | Avg. Simulation Performance | Avg. Total Memory Bandwidth |
|---|---|---|---|---|---|---|
| Intel Haswell Core i7-4770 | 1 | 4 | 3900 MHz | DDR3 2400 MHz | 0.465x | 34.2 GiB/s (est.) |
| Apple Macbook Pro Retina 15", Intel Ivy Bridge Core i7-3720QM | 1 | 4 | 3400 MHz | DDR3 1600 MHz | 0.310x | 19.5 GiB/s |



As can be seen from the table above, the Intel Haswell platform behaves predictably and shows a big improvement: 0.465x versus 0.310x real-time, or about 50% faster. Clearly, this test cannot be used to compare actual microarchitecture performance, since too many parameters (DRAM speed, CPU frequency) differ. Therefore, we will try to make the platforms a bit more equal in the following tests...

One thing which is quite amazing is the potential performance of future Haswell EP/EX platforms. In the Sandy Bridge iteration, DigiCortex performance scaled almost linearly with the number of cores. If we assume that future Haswell EP/EX platforms behave the same way, the projected DigiCortex performance on a hypothetical dual 15-core Haswell EP workstation (30 cores in total) would be ~3.3x real-time for the 32K-neuron, 1.8M-synapse network, which is ~3x faster than the dual Sandy Bridge EP system! Of course, a 15-core Haswell EP Xeon is unlikely to clock as high as the consumer 4-core part due to prohibitively high TDP. However, if rumors are true, Haswell EP will bring DDR4 support, which is likely to affect DigiCortex execution much more.
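
As a rough illustration of the linear-scaling assumption, the sketch below extrapolates the 4-core Haswell result from Test #1 to 30 cores. The helper name `project_linear_scaling` is hypothetical, and the calculation deliberately ignores clock speed and memory bandwidth differences, so it should be read as an optimistic estimate rather than a prediction. It lands slightly above the ~3.3x figure quoted above.

```python
def project_linear_scaling(measured_perf, measured_cores, target_cores):
    """Extrapolate performance assuming it scales linearly with core count.

    Assumes identical per-core clocks and no memory-bandwidth bottleneck,
    which is unlikely to hold exactly on a real many-core part.
    """
    return measured_perf * target_cores / measured_cores


# Measured in Test #1: Core i7-4770, 4 cores, 0.465x real-time.
haswell_4core_perf = 0.465

# Hypothetical dual 15-core Haswell EP workstation (30 cores in total).
projected = project_linear_scaling(haswell_4core_perf, 4, 30)
print(f"Projected performance: {projected:.2f}x real-time")
```

Scaling the inputs down, for example with a lower per-core figure to reflect reduced clocks on a 15-core part, moves the estimate toward the more conservative number.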



Test #2 - Haswell vs. Ivy Bridge on more equal grounds (equal CPU and DDR3 clocks)

To better estimate the IPC improvements, Test #2 fixes the CPU frequency and DDR3 speed. We limited the Haswell Core i7-4770 maximum frequency to 3.4 GHz (the maximum all-core turbo of the Core i7-3720QM) and downclocked our DDR3 memory to 1600 MHz, matching the timings of the DDR3 modules found in the 2012 Apple Macbook Pro 15" with Retina display. The results look like this:

Test Description: 32768 Multi-Compartment Neurons, 1.8 million Synapses (2 receptors each)

| Platform Name | Num. CPUs | Num. Cores | CPU Speed (Actual) | RAM Speed | Avg. Simulation Performance | Avg. Total Memory Bandwidth |
|---|---|---|---|---|---|---|
| Intel Haswell Core i7-4770 | 1 | 4 | 3400 MHz (turbo disabled) | DDR3 1600 MHz | 0.380x | 22.5 GiB/s (est.) |
| Apple Macbook Pro Retina 15", Intel Ivy Bridge Core i7-3720QM | 1 | 4 | 3400 MHz | DDR3 1600 MHz | 0.310x | 19.5 GiB/s |

When DDR3 and CPU clocks are fixed and set equal on both the Haswell and Ivy Bridge platforms, the DigiCortex engine is ~22% faster clock-for-clock. These results are promising, as they already exceed the typical IPC gains we observed in the Sandy Bridge to Ivy Bridge generation jump. The improvement is most likely related to AVX2 instruction set usage in the Haswell build, although we are still not reaching the maximum theoretical IPC gains that AVX2 could deliver. Even so, in hard memory-bound scenarios such as biological network simulations, gaining 22% clock-for-clock is pretty impressive for a generation jump!
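
For reference, the clock-for-clock figure can be reproduced directly from the Test #2 table. The short sketch below computes the relative gain in simulation performance and in measured memory bandwidth. The helper `relative_gain` and the dictionary layout are just illustrative choices; the numbers are the ones from the table, with the Haswell bandwidth being an estimate.

```python
def relative_gain(new, old):
    """Return the fractional improvement of new over old (0.10 means +10%)."""
    return new / old - 1.0


# Values from the Test #2 table (both platforms at 3400 MHz, DDR3 1600 MHz).
haswell = {"perf": 0.380, "bandwidth_gib_s": 22.5}  # bandwidth is an estimate
ivy_bridge = {"perf": 0.310, "bandwidth_gib_s": 19.5}

perf_gain = relative_gain(haswell["perf"], ivy_bridge["perf"])
bw_gain = relative_gain(haswell["bandwidth_gib_s"], ivy_bridge["bandwidth_gib_s"])

print(f"Simulation performance gain: {perf_gain:.1%}")
print(f"Memory bandwidth gain:       {bw_gain:.1%}")
```

The performance gain works out to a little over 22%, while achieved memory bandwidth rises by roughly 15% at identical DDR3 settings.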