Our partner Aneo, a European leader in high-performance computing (HPC) consulting for finance and industry, released on March 29th a detailed study of the performance of our QH-1 (ex Q.rad) and O.mar services compared with traditional HPC infrastructures.
The article below is the English version of the original article in French posted on the Aneo website.
Are Qarnot heaters as efficient as traditional HPC machines?
Qarnot: HPC everywhere?
Cloud computing is changing the way we execute intensive computations. It could prove to be a very interesting alternative to a classical supercomputing approach.
Qarnot offers a particularly innovative HPC cloud solution: its infrastructure is based on “computer heaters” installed in offices, social housing and other buildings, which reuse the heat generated by the microprocessors.
With this approach, the compute services operated by Qarnot have a carbon footprint reduced by 75%.
As a partner of Qarnot, Aneo conducted a study of how this solution performs on typical HPC workloads.
In this article, we will first present Qarnot’s infrastructure and features and then the results of our analysis.
Qarnot’s main offer is the QH-1 (ex Q.rad) heater. Installed in individual houses or in offices, a QH-1 includes 4 Intel Core i7 processors (4 cores each, between 3.5 and 4 GHz, Ivy Bridge or Haswell architecture) and 16 GB of RAM. QH-1s have no local storage but can access disks located in the same building. The Qarnot platform totals about 5,000 cores today and will grow to 12,000 before the end of 2018.
Qarnot is particularly suited to batch HPC processing and to “embarrassingly parallel” applications, that is, applications with little or no dependency or communication between tasks, such as image processing or financial calculations, provided they fit within the available memory.
Qarnot is also developing solutions to access interconnected nodes using MPI:
- A “QH-1” profile, with standard QH-1s that are interconnected but with no guarantee on their geographic location. Network performance between these nodes is variable (latency in the 1 ms range) and generally not suited to data-intensive applications that need low latency and high throughput.
- An “O.mar” profile, based on a dedicated infrastructure where nodes are physically close and better connected (1 Gbps Ethernet, 20 µs of latency). These nodes are much better suited to HPC-type workloads. An O.mar can currently contain at most 64 nodes (256 cores), but this should change soon.
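To get a feel for what these latency figures mean in practice, here is a minimal sketch based on a simple latency-plus-bandwidth model of a point-to-point transfer. The function name, the message sizes and the assumption that both profiles offer 1 Gbps of bandwidth are ours, for illustration only (the figures above give a bandwidth for O.mar only). Only the 1 ms and 20 µs latencies come from the description above.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bps):
    """Estimated time in seconds to send one message: latency + size / bandwidth."""
    return latency_s + (message_bytes * 8) / bandwidth_bps


# Assumption: 1 Gbps for both profiles (only stated for O.mar above).
BANDWIDTH_BPS = 1e9
QH1_LATENCY_S = 1e-3     # "in the 1 ms range"
OMAR_LATENCY_S = 20e-6   # 20 microseconds

# Hypothetical message sizes, from a small halo exchange to a bulk transfer.
for size in (1_000, 100_000, 10_000_000):
    t_qh1 = transfer_time(size, QH1_LATENCY_S, BANDWIDTH_BPS)
    t_omar = transfer_time(size, OMAR_LATENCY_S, BANDWIDTH_BPS)
    print(f"{size:>10} bytes: QH-1 {t_qh1 * 1e3:8.3f} ms, "
          f"O.mar {t_omar * 1e3:8.3f} ms, ratio {t_qh1 / t_omar:5.1f}x")
```

Under this model, small messages are dominated by latency, so the gap between the two profiles is largest for applications that exchange many small messages. For large transfers the bandwidth term dominates and the two profiles converge.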
Interface and API
Qarnot offers different “profiles” depending on the type of use, from SaaS (Software as a Service) to PaaS (Platform as a Service). A Docker environment is available on the nodes to run computations in an isolated “container”.
In this study, we used a specific profile created by Qarnot that allows nodes to communicate with one another using MPI.
Qarnot also provides a REST API and a Python SDK to simplify the allocation of nodes and disks and to specify job parameters (input and output files, etc.):
The workflow and the billing can be monitored in real time through a web interface:
Network performance is an essential criterion for HPC. It matters even more here because each Qarnot device has few cores, so a large number of nodes is needed.
To this end, we measured the performance of a communication-intensive distributed application on both profiles offered by Qarnot: QH-1 and O.mar.
The measurements are first run on a single node, to evaluate the overall performance of the machines, and then on multiple nodes (using MPI), to evaluate the network. Each test uses one MPI process per node and as many threads as there are physical cores.
The code used in our benchmarks is SeWaS (Seismic Wave Simulator), an application developed by ANEO that simulates the propagation of seismic waves and is inspired by another application used by engineers at the BRGM (Bureau de recherches géologiques et minières). A key characteristic of SeWaS is that neighbouring cells exchange data at every iteration, which makes the application very sensitive to network latency.
SeWaS is implemented according to a task-based model on top of the PaRSEC runtime (Parallel Runtime Scheduling and Execution Controller). This framework schedules a task graph on distributed-memory architectures and can automatically overlap communications with computations.
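The benefit of hiding communication time behind computation, as a task-based runtime can do, can be illustrated with a deliberately idealized model. This sketch is not PaRSEC code and does not reflect its API. The function name and the timings are hypothetical, and real schedulers rarely achieve perfect overlap.

```python
def iteration_time(compute_s, comm_s, overlap):
    """Idealized time of one iteration.

    Without overlap, communication waits for computation to finish.
    With perfect overlap, both proceed at once and the slower one sets the pace.
    """
    if overlap:
        return max(compute_s, comm_s)
    return compute_s + comm_s


# Hypothetical per-iteration costs: 10 ms of computation, 1 ms of communication.
compute_s, comm_s = 10e-3, 1e-3
print(f"no overlap:   {iteration_time(compute_s, comm_s, overlap=False) * 1e3:.1f} ms")
print(f"with overlap: {iteration_time(compute_s, comm_s, overlap=True) * 1e3:.1f} ms")
```

In this model, communication is completely hidden as long as it takes less time than the computation it overlaps with. Once communication becomes the longer of the two, it sets the iteration time.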
1 – Single-node performance
We start by comparing a Qarnot node with a standard HPC socket, on a simple test case (16 million cells and 100 time steps).
Results are presented in millions of cells processed per second (Mcells/sec).
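As a rough illustration of this metric, the sketch below assumes that it counts every cell update over all time steps, divided by the elapsed wall-clock time. The exact formula is not spelled out here, so this is our reading. The function name is hypothetical, and the 20-second elapsed time is a placeholder, not a measured result.

```python
def throughput_mcells_per_s(n_cells, n_steps, elapsed_s):
    """Millions of cell updates per second, assuming every cell is updated at each step."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_cells * n_steps / elapsed_s / 1e6


# Test case size from the text: 16 million cells, 100 time steps.
# The elapsed time below is a made-up placeholder.
print(throughput_mcells_per_s(16_000_000, 100, elapsed_s=20.0))  # 80.0
```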
- For the same number of cores, the performance of Qarnot nodes is comparable to that of an HPC node of the same generation (Haswell). On a single node, SeWaS is limited by memory bandwidth, so a factor of about 2 is not surprising despite the slightly higher clock frequency of the Qarnot nodes.
- When running a single task, the standard QH-1 profile delivers the same performance as the O.mar profile, as expected given that the CPU specifications are the same.
2 – Inter-node performance
To assess the network’s performance in more detail, we measured both strong scaling and weak scaling. Strong scaling is how performance evolves as the number of nodes increases for a fixed problem size. Weak scaling also increases the problem size: in our case, the number of cells in the test case grows in proportion to the number of nodes.
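These two notions are commonly turned into an efficiency figure between 0 and 1. The sketch below uses the usual textbook definitions. The function names and all timings are hypothetical and are not measurements from this study.

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Fixed problem size: ideal time on n nodes is t_ref * n_ref / n."""
    return (t_ref * n_ref) / (t_n * n)


def weak_scaling_efficiency(t_ref, t_n):
    """Problem size grows with node count: ideal time stays equal to t_ref."""
    return t_ref / t_n


# Hypothetical timings in seconds.
# Strong scaling: same problem, 100 s on 1 node, 7 s on 16 nodes.
print(f"strong: {strong_scaling_efficiency(100.0, 1, 7.0, 16):.2f}")
# Weak scaling: 16x more cells on 16 nodes, 100 s on 1 node, 110 s on 16 nodes.
print(f"weak:   {weak_scaling_efficiency(100.0, 110.0):.2f}")
```

An efficiency of 1.0 corresponds to perfect scaling. Values below 1.0 reflect time lost to communication, synchronization or load imbalance.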
The following results were obtained on a medium-sized test case dimensioned for 64 cores (92 million cells, 100 time steps), which involves intensive computation and communication.
With the O.mar profile, performance is convincing at large scale. Performance on 2 nodes is comparable to that of a Xeon.
With the standard profile, the high latency proves to be a major concern: the total time increases with the number of nodes. This shows the benefit of using an O.mar profile for this kind of workload.
If scalability were perfect, the curve would be a straight line. Our measurements show good efficiency: above 90% up to 16 nodes and above 80% up to 48.
The final performance ratio in comparison to the Xeon node is presented below:
The study shows that the Qarnot approach is an interesting solution for HPC workloads assuming:
– The scale requirements are a match for Qarnot infrastructure
– The application requirements are within the platform limitations (memory)
– The inter-task communication requirements are within the limits of the network technology (1 Gbps interconnect for O.mar). In this case, it is possible to reach an efficiency of about 90%.
However, note that access to nodes interconnected with MPI is not publicly available at the moment: the company offers it only on a case-by-case basis, depending on the client’s projects and needs.
Qarnot is also working on new solutions such as a “boiler-computer”, which should bring more possibilities in terms of hardware (600 AMD Ryzen 7 cores) and network configuration (InfiniBand). Qarnot already offers this type of hardware through its partner Asperitas and will launch a first prototype in December 2018.
To find out more about our partner Aneo, follow this link
To visit the Qarnot website, follow this one