- Physical setup and prerequisites
- Understanding the technology
- Setting up a standard cluster
- Adding RoCE support
To perform a large computing task, for example simulating the complete fluid dynamics around a racing car, a single computer can be insufficient. To get around this, it is possible to group several servers together to form a computing cluster.
When doing so, a key to success is a good interconnection network between the different servers (we will call them the computing nodes). Otherwise, the time gained by adding computing power will be lost in increased communication times.
In this article, we will see how to set up a high performance cluster from scratch, using an interconnection technology called RoCE. The goal is to run applications using MPI.
Physical setup and prerequisites
Using RoCE requires a compliant NIC (Network Interface Card). In our case, they are Mellanox ConnectX-4 cards. The procedure described in this post should be similar for other cards made by Mellanox. The servers are OCP Leopard servers, each embedding 2 Intel Xeon CPUs. In this example, we have 12 of them.
The switch also has to be RoCE compliant. In our case, the hardware also comes from the OCP sphere: a 100 GbE Wedge switch, whose command-line interface is similar to that of Arista switches. All the servers are connected to this switch, with IPs ranging from 10.1.4.1 to 10.1.4.12.
If you have not bought your servers yet, here are a few guidelines. HPC tasks like CFD are often memory bound, which means that the most time-consuming operations are waiting for data to come from the RAM. Therefore, you should prioritize a processor with a high number of memory channels and high RAM frequency. The number of cores is not very important, since memory channels are often saturated with 3 cores per channel.
All the servers run on Linux and have Ubuntu 20.04 installed.
Understanding the technology
To understand the purpose of RoCE and why it provides excellent performance, we must dive a bit into the technologies involved.
In “normal” communication, a message that is sent, and which originally occupies some memory space, is transferred through different layers, each of which copies the message into its own memory and processes it before handing it over to the next layer. This layered architecture has many advantages from a programming perspective, but it implies a lot of copies and resource utilization. Even with very high throughput, a network using this standard architecture will have high latency and will not reach good clustering performance.
To bypass that, it is necessary to use a technique called RDMA (Remote Direct Memory Access). With RDMA, data is transferred directly from one server’s main memory to another’s, without any extra copy or intermediate buffer. This significantly reduces latency and means that the kernel does not have to bother copying data around.
Infiniband and RoCE
To perform this technique, new technologies and protocols had to be developed. One of them is Infiniband, which is both a physical specification of buses and connectors and a protocol specification. Infiniband is very effective and widespread in the TOP500, but it is also quite costly and complex to set up.
RoCE (RDMA over Converged Ethernet) is an attempt to get the best of both worlds: the performance of Infiniband and the ubiquity of Ethernet. It works by encapsulating the Infiniband protocol inside Ethernet (RoCE v1) or UDP/IP (RoCE v2) packets. This is the technology we will explore today.
Setting up a standard cluster
Before seeing how to create a RoCE cluster, we will set up a cluster that runs over standard TCP connections. Here are the different steps.
Set up passwordless SSH
The first thing to do is to set up passwordless SSH between all the nodes. SSH is a protocol and a program that lets you securely connect to a remote server and execute commands on it. It will be used by the host node to start processes on all the other nodes, so we have to make it work without prompting for a password.
To enable a passwordless connection, the receiving server needs to have the public key of the sender (typically ~/.ssh/id_rsa.pub) in its authorized_keys file (standard location: ~/.ssh/authorized_keys).

Two options are available. The first is to copy all the keys around into the different authorized_keys files. The second is to generate a single key pair that will be used by all the servers (just copy the id_rsa and id_rsa.pub files into every ~/.ssh/ folder). The authorized_keys file of all the nodes will then be identical, containing only that single public key.
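The second option can be sketched as follows. This is a minimal illustration, not a full setup: the key directory is a temporary stand-in for ~/.ssh, and you would still have to copy the key pair and authorized_keys to every node yourself.

```shell
# Sketch of the single-shared-key approach.
KEYDIR=$(mktemp -d)

# Generate one key pair with no passphrase; this pair would be copied
# to every node's ~/.ssh/ folder.
ssh-keygen -t rsa -N "" -f "$KEYDIR/id_rsa" -q

# The shared authorized_keys contains only this single public key.
cat "$KEYDIR/id_rsa.pub" > "$KEYDIR/authorized_keys"
chmod 600 "$KEYDIR/authorized_keys"

wc -l < "$KEYDIR/authorized_keys"   # prints 1: the single shared key
```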
Once this is done, you should be able to connect to the other servers through ssh without the terminal prompting you for a password.
Create a NFS shared directory
This step is actually optional, but greatly simplifies the task of setting up an MPI use case. To work, the MPI cluster needs to have the executable and data files on every server, at the same location. Therefore, it is either possible to copy these files to each server or to create a shared directory. The first solution gives better performance, but the second is easier to set up. See this tutorial to set up a NFS directory.
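As a rough sketch, and assuming the host node 10.1.4.1 exports a directory called /srv/cluster (both the path and the subnet are illustrative choices, not part of the tutorial above), the NFS setup looks like:

```shell
# On the NFS server (here 10.1.4.1):
sudo apt-get install nfs-kernel-server
sudo mkdir -p /srv/cluster
# Export the directory to the whole cluster subnet.
echo "/srv/cluster 10.1.4.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On every other node:
sudo apt-get install nfs-common
sudo mkdir -p /srv/cluster
sudo mount 10.1.4.1:/srv/cluster /srv/cluster
```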
Install and test MPI
Now it is time to put everything into practice. First, we will install an implementation of MPI. I went for OpenMPI, which can be installed with `apt-get install openmpi-bin`.
To run OpenMPI over multiple nodes, we need to tell the host where to find the other nodes. This is done with a host file that contains one line per server, with its IP and the number of slots on the machine. The number of slots would typically be the number of cores on the machine, but here we start with only one slot per node, just to try out the clustering.
Here’s a little script to generate the host file for a given number of slots per node:

```
rm -f hostfile
for ((i=1; i <= $NB_NODES; i++))
do
    echo "10.1.4.$i slots=$NB_SLOT_PER_NODE" >> hostfile
done
```
The result will be:

```
$ NB_NODES=12 NB_SLOT_PER_NODE=1 ./generate-hostfile.sh
$ cat hostfile
10.1.4.1 slots=1
10.1.4.2 slots=1
10.1.4.3 slots=1
10.1.4.4 slots=1
10.1.4.5 slots=1
10.1.4.6 slots=1
10.1.4.7 slots=1
10.1.4.8 slots=1
10.1.4.9 slots=1
10.1.4.10 slots=1
10.1.4.11 slots=1
10.1.4.12 slots=1
```
Now, let’s try mpirun:

```
mpirun --np 12 --hostfile /path/to/hostfile \
    --mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no" \
    hostname
```
Here’s the meaning of each argument:
- `--np` indicates the number of processes to start. Here I have 12 machines and want to start 1 process per machine, so 12 processes in total.
- `--hostfile` indicates the path to the above-mentioned host file.
- `--mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no"` is a little trick so that ssh doesn’t check whether the host is known.
- `hostname` is finally the name of the program to execute. Here, it is simply the standard Unix command, so every node prints its hostname.
My result looks like:

```
1c34da7f9a3a
1c34da7f9a42
1c34da7f9ac2
0c42a1198eaa
1c34da7f9bb2
1c34da5c5cc4
1c34da5c5cac
1c34da7f99e2
1c34da5c5e54
1c34da7f9bba
1c34da7f9bc2
1c34da7f99ea
```
At this point, you can try to execute your favorite HPC program. You should see rather poor results when using multiple nodes, because our interconnect is slow. It’s time to bring RoCE into action!
Adding RoCE support
Now that our basic cluster is working, let’s add RoCE support.
Installing the driver
The first thing to do is to install the Mellanox driver for the network interface card. You can get it from here. Then unzip the archive and execute `mlnxofedinstall` on every node of the cluster.
The exact code and options to pass to `mlnxofedinstall` may differ depending on the distribution and the driver version, but here’s what I ran:

```
VERSION=MLNX_OFED-5.5-18.104.22.168
DISTRO=ubuntu20.04-x86_64
wget "https://content.mellanox.com/ofed/$VERSION/$VERSION-$DISTRO.tgz" && \
    tar xvf "$VERSION-$DISTRO.tgz" && \
    "./$VERSION-$DISTRO/mlnxofedinstall" --add-kernel-support
```
Once it is done, you can verify the installed version of the driver with:

```
root@b8-ce-f6-fc-40-12:~$ ofed_info -s
MLNX_OFED_LINUX-5.6-22.214.171.124:
```
Now, we need to load two kernel modules that are needed for RDMA and RoCE exchanges. Do so by executing:

```
modprobe rdma_cm
modprobe ib_umad
```
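Note that modprobe only lasts until the next reboot. One common way to make the modules persistent (the file name roce.conf is an arbitrary choice of mine) is to declare them in /etc/modules-load.d, which systemd reads at boot:

```shell
# Declare the two modules so they are loaded at every boot.
printf "rdma_cm\nib_umad\n" | sudo tee /etc/modules-load.d/roce.conf
```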
Then, you can check that the RoCE devices are recognized with `ibv_devinfo`. Here’s what I have:
```
root@b8-ce-f6-fc-40-12:~$ ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         14.32.1010
        node_guid:                      b8ce:f603:00fc:4012
        sys_image_guid:                 b8ce:f603:00fc:4012
        vendor_id:                      0x02c9
        vendor_part_id:                 4117
        hw_ver:                         0x0
        board_id:                       MT_2470112034
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         14.32.1010
        node_guid:                      b8ce:f603:00fc:4013
        sys_image_guid:                 b8ce:f603:00fc:4012
        vendor_id:                      0x02c9
        vendor_part_id:                 4117
        hw_ver:                         0x0
        board_id:                       MT_2470112034
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
```
The output shows that only one of the ports is connected: the NIC has two physical ports, and the first one, mlx5_0, is in state PORT_ACTIVE, which means it is connected, while the second, mlx5_1, is marked PORT_DOWN. The transport layer is recognized as Infiniband, which is the sign that RoCE is enabled. Indeed, remember that RoCE is achieved by encapsulating Infiniband over UDP/IP.
Set the version of the RoCE protocol to v2 (`-d` is the device, `-p` the port and `-m` the version):

```
root@b8-ce-f6-fc-40-12:~$ cma_roce_mode -d mlx5_0 -p 1 -m 2
RoCE v2
```
Finally, you can see how the devices are associated with network interfaces with `ibdev2netdev`:

```
root@b8-ce-f6-fc-40-12:~$ ibdev2netdev
mlx5_0 port 1 ==> eth0 (Up)
mlx5_1 port 1 ==> ens1f1np1 (Down)
```
Test RoCE connection
To test the connection, execute `ib_send_lat --disable_pcie_relaxed -F` on one node (the server) and `ib_send_lat --disable_pcie_relaxed -F <ip-of-the-server>` on another one. On my cluster, latencies are around 2 microseconds. If yours are much higher, it might mean that RoCE is not working properly.
If you want to make sure that RoCE is being used (and that’s often useful), you can use a sniffer to track all the messages that are sent and received by a node. However, capturing RoCE traffic is not that easy, since it doesn’t go through the kernel. The procedure to do so is as follows:
- First, install Docker. Docker is a virtualization technology that allows you to start a program in a preconfigured environment shipped as a container. That’s what we will do with tcpdump.
- Now execute:

```
docker run -it --net=host --privileged \
    -v /dev/infiniband:/dev/infiniband -v /tmp/traces:/tmp/traces \
    mellanox/tcpdump-rdma bash
```
- Now, whenever you want to capture traffic, execute `tcpdump -i mlx5_0 -s 0 -w /tmp/traces/capture1.pcap`. You can then stop it with Ctrl+C and read the `/tmp/traces/capture1.pcap` file with Wireshark.
- Beware because the files can get huge very quickly. To limit the number of packets captured, add the `-c <count>` option to tcpdump.
MPI with RoCE
Now that RoCE is supported by the nodes, it is time to use it to speed up some applications. To do that, we will need to tell MPI to use RoCE. The good news is that Mellanox’s driver comes with a tuned version of OpenMPI, called HPC-X, that is made to use Mellanox NICs at their best.
Mellanox installs HPC-X at a non-standard location, so to use it we need to type:

```
$ export PATH=/usr/mpi/gcc/openmpi-4.1.2rc2/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.2rc2/lib:$LD_LIBRARY_PATH
```
mpirun should now refer to the HPC-X build shipped with the Mellanox driver:

```
$ which mpirun
/usr/mpi/gcc/openmpi-4.1.2rc2/bin/mpirun
```
All we need to do is add a little flag to the mpirun command (UCX is the library that supports the RoCE interconnect in OpenMPI):
```
mpirun --np 12 --hostfile /path/to/hostfile \
    --mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no" \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    hostname
```
At this point, nothing extraordinary should happen, since hostname doesn’t involve sending and receiving messages. To test the interconnect, we will use osu-micro-benchmarks. Now try:
```
$ export OSU="/usr/mpi/gcc/openmpi-4.1.2rc2/tests/osu-micro-benchmarks-5.6.2"
$ mpirun --np 12 --hostfile /path/to/hostfile \
    --mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no" \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    $OSU/osu_alltoall
```
You can use the previous method to make sure that RoCE packets are being sent. You can also compare the results with and without RoCE. However, the cluster is not really ready yet, and the results might actually be poor for the moment.
This is because the network still has a lossy behavior: some packets might be lost in transit. While this can be acceptable in some cases, it is not at all in an HPC environment. Therefore, we have to configure it a bit more.
Configure the network to a lossless behavior
The key point in getting a lossless behavior is to enable a mechanism called PFC (Priority Flow Control). The idea of this protocol is to differentiate levels of priority and adapt the behavior according to the priority. The option that interests us is to make one priority lossless. What will happen is that when a device (the switch or a node) becomes overwhelmed, it will tell the network to stop before its buffer is entirely full and packets get dropped.
A possible solution would be to activate this mode for every priority, but that would be a bit dirty. Here we will do things properly: activate lossless mode for only one priority and tell MPI to use it. There are two steps to obtain a lossless behavior:
- Enabling a priority tagging mechanism
- Activating lossless mode for the chosen priority
Both these steps need to be done on the switch and the servers. In this example, we will use priority 3.
Configuring the switch for lossless
The syntax for configuring switches can be quite different from one vendor to another. In my case, the Wedge switch has a syntax similar to Arista switches.
Once logged into the switch, first have a look at the interfaces’ status to know the range that should be configured:
```
wedge>show interfaces status
Port       Name   Status       Vlan     Duplex Speed  Type          Flags Encapsulation
Et1/1             connected    trunk    full   25G    100GBASE-CR4
Et1/2             connected    trunk    full   25G    100GBASE-CR4
Et1/3             connected    trunk    full   25G    100GBASE-CR4
Et1/4             connected    trunk    full   25G    100GBASE-CR4
Et2/1             connected    trunk    full   25G    100GBASE-CR4
Et2/2             connected    trunk    full   25G    100GBASE-CR4
Et2/3             connected    trunk    full   25G    100GBASE-CR4
Et2/4             connected    trunk    full   25G    100GBASE-CR4
Et3/1             connected    trunk    full   25G    100GBASE-CR4
Et3/2             connected    trunk    full   25G    100GBASE-CR4
Et3/3             connected    trunk    full   25G    100GBASE-CR4
Et3/4             connected    trunk    full   25G    100GBASE-CR4
Et4/1             notconnect   trunk    full   25G    Not Present
Et4/2             notconnect   trunk    full   25G    Not Present
Et4/3             notconnect   trunk    full   25G    Not Present
Et4/4             notconnect   trunk    full   25G    Not Present
[...]
```
Here, my range will be 1/1-3/4.
Now enter configuration mode:
```
wedge>enable
wedge#configure
wedge(config)#interface ethernet 1/1-3/4
wedge(config-if-Et1/1-3/4)#
```
First, we set the priority tagging mechanism to L3, which means that the DSCP field in the IP header will be used. This mode is slightly easier to set up than the L2 mechanism, which requires a VLAN.
```
wedge(config-if-Et1/1-3/4)#qos trust dscp
```
Now, we activate priority flow control and set priority 3 to the no drop mode:
```
wedge(config-if-Et1/1-3/4)#priority-flow-control on
wedge(config-if-Et1/1-3/4)#priority-flow-control priority 3 no-drop
```
The PFC status can be printed to make sure everything is set up correctly:
```
wedge(config-if-Et1/1-3/4)#show priority-flow-control
The hardware supports PFC on priorities 0 1 2 3 4 5 6 7
PFC receive processing is enabled on priorities 0 1 2 3 4 5 6 7
The PFC watchdog timeout is 0.0 second(s) (default)
The PFC watchdog recovery-time is 0.0 second(s) (auto) (default)
The PFC watchdog polling-interval is 0.0 second(s) (default)
The PFC watchdog action is errdisable
The PFC watchdog override action drop is false
The PFC watchdog non-disruptive priorities are not configured
The PFC watchdog non-disruptive action is not configured
The PFC watchdog port non-disruptive-only is false
The PFC watchdog hardware monitored priorities are not configured
Global PFC : Enabled

E: PFC Enabled, D: PFC Disabled, A: PFC Active, W: PFC Watchdog Enabled

Port    Status  Priorities  Action  Timeout  Recovery       Polling      Note
                                             Interval/Mode  Config/Oper
---------------------------------------------------------------------------------------
Et1/1   E A W   3           -       -        - / -          - / -        DCBX disabled
Et1/2   E A W   3           -       -        - / -          - / -        DCBX disabled
Et1/3   E A W   3           -       -        - / -          - / -        DCBX disabled
Et1/4   E A W   3           -       -        - / -          - / -        DCBX disabled
Et2/1   E A W   3           -       -        - / -          - / -        DCBX disabled
Et2/2   E A W   3           -       -        - / -          - / -        DCBX disabled
Et2/3   E A W   3           -       -        - / -          - / -        DCBX disabled
Et2/4   E A W   3           -       -        - / -          - / -        DCBX disabled
Et3/1   E A W   3           -       -        - / -          - / -        DCBX disabled
Et3/2   E A W   3           -       -        - / -          - / -        DCBX disabled
Et3/3   E A W   3           -       -        - / -          - / -        DCBX disabled
Et3/4   E A W   3           -       -        - / -          - / -        DCBX disabled
```
Configuring the servers for lossless
First, we configure the NIC by setting the trust mode (i.e. the priority tagging mechanism) to L3 and activating no-drop mode on priority 3:

```
$ sudo mlnx_qos -i eth0 --trust=dscp --pfc 0,0,0,1,0,0,0,0
```
The different options have the following meaning:

- `-i eth0` is the Ethernet interface linked to the RoCE device to configure. You can get the interface-to-device association with ibdev2netdev.
- `--trust=dscp` sets the tagging mechanism to L3.
- `--pfc` is followed by a series of 8 numbers indicating, for each priority from 0 to 7, whether it should be lossless.
Now we only have to tell MPI to use priority 3. To do so, we set the flag `UCX_IB_SL` (for Infiniband service level) to 3 and the flag `UCX_IB_TRAFFIC_CLASS` to 124. This second flag is used to fill the DSCP field. By default, priority 3 covers DSCP values 30 to 39, so here we target 31. Since the DSCP field is filled with `UCX_IB_TRAFFIC_CLASS`/4, we set it to 124.
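The arithmetic can be checked quickly: DSCP is the 6 high bits of the 8-bit traffic class byte (the 2 low bits are ECN), so dividing by 4 recovers the DSCP value:

```shell
# DSCP = traffic class / 4 (integer division drops the 2 ECN bits).
UCX_IB_TRAFFIC_CLASS=124
echo $((UCX_IB_TRAFFIC_CLASS / 4))   # prints 31, inside priority 3's range
```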
Here’s an example for the OSU benchmark:

```
$ mpirun --np 12 --hostfile /path/to/hostfile \
    --mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no" \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    -x UCX_IB_SL=3 \
    -x UCX_IB_TRAFFIC_CLASS=124 \
    $OSU/osu_alltoall
```
To wrap things up, let’s compare the 3 cluster setups we demonstrated in this post: TCP cluster, lossy RoCE and lossless RoCE. Below are the outputs of the `osu_alltoall` benchmark. Each time, 4 nodes, each using 28 cores, were used.
A few things are noticeable in this comparison:
- First, for small message sizes, RoCE provides a huge reduction in latency, almost by a factor of 10. This is thanks to the zero-copy and kernel-bypass mechanisms.
- For larger message sizes, the task is limited by bandwidth, so the difference between TCP and RoCE is smaller. However, RoCE is still twice as fast, and uses fewer CPU and RAM resources, so the impact would be amplified on a real HPC workload.
- At some point, lossy RoCE experiences terrible performance. Notice how, between message sizes of 8192 and 16384, the latency is suddenly multiplied by 30. This means that from this message size on, the network starts experiencing congestion. In a lossy setup, packets are simply dropped and the servers struggle to figure out what data was lost, resulting in huge bandwidth losses. The lossless setup, on the other hand, keeps a steady performance curve.