
## Setting up a RoCE cluster

by Alexis de la Fournière - November 9, 2022

## Introduction

To perform a large computing task, for example simulating the whole fluid dynamics around a racing car, a single computer can be insufficient. To get around this, it is possible to group several servers together to form a computing cluster.

When doing so, a key to success is having a good interconnection network between the different servers (we will call them the computing nodes). Otherwise, the time gained by adding computing power will be lost in increased communication times.

In this article, we will see how to set up a high-performance cluster from scratch, using an interconnection technology called RoCE. The goal is to run applications using MPI.

## Physical setup and prerequisites

### Hardware

Using RoCE requires a compliant NIC (Network Interface Card). In our case, they are Mellanox ConnectX-4 cards. The procedure described in this post should be similar for other cards made by Mellanox. The servers used are OCP Leopard servers, each embedding 2 Intel Xeon CPUs. In this example, we have 12 of them.

The switch also has to be RoCE compliant. In our case, the hardware also comes from the OCP sphere and is a 100 GbE wedge. Its command line interface is similar to that of Arista switches. All the servers are connected to this switch. Their IPs range from 10.1.4.1 to 10.1.4.12.

If you have not bought your servers yet, here are a few guidelines. HPC tasks like CFD are often memory bound, which means that the most time-consuming operations are waiting for data to come from the RAM. Therefore, you should prioritize a processor with a high number of memory channels and high RAM frequency. The number of cores is not very important, since memory channels are often saturated with 3 cores per channel.
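As a rough order of magnitude, peak memory bandwidth is channels × transfer rate × bus width. Here is a quick sanity check for a hypothetical 6-channel DDR4-2933 CPU (the numbers are illustrative, not the spec of the servers used here):

```shell
# Rough peak memory bandwidth: channels * MT/s * bytes per transfer.
# Illustrative values for a hypothetical 6-channel DDR4-2933 CPU.
CHANNELS=6
RATE_MTS=2933          # mega-transfers per second
BYTES_PER_TRANSFER=8   # 64-bit memory bus
echo "$((CHANNELS * RATE_MTS * BYTES_PER_TRANSFER)) MB/s"   # prints "140784 MB/s", i.e. ~141 GB/s
```

Dividing that figure by the aggregate demand of your cores gives a feel for how quickly the channels saturate.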

### Software

All the servers run on Linux and have Ubuntu 20.04 installed.

## Understanding the technology

To understand the purpose of RoCE and why it provides excellent performance, we must dive a bit into the technologies at stake.

### RDMA

With “normal” communication, a message that is sent is transferred through different layers, each of which copies the message into its own memory and processes it before handing it over to the next layer. This layered architecture has a lot of advantages from a programming perspective, but it implies a lot of copies and resource utilization. Even with very high throughput, a network that uses this standard architecture will have a high latency and will not reach good performance in clustering.

To bypass that, a technique called RDMA (Remote Direct Memory Access) is needed. With RDMA, data is transferred directly from one server's main memory to another's, without any extra copy or intermediate buffer. This significantly reduces latency and means that the kernel does not have to bother copying data around.

### Infiniband and RoCE

To perform this technique, new technologies and protocols had to be developed. One of those was InfiniBand, which is both a physical specification of buses and connectors and a protocol specification. InfiniBand is very effective and widespread in the TOP500, but it is also quite costly and complex to set up.

RoCE (RDMA over Converged Ethernet) is an attempt to get the best of both worlds: the performance of InfiniBand and the popularity of Ethernet. It works by encapsulating the InfiniBand protocol inside Ethernet (RoCE v1) or UDP/IP (RoCE v2) packets. This is the technology that we will discover today.

## Setting up a standard cluster

Before seeing how to create a RoCE cluster, we will set up a cluster that runs over standard TCP connections. Here are the different steps.

### Set up passwordless SSH

The first thing to do is to set up passwordless SSH between all the nodes. SSH is a protocol and a program that lets you safely connect to a remote server and execute commands on it. It will be used by the host node to start processes on all the other nodes, so we have to make it work without asking for a password.

To enable a passwordless connection, the receiving server needs to have the public key of the sender (typically ~/.ssh/id_rsa.pub) in its authorized_keys file (standard location: ~/.ssh/authorized_keys).

There are basically two options. The first is to copy all the keys around into the different authorized_keys files. The second is to generate a single key pair that will be used by all the servers (just copy the id_rsa and id_rsa.pub files into every ~/.ssh/ folder). The authorized_keys file of every node is then the same, and only contains that single public key.
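The shared-key-pair option can be sketched as follows. This is a minimal sketch: the NODES variable is a placeholder you fill with your node IPs (here, 10.1.4.1 through 10.1.4.12), and it assumes you can still reach the other nodes with a password:

```shell
# Generate one key pair (no passphrase) to be shared by all nodes.
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Authorize it locally; the same authorized_keys file will be valid on every node.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Push the key pair and authorized_keys to the other nodes.
# NODES is a placeholder, e.g. NODES="10.1.4.1 10.1.4.2 ... 10.1.4.12".
for host in $NODES; do
    scp ~/.ssh/id_rsa ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys "$host:~/.ssh/"
done
```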

Once this is done, you should be able to connect through ssh to the other servers, without the terminal prompting you for a password.

### Create an NFS shared directory

This step is actually optional but greatly simplifies the task of setting up an MPI use case. To work, the MPI cluster needs to have the executable and data files on every server, in the same location. It is therefore possible either to copy these files to each server or to create a shared directory. The first solution gives better performance, but the second is easier to set up. See this tutorial to set up an NFS directory.
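As a minimal sketch of the NFS setup (the /home/cluster path, the server IP 10.1.4.1 and the 10.1.4.0/24 subnet are example values to adapt), the server exports the directory in /etc/exports and each client mounts it, for instance via /etc/fstab:

```
# /etc/exports on the NFS server (example path and subnet):
/home/cluster 10.1.4.0/24(rw,sync,no_subtree_check)

# /etc/fstab on each client, so the share is mounted at boot
# (10.1.4.1 is assumed to be the NFS server here):
10.1.4.1:/home/cluster  /home/cluster  nfs  defaults  0  0
```

After editing /etc/exports, run `exportfs -a` on the server to apply the change.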

### Install and test MPI

Now it is time to put everything into practice. First, we will install an implementation of MPI. I went for OpenMPI, which can be installed with `apt-get install openmpi-bin`.

To run OpenMPI over multiple nodes, we need to tell the host where to find the other nodes. This is done with a host file that contains one line per server, with its IP and the number of slots on the machine. The number of slots would typically be the number of cores on the machine, but here we can start with only one slot per node, in order to try out the clustering.

Here’s a little script to generate the host file for a given number of slots per node:

```
rm -f hostfile
for ((i=1; i <= $NB_NODES; i++)); do
    echo "10.1.4.$i slots=$NB_SLOT_PER_NODE" >> hostfile
done
```

The result will be:

```
$ NB_NODES=12 NB_SLOT_PER_NODE=1 ./generate-hostfile.sh
$ cat hostfile
10.1.4.1 slots=1
10.1.4.2 slots=1
10.1.4.3 slots=1
10.1.4.4 slots=1
10.1.4.5 slots=1
10.1.4.6 slots=1
10.1.4.7 slots=1
10.1.4.8 slots=1
10.1.4.9 slots=1
10.1.4.10 slots=1
10.1.4.11 slots=1
10.1.4.12 slots=1
```

Now, let’s try mpirun:

```
mpirun --np 12 --hostfile /path/to/hostfile \
    --mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no" \
    hostname
```

Here’s the meaning of each argument:

• --np indicates the number of processes to start. Here I have 12 machines and I want to start 1 process per machine, so 12 processes in total.
• --hostfile indicates the path to the above-mentioned host file.
• --mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no" is a little trick so that ssh doesn’t check whether the host is known.
• hostname is finally the name of the program that will be executed; each node simply prints its hostname.

My result looks like:

```
1c34da7f9a3a
1c34da7f9a42
1c34da7f9ac2
0c42a1198eaa
1c34da7f9bb2
1c34da5c5cc4
1c34da5c5cac
1c34da7f99e2
1c34da5c5e54
1c34da7f9bba
1c34da7f9bc2
1c34da7f99ea
```

At this point, you can try to execute your favorite HPC program. You should see rather poor results when using multiple nodes, because our interconnect is slow. It’s time to bring RoCE into action!

## Adding RoCE support

Now that our basic cluster is working, let’s add RoCE support.

### Installing the driver

The first thing to do is to install the Mellanox driver for the network interface card. You can get it from here. Then, you’ll have to unzip the archive and execute mlnxofedinstall on every node of the cluster. The exact commands and options to pass to mlnxofedinstall may differ depending on the distribution and the driver version, but here’s what I ran:

```
VERSION=MLNX_OFED-5.5-1.0.3.2
DISTRO=ubuntu20.04-x86_64
wget "https://content.mellanox.com/ofed/$VERSION/$VERSION-$DISTRO.tgz" && \
    tar xvf "$VERSION-$DISTRO.tgz" && \
    "./$VERSION-$DISTRO/mlnxofedinstall" --add-kernel-support
```

Once it is done, you can verify the installed version of the driver with:

```
root@b8-ce-f6-fc-40-12:~$ ofed_info -s
MLNX_OFED_LINUX-5.6-2.0.9.0:
```

Now, we need to activate two kernel modules that are needed for RDMA and RoCE exchanges. Do so by executing:

```
modprobe rdma_cm ib_umad
```

Then, you can check that RoCE devices are recognized with ibv_devinfo. Here’s what I have:

```
root@b8-ce-f6-fc-40-12:~$ ibv_devinfo
hca_id: mlx5_0
        transport:              InfiniBand (0)
        fw_ver:                 14.32.1010
        node_guid:              b8ce:f603:00fc:4012
        sys_image_guid:         b8ce:f603:00fc:4012
        vendor_id:              0x02c9
        vendor_part_id:         4117
        hw_ver:                 0x0
        board_id:               MT_2470112034
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet

hca_id: mlx5_1
        transport:              InfiniBand (0)
        fw_ver:                 14.32.1010
        node_guid:              b8ce:f603:00fc:4013
        sys_image_guid:         b8ce:f603:00fc:4012
        vendor_id:              0x02c9
        vendor_part_id:         4117
        hw_ver:                 0x0
        board_id:               MT_2470112034
        phys_port_cnt:          1
                port:   1
                        state:          PORT_DOWN (1)
                        max_mtu:        4096 (5)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet
```
The output shows that only one of the ports is connected: the NIC has two physical ports; the first one, mlx5_0, is PORT_ACTIVE, which means it is connected, while the second, mlx5_1, is marked PORT_DOWN. The transport is recognized as InfiniBand, which is the sign that RoCE is enabled. Indeed, remember that RoCE is achieved by encapsulating InfiniBand over UDP/IP.
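On a large cluster, it can be handy to summarize just the port states. The filter below is fed with a captured two-line sample so it can be tried without the hardware; on the servers, replace the here-document with a real `ibv_devinfo |` pipe:

```shell
# Extract device ids and port states from ibv_devinfo-style output.
# The here-document stands in for `ibv_devinfo |` so the filter runs anywhere.
awk '/hca_id:/ {dev=$2} /state:/ {print dev, $2}' <<'EOF'
hca_id: mlx5_0
        state: PORT_ACTIVE (4)
hca_id: mlx5_1
        state: PORT_DOWN (1)
EOF
# prints:
#   mlx5_0 PORT_ACTIVE
#   mlx5_1 PORT_DOWN
```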

Set the version of the RoCE protocol to v2 (-d is the device, -p the port and -m the version):

```
root@b8-ce-f6-fc-40-12:~$ cma_roce_mode -d mlx5_0 -p 1 -m 2
RoCE v2
```

Finally, you can see how the devices are associated with network interfaces with ibdev2netdev:

```
root@b8-ce-f6-fc-40-12:~$ ibdev2netdev
mlx5_0 port 1 ==> eth0 (Up)
mlx5_1 port 1 ==> ens1f1np1 (Down)
```

### Test RoCE connection

To test the connection, execute `ib_send_lat --disable_pcie_relaxed -F` on one node (the server) and `ib_send_lat --disable_pcie_relaxed -F <ip-of-the-server>` on another. On my cluster, I get latencies around 2 microseconds. If yours are a lot higher, it might mean that RoCE is not working properly.

If you want to make sure that RoCE is being used (and that’s often useful), you can use a sniffer to track all the messages sent and received by a node. However, capturing RoCE traffic is not that easy, since it doesn’t go through the kernel. The procedure is as follows:

• First, install Docker. Docker is a virtualization technology that allows you to start a program in a preconfigured environment shipped as a container. That’s what we will do with tcpdump.
• Now execute:

```
docker run -it --net=host --privileged \
    -v /dev/infiniband:/dev/infiniband -v /tmp/traces:/tmp/traces \
    mellanox/tcpdump-rdma bash
```

• Whenever you want to capture traffic, execute `tcpdump -i mlx5_0 -s 0 -w /tmp/traces/capture1.pcap`. You can then stop it with Ctrl+C and read the /tmp/traces/capture1.pcap file with Wireshark.
• Beware, the files can get huge very quickly. To limit the number of packets captured, add the -c <count> option.

### MPI with RoCE

Now that RoCE is supported by the nodes, it is time to use it to speed up some applications. To do that, we will need to tell MPI to use RoCE. The good news is that Mellanox’s driver comes with a tuned version of OpenMPI, called HPC-X, that is made to use Mellanox NICs to their best.

Mellanox installs HPC-X at a non-standard location, so to use it we need to type:

```
$ export PATH=/usr/mpi/gcc/openmpi-4.1.2rc2/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.2rc2/lib:$LD_LIBRARY_PATH
```

mpirun should now refer to HPC-X as shipped with the Mellanox driver:

```
$ which mpirun
/usr/mpi/gcc/openmpi-4.0.2rc3/bin/mpirun
```

All we need to do is add a little flag to the mpirun command (UCX is the library that supports the RoCE interconnect in OpenMPI):

```
mpirun --np 12 --hostfile /path/to/hostfile \
    --mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no" \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    hostname
```

At this point, nothing extraordinary should happen, since hostname doesn’t involve sending and receiving messages. To test the interconnection, we will use osu-micro-benchmarks. Now try:

```
$ export OSU="/usr/mpi/gcc/openmpi-4.1.2rc2/tests/osu-micro-benchmarks-5.6.2"
$ mpirun --np 12 --hostfile /path/to/hostfile \
    --mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no" \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    $OSU/osu_alltoall
```

You can use the previous method to make sure that RoCE packets are sent. You can also compare the results with and without RoCE. However, the cluster is not really ready yet, and the results might actually be poor for the moment.

This is because the network still has a lossy behavior: some packets might be lost in transit. While this can be acceptable in some cases, it is not at all acceptable in an HPC environment. Therefore, we have to configure it a bit more.

### Configure the network to a lossless behavior

The key point in getting a lossless behavior is to enable a mechanism called PFC (Priority Flow Control). The idea of this protocol is to differentiate levels of priority and adapt the behavior depending on the priority. The option that interests us is to make one priority lossless. When a device (the switch or a node) becomes overwhelmed, it will then tell the network to stop sending before its buffer is entirely full and packets get lost.

A possible solution would be to activate this mode for every priority, but that would be a bit dirty. Here we will do things properly: activate lossless mode for only one priority and tell MPI to use it. There are two steps to obtain a lossless behavior:

• Enable a priority tagging mechanism
• Activate lossless mode for the correct priority

Both these steps need to be done on the switch and the servers. In this example, we will use priority 3.

#### Configuring the switch for lossless

The syntax for configuring a switch can be quite different from one vendor to another. In my case, the wedge switch has a syntax similar to Arista switches.

Once logged in the switch, first have a look at the interfaces’ status, to know the range that should be configured:

```
wedge>show interfaces status
Port       Name   Status        Vlan     Duplex Speed   Type         Flags Encapsulation
Et1/1             connected     trunk    full   25G    100GBASE-CR4
Et1/2             connected     trunk    full   25G    100GBASE-CR4
Et1/3             connected     trunk    full   25G    100GBASE-CR4
Et1/4             connected     trunk    full   25G    100GBASE-CR4
Et2/1             connected     trunk    full   25G    100GBASE-CR4
Et2/2             connected     trunk    full   25G    100GBASE-CR4
Et2/3             connected     trunk    full   25G    100GBASE-CR4
Et2/4             connected     trunk    full   25G    100GBASE-CR4
Et3/1             connected     trunk    full   25G    100GBASE-CR4
Et3/2             connected     trunk    full   25G    100GBASE-CR4
Et3/3             connected     trunk    full   25G    100GBASE-CR4
Et3/4             connected     trunk    full   25G    100GBASE-CR4
Et4/1             notconnect    trunk    full   25G    Not Present
Et4/2             notconnect    trunk    full   25G    Not Present
Et4/3             notconnect    trunk    full   25G    Not Present
Et4/4             notconnect    trunk    full   25G    Not Present
[...]
```

Here, my range will be 1/1-3/4.

Now enter configuration mode:

```
wedge>enable
wedge#configure
wedge(config)#interface ethernet 1/1-3/4
wedge(config-if-Et1/1-3/4)#
```

First, we will set the priority tagging mechanism to L3, which means that the DSCP field in the IP header will be used. This mode is slightly easier to set up than the L2 mechanism, which requires the use of a VLAN.

```
wedge(config-if-Et1/1-3/4)#qos trust dscp
```

Now, we activate priority flow control and set priority 3 to no-drop mode:

```
wedge(config-if-Et1/1-3/4)#priority-flow-control on
wedge(config-if-Et1/1-3/4)#priority-flow-control priority 3 no-drop
```

The PFC status can be printed to make sure everything is set up correctly:

```
wedge(config-if-Et1/1-3/4)#show priority-flow-control
The hardware supports PFC on priorities 0 1 2 3 4 5 6 7
PFC receive processing is enabled on priorities 0 1 2 3 4 5 6 7
The PFC watchdog timeout is 0.0 second(s) (default)
The PFC watchdog recovery-time is 0.0 second(s) (auto) (default)
The PFC watchdog polling-interval is 0.0 second(s) (default)
The PFC watchdog action is errdisable
The PFC watchdog override action drop is false
The PFC watchdog non-disruptive priorities are not configured
The PFC watchdog non-disruptive action is not configured
The PFC watchdog port non-disruptive-only is false
The PFC watchdog hardware monitored priorities are not configured
Global PFC : Enabled

E: PFC Enabled, D: PFC Disabled, A: PFC Active, W: PFC Watchdog Enabled
Port  Status Priorities Action      Timeout  Recovery        Polling         Note
                                             Interval/Mode   Config/Oper
---------------------------------------------------------------------------------------
Et1/1      E A W      3          -           -        - / -           - / -           DCBX disabled
Et1/2      E A W      3          -           -        - / -           - / -           DCBX disabled
Et1/3      E A W      3          -           -        - / -           - / -           DCBX disabled
Et1/4      E A W      3          -           -        - / -           - / -           DCBX disabled
Et2/1      E A W      3          -           -        - / -           - / -           DCBX disabled
Et2/2      E A W      3          -           -        - / -           - / -           DCBX disabled
Et2/3      E A W      3          -           -        - / -           - / -           DCBX disabled
Et2/4      E A W      3          -           -        - / -           - / -           DCBX disabled
Et3/1      E A W      3          -           -        - / -           - / -           DCBX disabled
Et3/2      E A W      3          -           -        - / -           - / -           DCBX disabled
Et3/3      E A W      3          -           -        - / -           - / -           DCBX disabled
Et3/4      E A W      3          -           -        - / -           - / -           DCBX disabled
```

#### Configuring the servers for lossless

First, we configure the NIC by setting the trust mode (i.e. the priority tagging mechanism used) to L3 and activating no-drop mode on priority 3 with:

```
$ sudo mlnx_qos -i eth0 --trust=dscp --pfc 0,0,0,1,0,0,0,0
```

The different options have the following meaning:

• -i eth0 is the Ethernet interface linked to the RoCE device to configure. You can get the interface-to-device association with ibdev2netdev.
• --trust=dscp sets the tagging to L3.
• --pfc is followed by a series of 8 numbers indicating, for each priority from 0 to 7, whether it should be enabled.

Now we only have to tell MPI to use priority 3. To do so, we set the flag UCX_IB_SL (InfiniBand service level) to 3 and the flag UCX_IB_TRAFFIC_CLASS to 124. This second flag is used to fill the DSCP field, which is set to UCX_IB_TRAFFIC_CLASS/4. By default, DSCP values 24 to 31 map to priority 3, so here we target DSCP 31 and therefore set the traffic class to 124. Here’s an example for the OSU benchmark:

```
$ mpirun --np 12 --hostfile /path/to/hostfile \
    --mca plm_rsh_agent "ssh -q -o StrictHostKeyChecking=no" \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    -x UCX_IB_SL=3 \
    -x UCX_IB_TRAFFIC_CLASS=124 \
    $OSU/osu_alltoall
```
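The arithmetic behind those two flags can be checked quickly. Assuming the default Mellanox mapping where priority = DSCP/8, and knowing that the NIC writes UCX_IB_TRAFFIC_CLASS/4 into the DSCP field:

```shell
# UCX_IB_TRAFFIC_CLASS is written to the IP header as traffic_class/4 = DSCP.
TRAFFIC_CLASS=124
DSCP=$((TRAFFIC_CLASS / 4))
# With --trust=dscp and the default mapping, priority = DSCP / 8.
PRIORITY=$((DSCP / 8))
echo "DSCP=$DSCP priority=$PRIORITY"   # prints "DSCP=31 priority=3"
```

So a traffic class of 124 lands the RoCE traffic exactly on the no-drop priority 3 configured above.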

## Conclusion

To wrap things up, let’s compare the 3 cluster setups we demonstrated in this post: TCP cluster, lossy RoCE, and lossless RoCE. Below are the outputs of the osu_alltoall benchmark. Each time, 4 nodes with 28 cores each were used.

A few things are noticeable in this comparison:

• First, for small message sizes, RoCE provides a huge improvement in latency, almost by a factor of 10. This is thanks to the zero-copy and kernel-bypass mechanisms.
• For larger message sizes, the task is limited by the bandwidth, so the difference between TCP and RoCE is smaller. However, RoCE is still 2 times quicker and uses fewer CPU and RAM resources, so the impact would be amplified on a real HPC workload.
• At some point, lossy RoCE experiences terrible performance. Notice how between message sizes of 8192 and 16384, the latency is suddenly multiplied by 30. This means that from this message size on, the network starts experiencing congestion. In a lossy setup, packets are simply lost and the servers struggle to know what data was lost, resulting in huge bandwidth losses. The lossless setup, on the other hand, keeps a steady performance curve.