mlx5 connect on mlx5_1 failed: Connection timed out
Describe the bug
I'm running NGC's HPL benchmark from Slurm. When I ran HPL in an HPL container on two servers with 8 GPUs per node, I encountered a UCX error.
Steps to Reproduce
- Command line: Please see log file.
- UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v): Please see log file.
- Any UCX environment variables used
Setup and versions
- OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...): Please see log file.
  (cat /etc/issue or cat /etc/redhat-release, plus uname -a)
- For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
- For RDMA/IB/RoCE related issues: Please see log file.
  - Driver version: rpm -q rdma-core or rpm -q libibverbs, or MLNX_OFED version: ofed_info -s
  - HW information from the ibstat or ibv_devinfo -vv command
- For GPU related issues:
  - GPU type: H100
  - CUDA:
    - Drivers version: 12.2
    - Check if peer-direct is loaded (lsmod | grep nv_peer_mem) and/or gdrcopy (lsmod | grep gdrdrv): Please see log file.
Additional information (depending on the issue)
- OpenMPI version: 5.0.3
- Output of ucx_info -d to show transports and devices recognized by UCX: Please see log file.
@shinoharakazuya can you please post the output of the show_gids command, and check if setting UCX_IB_ROCE_LOCAL_SUBNET=y helps to resolve the issue?
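(For reference, a minimal sketch of the suggested check and re-run; the device list, perftest path, and server address are reused from later in this thread as placeholders, not a confirmed fix:)

# print the GID table to see which GIDs/IPs each port exposes
show_gids
# re-run with RoCE subnet matching enabled (set on both client and server)
UCX_IB_ROCE_LOCAL_SUBNET=y UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 2097152 -n 5000000 -e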
@jandres742 FYI
NOTE: This issue happens on an Nvidia internal cluster.
@yosefe I have the same issue.
client:
UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_TLS=rc UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 2097152 -n 5000000 -e
server:
UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest -e
[1737007020.457942] [node13:967927:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737007020.457946] [node13:967927:0] | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from host memory |
[1737007020.457947] [node13:967927:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737007020.457950] [node13:967927:0] | 0..514 | short | rc_mlx5/mlx5_0:1 |
[1737007020.457953] [node13:967927:0] | 515..4844 | zero-copy | rc_mlx5/mlx5_0:1 |
[1737007020.457955] [node13:967927:0] | 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_1:1 |
[1737007020.457958] [node13:967927:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737007021.503646] [node13:967927:a] ib_device.c:1332 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::a288:c2ff:feb4:87d7 flow_label=0xffffffff sgid_index=1 traffic_class=106) for RC DEVX QP connect on mlx5_1 failed: Connection timed out
[1737007021.503771] [node13:967927:0] libperf.c:1069 UCX ERROR error handler called with status -80 (Endpoint timeout)
[root@node12 ucx-1.18.0]# show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_0 1 0 fe80:0000:0000:0000:a288:c2ff:feb4:87e6 v1 ens2f0np0
mlx5_0 1 1 fe80:0000:0000:0000:a288:c2ff:feb4:87e6 v2 ens2f0np0
mlx5_0 1 2 0000:0000:0000:0000:0000:ffff:0a10:1d0c 10.16.29.12 v1 ens2f0np0
mlx5_0 1 3 0000:0000:0000:0000:0000:ffff:0a10:1d0c 10.16.29.12 v2 ens2f0np0
mlx5_1 1 0 fe80:0000:0000:0000:a288:c2ff:feb4:87e7 v1 ens2f1np1
mlx5_1 1 1 fe80:0000:0000:0000:a288:c2ff:feb4:87e7 v2 ens2f1np1
mlx5_2 1 0 fe80:0000:0000:0000:a288:c2ff:feb4:a562 v1 ens7f0np0
mlx5_2 1 1 fe80:0000:0000:0000:a288:c2ff:feb4:a562 v2 ens7f0np0
mlx5_2 1 2 0000:0000:0000:0000:0000:ffff:0a10:270c 10.16.39.12 v1 ens7f0np0
mlx5_2 1 3 0000:0000:0000:0000:0000:ffff:0a10:270c 10.16.39.12 v2 ens7f0np0
mlx5_3 1 0 fe80:0000:0000:0000:a288:c2ff:feb4:a563 v1 ens7f1np1
mlx5_3 1 1 fe80:0000:0000:0000:a288:c2ff:feb4:a563 v2 ens7f1np1
n_gids_found=12
@ivanallen mlx5_1 does not have an IP address, is that expected?
Yes, that is expected. We don't configure mlx5_1 and mlx5_3.
Seems like the test is being run on mlx5_1? Per the command above:
UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_TLS=rc UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 2097152 -n 5000000 -e
@yosefe Do you mean using mlx5_2? mlx5_1 has no IP address. I have the same problem if I use UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1.
server:
[root@node13 ucx-1.18.0]# UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 UCX_PROTO_ENABLE=y UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest -e
[1737024891.886842] [node13:2797867:0] perftest.c:800 UCX WARN CPU affinity is not set (bound to 96 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.16.29.12:52468
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: am bandwidth / message rate |
| Data layout: (automatic) |
| Send memory: host |
| Recv memory: host |
| Message size: 1048576 |
| Window size: 32 |
| AM header size: 0 |
+----------------------------------------------------------------------------------------------------------+
[1737024893.671591] [node13:2797867:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.671602] [node13:2797867:0] | perftest inter-node cfg#0 | active message by ucp_am_send* from host memory |
[1737024893.671606] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.671609] [node13:2797867:0] | 0..2038 | short | rc_mlx5/mlx5_0:1 |
[1737024893.671612] [node13:2797867:0] | 2039..8246 | copy-in | rc_mlx5/mlx5_0:1 |
[1737024893.671613] [node13:2797867:0] | 8247..29420 | multi-frag copy-in | rc_mlx5/mlx5_0:1 |
[1737024893.671616] [node13:2797867:0] | 29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.671619] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.671782] [node13:2797867:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.671786] [node13:2797867:0] | perftest inter-node cfg#0 | active message by ucp_am_send*(fast-completion) from host memory |
[1737024893.671788] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.671791] [node13:2797867:0] | 0..2038 | short | rc_mlx5/mlx5_0:1 |
[1737024893.671794] [node13:2797867:0] | 2039..8246 | copy-in | rc_mlx5/mlx5_0:1 |
[1737024893.671796] [node13:2797867:0] | 8247..22493 | multi-frag copy-in | rc_mlx5/mlx5_0:1 |
[1737024893.671798] [node13:2797867:0] | 22494..262143 | multi-frag zero-copy | rc_mlx5/mlx5_0:1 |
[1737024893.671801] [node13:2797867:0] | 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.671802] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672161] [node13:2797867:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672165] [node13:2797867:0] | perftest inter-node cfg#0 | active message by ucp_am_send*(multi) from host memory |
[1737024893.672166] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672171] [node13:2797867:0] | 0..514 | short | rc_mlx5/mlx5_0:1 |
[1737024893.672173] [node13:2797867:0] | 515..4844 | zero-copy | rc_mlx5/mlx5_0:1 |
[1737024893.672175] [node13:2797867:0] | 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672178] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672367] [node13:2797867:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672371] [node13:2797867:0] | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag from host memory |
[1737024893.672373] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672375] [node13:2797867:0] | 0..2030 | short | rc_mlx5/mlx5_0:1 |
[1737024893.672377] [node13:2797867:0] | 2031..8238 | copy-in | rc_mlx5/mlx5_0:1 |
[1737024893.672378] [node13:2797867:0] | 8239..29420 | multi-frag copy-in | rc_mlx5/mlx5_0:1 |
[1737024893.672381] [node13:2797867:0] | 29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672384] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672535] [node13:2797867:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672538] [node13:2797867:0] | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(fast-completion) from host memory |
[1737024893.672540] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672543] [node13:2797867:0] | 0..2030 | short | rc_mlx5/mlx5_0:1 |
[1737024893.672545] [node13:2797867:0] | 2031..8238 | copy-in | rc_mlx5/mlx5_0:1 |
[1737024893.672548] [node13:2797867:0] | 8239..22493 | multi-frag copy-in | rc_mlx5/mlx5_0:1 |
[1737024893.672551] [node13:2797867:0] | 22494..262143 | multi-frag zero-copy | rc_mlx5/mlx5_0:1 |
[1737024893.672554] [node13:2797867:0] | 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672556] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672739] [node13:2797867:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672742] [node13:2797867:0] | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from host memory |
[1737024893.672744] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672748] [node13:2797867:0] | 0..514 | short | rc_mlx5/mlx5_0:1 |
[1737024893.672752] [node13:2797867:0] | 515..4844 | zero-copy | rc_mlx5/mlx5_0:1 |
[1737024893.672755] [node13:2797867:0] | 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672756] [node13:2797867:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024894.680128] [node13:2797867:0] ib_device.c:1332 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.16.39.12 flow_label=0xffffffff sgid_index=3 traffic_class=106) for RC DEVX QP connect on mlx5_2 failed: Connection timed out
[1737024894.680178] [node13:2797867:0] libperf.c:1069 UCX ERROR error handler called with status -80 (Endpoint timeout)
[root@node13 ucx-1.18.0]#
client:
[root@node12 ucx-1.18.0]# UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 UCX_PROTO_ENABLE=y UCX_TLS=rc UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 1048576 -n 5000000 -e
[1737024666.944378] [node13:2783427:0] perftest.c:800 UCX WARN CPU affinity is not set (bound to 96 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[1737024667.122800] [node13:2783427:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.122811] [node13:2783427:0] | perftest inter-node cfg#0 | active message by ucp_am_send* from host memory |
[1737024667.122814] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.122817] [node13:2783427:0] | 0..2038 | short | rc_mlx5/mlx5_0:1 |
[1737024667.122819] [node13:2783427:0] | 2039..8246 | copy-in | rc_mlx5/mlx5_0:1 |
[1737024667.122822] [node13:2783427:0] | 8247..29420 | multi-frag copy-in | rc_mlx5/mlx5_0:1 |
[1737024667.122825] [node13:2783427:0] | 29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.122827] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.122978] [node13:2783427:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.122982] [node13:2783427:0] | perftest inter-node cfg#0 | active message by ucp_am_send*(fast-completion) from host memory |
[1737024667.122984] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.122987] [node13:2783427:0] | 0..2038 | short | rc_mlx5/mlx5_0:1 |
[1737024667.122990] [node13:2783427:0] | 2039..8246 | copy-in | rc_mlx5/mlx5_0:1 |
[1737024667.122993] [node13:2783427:0] | 8247..22493 | multi-frag copy-in | rc_mlx5/mlx5_0:1 |
[1737024667.122997] [node13:2783427:0] | 22494..262143 | multi-frag zero-copy | rc_mlx5/mlx5_0:1 |
[1737024667.122999] [node13:2783427:0] | 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123001] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123351] [node13:2783427:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123355] [node13:2783427:0] | perftest inter-node cfg#0 | active message by ucp_am_send*(multi) from host memory |
[1737024667.123356] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123360] [node13:2783427:0] | 0..514 | short | rc_mlx5/mlx5_0:1 |
[1737024667.123362] [node13:2783427:0] | 515..4844 | zero-copy | rc_mlx5/mlx5_0:1 |
[1737024667.123364] [node13:2783427:0] | 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123368] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123534] [node13:2783427:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123537] [node13:2783427:0] | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag from host memory |
[1737024667.123541] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123543] [node13:2783427:0] | 0..2030 | short | rc_mlx5/mlx5_0:1 |
[1737024667.123545] [node13:2783427:0] | 2031..8238 | copy-in | rc_mlx5/mlx5_0:1 |
[1737024667.123546] [node13:2783427:0] | 8239..29420 | multi-frag copy-in | rc_mlx5/mlx5_0:1 |
[1737024667.123550] [node13:2783427:0] | 29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123553] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123705] [node13:2783427:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123708] [node13:2783427:0] | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(fast-completion) from host memory |
[1737024667.123710] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123714] [node13:2783427:0] | 0..2030 | short | rc_mlx5/mlx5_0:1 |
[1737024667.123716] [node13:2783427:0] | 2031..8238 | copy-in | rc_mlx5/mlx5_0:1 |
[1737024667.123718] [node13:2783427:0] | 8239..22493 | multi-frag copy-in | rc_mlx5/mlx5_0:1 |
[1737024667.123720] [node13:2783427:0] | 22494..262143 | multi-frag zero-copy | rc_mlx5/mlx5_0:1 |
[1737024667.123723] [node13:2783427:0] | 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123727] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123900] [node13:2783427:0] +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123904] [node13:2783427:0] | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from host memory |
[1737024667.123906] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123909] [node13:2783427:0] | 0..514 | short | rc_mlx5/mlx5_0:1 |
[1737024667.123912] [node13:2783427:0] | 515..4844 | zero-copy | rc_mlx5/mlx5_0:1 |
[1737024667.123914] [node13:2783427:0] | 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123917] [node13:2783427:0] +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024720.960016] [node13:2783427:0] libperf.c:1069 UCX ERROR error handler called with status -80 (Endpoint timeout)
Can you try:
ping -I ens7f0np0 10.16.39.12 on node13?
Also, can you try adding UCX_IB_ROCE_LOCAL_SUBNET=y (to both client and server)?
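(A minimal sketch of that suggestion, reusing the command lines already posted in this thread; treat it as an assumption to test, not a verified fix:)

# server (node13)
UCX_IB_ROCE_LOCAL_SUBNET=y UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 UCX_PROTO_ENABLE=y ./install-release-mt/bin/ucx_perftest -e
# client (node12)
UCX_IB_ROCE_LOCAL_SUBNET=y UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 UCX_PROTO_ENABLE=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 1048576 -n 5000000 -e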
@yosefe Sorry, it looks like a network failure. I'll look into it myself first.
[root@localhost network-scripts]# ping -I ens7f0np0 10.16.39.12
PING 10.16.39.12 (10.16.39.12) from 10.16.39.13 ens7f0np0: 56(84) bytes of data.
From 10.16.39.13 icmp_seq=1 Destination Host Unreachable
From 10.16.39.13 icmp_seq=2 Destination Host Unreachable
From 10.16.39.13 icmp_seq=3 Destination Host Unreachable
From 10.16.39.13 icmp_seq=4 Destination Host Unreachable
From 10.16.39.13 icmp_seq=5 Destination Host Unreachable
[root@localhost network-scripts]# ping 10.16.39.12
PING 10.16.39.12 (10.16.39.12) 56(84) bytes of data.
From 10.16.39.13 icmp_seq=1 Destination Host Unreachable
From 10.16.39.13 icmp_seq=2 Destination Host Unreachable
From 10.16.39.13 icmp_seq=3 Destination Host Unreachable
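(Since both the direct and the interface-bound pings fail with "Destination Host Unreachable", a possible next diagnostic step, assuming standard iproute2 tools, is to inspect the route and ARP state for the peer:)

ip route get 10.16.39.12        # which interface/gateway the kernel selects for the peer
ip neigh show dev ens7f0np0     # whether ARP resolution for 10.16.39.12 succeeded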
@yosefe With UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 configured, it is now working, but only at 2*100 Gbps bandwidth.
However, in my other environment (425 Gbps), it also works properly without restricting UCX_NET_DEVICES to mlx5_0:1,mlx5_2:1, and reaches the full 425 Gbps bandwidth.
@ivanallen what is the network speed of each NIC (can be checked by ibstat or ibv_devinfo)? Does the other environment have more configured NICs?
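(For reference, assuming standard rdma-core tools are installed, the per-port link rate can be read with, e.g.:)

ibstat mlx5_0 | grep Rate                          # reports the active rate in Gb/s
ibv_devinfo -d mlx5_0 -v | grep -iE 'active_(width|speed)'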
Hi @yosefe, can we look at #10430 first? I suspect there is a problem with the conversion between bond and non-bond. For now, let's look at the bandwidth of the bonded environment first.