mlx5 connect on mlx5_1 failed: Connection timed out

Open shinoharakazuya opened this issue 1 year ago • 4 comments

Describe the bug

I'm running NGC's hpl benchmark test from Slurm. When I ran hpl in an hpl container on two servers with 8 GPUs per node, I encountered a UCX error.

Steps to Reproduce

  • Command line: Please see log file.
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v): Please see log file.
  • Any UCX environment variables used

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...): Please see log file.
    • cat /etc/issue or cat /etc/redhat-release + uname -a
    • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
  • For RDMA/IB/RoCE related issues: Please see log file.
    • Driver version:
      • rpm -q rdma-core or rpm -q libibverbs
      • or: MLNX_OFED version ofed_info -s
    • HW information from ibstat or ibv_devinfo -vv command
  • For GPU related issues:
    • GPU type : H100
    • Cuda:
      • Drivers version:12.2
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv : Please see log file.

Additional information (depending on the issue)

  • OpenMPI version:5.0.3
  • Output of ucx_info -d to show transports and devices recognized by UCX: Please see log file.

shinoharakazuya avatar Jun 24 '24 04:06 shinoharakazuya

@shinoharakazuya can you pls post the output of show_gids command, and check if setting UCX_IB_ROCE_LOCAL_SUBNET=y helps to resolve the issue?

yosefe avatar Jun 29 '24 11:06 yosefe

@jandres742 FYI

changchengx avatar Jun 30 '24 05:06 changchengx

NOTE: This issue happens on Nvidia internal cluster

yosefe avatar Jun 30 '24 07:06 yosefe

@yosefe I have same issue.

client:

UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_TLS=rc  UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 2097152  -n 5000000 -e

server:

UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest -e
[1737007020.457942] [node13:967927:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737007020.457946] [node13:967927:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from host memory                          |
[1737007020.457947] [node13:967927:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737007020.457950] [node13:967927:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737007020.457953] [node13:967927:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737007020.457955] [node13:967927:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_1:1 |
[1737007020.457958] [node13:967927:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737007021.503646] [node13:967927:a]       ib_device.c:1332 UCX  ERROR   ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::a288:c2ff:feb4:87d7 flow_label=0xffffffff sgid_index=1 traffic_class=106) for RC DEVX QP connect on mlx5_1 failed: Connection timed out
[1737007021.503771] [node13:967927:0]         libperf.c:1069 UCX  ERROR error handler called with status -80 (Endpoint timeout)
[root@node12 ucx-1.18.0]# show_gids
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_0  1       0       fe80:0000:0000:0000:a288:c2ff:feb4:87e6                 v1      ens2f0np0
mlx5_0  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:87e6                 v2      ens2f0np0
mlx5_0  1       2       0000:0000:0000:0000:0000:ffff:0a10:1d0c 10.16.29.12     v1      ens2f0np0
mlx5_0  1       3       0000:0000:0000:0000:0000:ffff:0a10:1d0c 10.16.29.12     v2      ens2f0np0
mlx5_1  1       0       fe80:0000:0000:0000:a288:c2ff:feb4:87e7                 v1      ens2f1np1
mlx5_1  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:87e7                 v2      ens2f1np1
mlx5_2  1       0       fe80:0000:0000:0000:a288:c2ff:feb4:a562                 v1      ens7f0np0
mlx5_2  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:a562                 v2      ens7f0np0
mlx5_2  1       2       0000:0000:0000:0000:0000:ffff:0a10:270c 10.16.39.12     v1      ens7f0np0
mlx5_2  1       3       0000:0000:0000:0000:0000:ffff:0a10:270c 10.16.39.12     v2      ens7f0np0
mlx5_3  1       0       fe80:0000:0000:0000:a288:c2ff:feb4:a563                 v1      ens7f1np1
mlx5_3  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:a563                 v2      ens7f1np1
n_gids_found=12

ivanallen avatar Jan 16 '25 05:01 ivanallen
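
The UCX_PROTO_INFO table above shows which device carries each message-size range, including the 50/50 split across two devices for rendezvous. A minimal sketch for pulling the devices out of one table row, assuming the row layout shown in the log; `lanes_in_row` and `LANE_RE` are hypothetical names, not part of UCX:

```python
import re

# Matches "rc_mlx5/mlx5_0:1", optionally preceded by "50% on ".
LANE_RE = re.compile(r"(?:(\d+)% on )?(\w+)/(\w+):(\d+)")

# One row copied from the server log in this thread.
ROW = (
    "[1737007020.457955] [node13:967927:0]   |                 4845..inf "
    "| (?) rendezvous zero-copy read from remote "
    "| 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_1:1 |"
)


def lanes_in_row(row):
    """Return (share_percent, transport, device, port) for each lane in a data row.

    share_percent is None when the row does not state a percentage.
    Border and title rows yield an empty list.
    """
    cells = [cell.strip() for cell in row.split("|")]
    if len(cells) < 5:  # not a three-column data row
        return []
    config = cells[-2]  # last populated column holds the transport/device
    return [
        (int(share) if share else None, transport, device, int(port))
        for share, transport, device, port in LANE_RE.findall(config)
    ]


for lane in lanes_in_row(ROW):
    print(lane)
```

For the row above this lists both mlx5_0 and mlx5_1, so large messages are expected to use mlx5_1 even though the short and zero-copy ranges stay on mlx5_0.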

@ivanallen mlx5_1 does not have an IP address, is that expected?

yosefe avatar Jan 16 '25 09:01 yosefe
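
The observation above can be checked mechanically. The following is a rough sketch that reads show_gids output and reports which devices named in a UCX_NET_DEVICES-style list have no IPv4 GID. The sample is an abridged copy of the table posted in this thread; the function names are hypothetical, and the parser assumes the column layout shown here:

```python
import re

# Abridged copy of the show_gids output from node12 in this thread.
SHOW_GIDS = """\
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_0  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:87e6                 v2      ens2f0np0
mlx5_0  1       3       0000:0000:0000:0000:0000:ffff:0a10:1d0c 10.16.29.12     v2      ens2f0np0
mlx5_1  1       0       fe80:0000:0000:0000:a288:c2ff:feb4:87e7                 v1      ens2f1np1
mlx5_1  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:87e7                 v2      ens2f1np1
mlx5_2  1       3       0000:0000:0000:0000:0000:ffff:0a10:270c 10.16.39.12     v2      ens7f0np0
mlx5_3  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:a563                 v2      ens7f1np1
"""

IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def devices_with_ipv4(show_gids_text):
    """Map each RDMA device to the set of IPv4 addresses found in its GID rows."""
    table = {}
    for line in show_gids_text.splitlines():
        fields = line.split()
        # Data rows have numeric PORT and INDEX columns; skip headers and footers.
        if len(fields) < 6 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        addrs = table.setdefault(fields[0], set())
        # Rows with an IPv4 column have 7 fields; rows without it have 6.
        if len(fields) == 7 and IPV4_RE.match(fields[4]):
            addrs.add(fields[4])
    return table


def selected_without_ipv4(ucx_net_devices, table):
    """Return devices from a 'dev:port,dev:port' list that have no IPv4 GID."""
    selected = [item.split(":")[0] for item in ucx_net_devices.split(",") if item]
    return [dev for dev in selected if not table.get(dev)]


table = devices_with_ipv4(SHOW_GIDS)
print(selected_without_ipv4("mlx5_0:1,mlx5_1:1", table))  # ['mlx5_1']
```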

> @ivanallen mlx5_1 does not have an IP address, is that expected?

Yes, that is expected. We don't configure mlx5_1 and mlx5_3.

ivanallen avatar Jan 16 '25 10:01 ivanallen

> Yes, that is expected. We don't configure mlx5_1 and mlx5_3.

Seems like the test is being run on mlx5_1? Per the command above:

UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_TLS=rc UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 2097152 -n 5000000 -e

yosefe avatar Jan 16 '25 10:01 yosefe

>> Yes, that is expected. We don't configure mlx5_1 and mlx5_3.

> Seems like the test is being run on mlx5_1? Per the command above:

> UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_TLS=rc UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 2097152 -n 5000000 -e

@yosefe Do you mean using mlx5_2? mlx5_1 has no IP address. I have the same problem if I use UCX_NET_DEVICES=mlx5_0:1 and mlx5_1:1.

server:

[root@node13 ucx-1.18.0]# UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 UCX_PROTO_ENABLE=y UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest -e
[1737024891.886842] [node13:2797867:0]        perftest.c:800  UCX  WARN  CPU affinity is not set (bound to 96 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.16.29.12:52468
+----------------------------------------------------------------------------------------------------------+
| API:          protocol layer                                                                             |
| Test:         am bandwidth / message rate                                                                |
| Data layout:  (automatic)                                                                                |
| Send memory:  host                                                                                       |
| Recv memory:  host                                                                                       |
| Message size: 1048576                                                                                    |
| Window size:  32                                                                                         |
| AM header size: 0                                                                                        |
+----------------------------------------------------------------------------------------------------------+
[1737024893.671591] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.671602] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* from host memory                                                 |
[1737024893.671606] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.671609] [node13:2797867:0]   |                   0..2038 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.671612] [node13:2797867:0]   |                2039..8246 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024893.671613] [node13:2797867:0]   |               8247..29420 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024893.671616] [node13:2797867:0]   |                29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.671619] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.671782] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.671786] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send*(fast-completion) from host memory                                |
[1737024893.671788] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.671791] [node13:2797867:0]   |                   0..2038 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.671794] [node13:2797867:0]   |                2039..8246 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024893.671796] [node13:2797867:0]   |               8247..22493 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024893.671798] [node13:2797867:0]   |             22494..262143 | multi-frag zero-copy                      | rc_mlx5/mlx5_0:1                                    |
[1737024893.671801] [node13:2797867:0]   |                 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.671802] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672161] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672165] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send*(multi) from host memory                                          |
[1737024893.672166] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672171] [node13:2797867:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.672173] [node13:2797867:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737024893.672175] [node13:2797867:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672178] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672367] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672371] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag from host memory                                 |
[1737024893.672373] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672375] [node13:2797867:0]   |                   0..2030 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.672377] [node13:2797867:0]   |                2031..8238 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024893.672378] [node13:2797867:0]   |               8239..29420 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024893.672381] [node13:2797867:0]   |                29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672384] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672535] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672538] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(fast-completion) from host memory                |
[1737024893.672540] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672543] [node13:2797867:0]   |                   0..2030 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.672545] [node13:2797867:0]   |                2031..8238 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024893.672548] [node13:2797867:0]   |               8239..22493 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024893.672551] [node13:2797867:0]   |             22494..262143 | multi-frag zero-copy                      | rc_mlx5/mlx5_0:1                                    |
[1737024893.672554] [node13:2797867:0]   |                 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672556] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672739] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672742] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from host memory                          |
[1737024893.672744] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672748] [node13:2797867:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.672752] [node13:2797867:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737024893.672755] [node13:2797867:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672756] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024894.680128] [node13:2797867:0]       ib_device.c:1332 UCX  ERROR   ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.16.39.12 flow_label=0xffffffff sgid_index=3 traffic_class=106) for RC DEVX QP connect on mlx5_2 failed: Connection timed out
[1737024894.680178] [node13:2797867:0]         libperf.c:1069 UCX  ERROR error handler called with status -80 (Endpoint timeout)
[root@node13 ucx-1.18.0]#

client:

[root@node12 ucx-1.18.0]# UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 UCX_PROTO_ENABLE=y UCX_TLS=rc UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 1048576 -n 5000000 -e
[1737024666.944378] [node13:2783427:0]        perftest.c:800  UCX  WARN  CPU affinity is not set (bound to 96 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[1737024667.122800] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.122811] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* from host memory                                                 |
[1737024667.122814] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.122817] [node13:2783427:0]   |                   0..2038 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.122819] [node13:2783427:0]   |                2039..8246 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024667.122822] [node13:2783427:0]   |               8247..29420 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024667.122825] [node13:2783427:0]   |                29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.122827] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.122978] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.122982] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send*(fast-completion) from host memory                                |
[1737024667.122984] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.122987] [node13:2783427:0]   |                   0..2038 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.122990] [node13:2783427:0]   |                2039..8246 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024667.122993] [node13:2783427:0]   |               8247..22493 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024667.122997] [node13:2783427:0]   |             22494..262143 | multi-frag zero-copy                      | rc_mlx5/mlx5_0:1                                    |
[1737024667.122999] [node13:2783427:0]   |                 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123001] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123351] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123355] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send*(multi) from host memory                                          |
[1737024667.123356] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123360] [node13:2783427:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.123362] [node13:2783427:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737024667.123364] [node13:2783427:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123368] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123534] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123537] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag from host memory                                 |
[1737024667.123541] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123543] [node13:2783427:0]   |                   0..2030 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.123545] [node13:2783427:0]   |                2031..8238 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024667.123546] [node13:2783427:0]   |               8239..29420 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024667.123550] [node13:2783427:0]   |                29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123553] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123705] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123708] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(fast-completion) from host memory                |
[1737024667.123710] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123714] [node13:2783427:0]   |                   0..2030 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.123716] [node13:2783427:0]   |                2031..8238 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024667.123718] [node13:2783427:0]   |               8239..22493 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024667.123720] [node13:2783427:0]   |             22494..262143 | multi-frag zero-copy                      | rc_mlx5/mlx5_0:1                                    |
[1737024667.123723] [node13:2783427:0]   |                 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123727] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123900] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123904] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from host memory                          |
[1737024667.123906] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123909] [node13:2783427:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.123912] [node13:2783427:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737024667.123914] [node13:2783427:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123917] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024720.960016] [node13:2783427:0]         libperf.c:1069 UCX  ERROR error handler called with status -80 (Endpoint timeout)

ivanallen avatar Jan 16 '25 10:01 ivanallen
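
The ibv_create_ah error in the server log names the destination GID, the source GID index, and the device. A small sketch for extracting those fields and turning an IPv4-mapped GID back into a dotted address, using only the Python standard library; `parse_create_ah_error` and `gid_to_ipv4` are hypothetical helper names, and the regex assumes the message format seen in this thread:

```python
import ipaddress
import re

# Error text copied from the server log above (timestamp and source location dropped).
ERROR_LINE = (
    "ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 "
    "dgid=::ffff:10.16.39.12 flow_label=0xffffffff sgid_index=3 "
    "traffic_class=106) for RC DEVX QP connect on mlx5_2 failed: Connection timed out"
)


def parse_create_ah_error(line):
    """Return the key=value arguments of ibv_create_ah plus the device, or None."""
    match = re.search(
        r"ibv_create_ah\((?P<args>[^)]*)\).*connect on (?P<dev>\S+) failed", line
    )
    if match is None:
        return None
    fields = dict(item.split("=", 1) for item in match.group("args").split())
    fields["device"] = match.group("dev")
    return fields


def gid_to_ipv4(gid):
    """Return the IPv4 address embedded in an IPv4-mapped GID, or None otherwise."""
    mapped = ipaddress.IPv6Address(gid).ipv4_mapped
    return None if mapped is None else str(mapped)


info = parse_create_ah_error(ERROR_LINE)
print(info["device"], info["sgid_index"], gid_to_ipv4(info["dgid"]))
```

Here the failing address handle targets 10.16.39.12 through GID index 3 on mlx5_2, which matches the mlx5_2 row in the show_gids output posted earlier.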

Can you try: ping -I ens7f0np0 10.16.39.12 on node13?

yosefe avatar Jan 16 '25 13:01 yosefe

Also, can you try adding UCX_IB_ROCE_LOCAL_SUBNET=y (to both client and server)?

yosefe avatar Jan 16 '25 13:01 yosefe
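
As a rough illustration of the kind of local-subnet test this setting is concerned with, the sketch below checks whether a peer address falls inside the subnet of a local interface. This is not UCX's implementation. The prefix lengths are assumptions, since the thread never states the netmask, and `same_subnet` is a hypothetical helper. The addresses are ones that appear in this thread:

```python
import ipaddress


def same_subnet(local_ip, peer_ip, prefix_len):
    """Return True if peer_ip lies in the subnet of local_ip/prefix_len."""
    network = ipaddress.ip_network(f"{local_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(peer_ip) in network


# Assumed /24 netmask: the two 10.16.39.x addresses share a subnet,
# while 10.16.29.13 and 10.16.39.12 do not.
print(same_subnet("10.16.39.13", "10.16.39.12", 24))
print(same_subnet("10.16.29.13", "10.16.39.12", 24))
```

With a wider assumed mask such as /16 both pairs would count as local, so the actual netmask on each interface matters when reasoning about which device can reach which peer.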

@yosefe Sorry, it looks like a network failure. I'll look into it myself first.

> Can you try: ping -I ens7f0np0 10.16.39.12 on node13?

[root@localhost network-scripts]# ping -I ens7f0np0 10.16.39.12
PING 10.16.39.12 (10.16.39.12) from 10.16.39.13 ens7f0np0: 56(84) bytes of data.
From 10.16.39.13 icmp_seq=1 Destination Host Unreachable
From 10.16.39.13 icmp_seq=2 Destination Host Unreachable
From 10.16.39.13 icmp_seq=3 Destination Host Unreachable
From 10.16.39.13 icmp_seq=4 Destination Host Unreachable
From 10.16.39.13 icmp_seq=5 Destination Host Unreachable
[root@localhost network-scripts]# ping 10.16.39.12
PING 10.16.39.12 (10.16.39.12) 56(84) bytes of data.
From 10.16.39.13 icmp_seq=1 Destination Host Unreachable
From 10.16.39.13 icmp_seq=2 Destination Host Unreachable
From 10.16.39.13 icmp_seq=3 Destination Host Unreachable

ivanallen avatar Jan 17 '25 05:01 ivanallen
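
When the same check has to be repeated across several interfaces, the ping output can be summarized with a few lines of Python. This is a minimal sketch that assumes the iputils output format shown above; `summarize_ping` is a hypothetical helper:

```python
# Output copied from the first ping run above.
PING_OUTPUT = """\
PING 10.16.39.12 (10.16.39.12) from 10.16.39.13 ens7f0np0: 56(84) bytes of data.
From 10.16.39.13 icmp_seq=1 Destination Host Unreachable
From 10.16.39.13 icmp_seq=2 Destination Host Unreachable
From 10.16.39.13 icmp_seq=3 Destination Host Unreachable
From 10.16.39.13 icmp_seq=4 Destination Host Unreachable
From 10.16.39.13 icmp_seq=5 Destination Host Unreachable
"""


def summarize_ping(output):
    """Count echo replies and 'Destination Host Unreachable' errors, and
    collect the addresses that reported the errors."""
    summary = {"replies": 0, "unreachable": 0, "reported_by": set()}
    for line in output.splitlines():
        if "Destination Host Unreachable" in line:
            summary["unreachable"] += 1
            parts = line.split()
            if len(parts) > 1 and parts[0] == "From":
                summary["reported_by"].add(parts[1])
        elif "bytes from" in line:  # e.g. "64 bytes from <addr>: icmp_seq=1 ..."
            summary["replies"] += 1
    return summary


print(summarize_ping(PING_OUTPUT))
```

In this run every error is reported by 10.16.39.13, the sending host's own address on ens7f0np0, with no replies from 10.16.39.12. That pattern usually points at the link or neighbor resolution on that interface rather than at UCX.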

@yosefe With UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 it is working now, but I only get 2*100Gbps of bandwidth.

However, in my other environment (425Gbps), it also works properly without restricting UCX_NET_DEVICES to mlx5_0:1,mlx5_2:1, and reaches 425Gbps of bandwidth.

ivanallen avatar Jan 17 '25 07:01 ivanallen

@ivanallen what is the network speed of each NIC (can be checked by ibstat or ibv_devinfo)? Does the other environment have more configured NICs?

yosefe avatar Jan 19 '25 08:01 yosefe

> @ivanallen what is the network speed of each NIC (can be checked by ibstat or ibv_devinfo)? Does the other environment have more configured NICs?

Hi @yosefe, Can we look at #10430 first? I suspect there is a problem with the conversion between bond and non-bond. This time let's look at the bandwidth of the bond environment first.

ivanallen avatar Jan 20 '25 04:01 ivanallen