v4.1.5 UCX_NET_DEVICES not selecting TCP devices correctly
Details of the problem
- OS version (e.g. Linux distro): Rocky Linux release 9.4 (Blue Onyx)
- Driver version:
  - rdma-core-2404mlnx51-1.2404066.x86_64
  - MLNX_OFED_LINUX-24.04-0.6.6.0
Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.
I'm running a 2-node IMB-MPI1 PingPong to benchmark RoCE against regular TCP Ethernet.
Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected.
Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, with only slightly higher latency.
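For concreteness, a sketch of the settings tried (mlx5_0, eth0, and eth1 are this cluster's device names):

# Uses RDMA/RoCE and gives optimal latency, as expected:
export UCX_NET_DEVICES=all
export UCX_NET_DEVICES=mlx5_0:1
# Intended to force plain TCP, but traffic still appears to go over RoCE:
export UCX_NET_DEVICES=eth0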
HW information from ibstat or ibv_devinfo -vv:
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 20.36.1010
node_guid: fa16:3eff:fe4f:f5e9
sys_image_guid: 0c42:a103:0003:5d82
vendor_id: 0x02c9
vendor_part_id: 4124
hw_ver: 0x0
board_id: MT_0000000224
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
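Note link_layer: Ethernet, i.e. this port is a RoCE port. To map the RDMA device onto its backing network interface, the ibdev2netdev tool from MLNX_OFED (used later in this thread) is handy:

# Show which netdev backs each RDMA device, and its link state
ibdev2netdev | grep Up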
How OMPI is configured, from ompi_info | grep Configure:
Configured architecture: x86_64-pc-linux-gnu
Configured by: abuild
Configured on: Thu Aug 3 14:25:15 UTC 2023
Configure command line: '--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5'
'--disable-static' '--enable-builtin-atomics'
'--with-sge' '--enable-mpi-cxx'
'--with-hwloc=/opt/ohpc/pub/libs/hwloc'
'--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0'
'--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0'
'--without-verbs' '--with-tm=/opt/pbs/'
Following the advice from here, this is apparently due to OpenMPI's btl/openib component having a higher priority, but I don't see how that can be the case, since the build uses --without-verbs and openib does not appear in the output of ompi_info | grep btl.
As suggested in the UCX issue, adding -mca pml_ucx_tls any -mca pml_ucx_devices any to my mpirun invocation has fixed the problem, but I was wondering what in the MCA precisely causes this behaviour.
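One way to dig into this is to inspect the defaults of those two parameters with OpenMPI's own ompi_info tool (a sketch; exact parameter names are confirmed by the workaround above, but the defaults vary across OMPI versions):

# Print the pml/ucx selection parameters and their current values;
# pml_ucx_tls and pml_ucx_devices influence when pml/ucx enables itself.
ompi_info --param pml ucx --level 9 | grep -E 'pml_ucx_(tls|devices)'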
Here's my batch script:
#!/usr/bin/env bash
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.out
#SBATCH --exclusive
#SBATCH --partition=standard
module load gnu12 openmpi4 imb
# Restrict UCX to the RoCE device (swapped for eth0 etc. when testing TCP)
export UCX_NET_DEVICES=mlx5_0:1
echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
echo SLURM_JOB_ID: $SLURM_JOB_ID
echo UCX_NET_DEVICES: $UCX_NET_DEVICES
export UCX_LOG_LEVEL=data
# pml_ucx_tls/pml_ucx_devices set to "any" is the workaround described above
mpirun -mca pml_ucx_tls any -mca pml_ucx_devices any IMB-MPI1 pingpong -iter_policy off
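To cross-check what UCX itself detects on a node, ucx_info (shipped with UCX) lists every transport/device pair; comparing that against the UCX_LOG_LEVEL=data output should show which pair is actually selected at runtime (a sketch; output format varies by UCX version):

# List all transport/device pairs UCX can see on this node
ucx_info -d | grep -E 'Transport|Device'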
@bertiethorpe I can't reproduce the described behavior with ompi and ucx built from source (see below); what am I missing?
- I removed libfabric and pbs
- used OSU instead of IMB, but it should not make a difference:
$ <path>/ompi_install/bin/ompi_info | grep Configure
Configured architecture: x86_64-pc-linux-gnu
Configured by: evgenylek
Configured on: Tue Oct 1 17:07:14 UTC 2024
Configure command line: '--prefix=<path>/ompi_install' '--disable-static' '--enable-builtin-atomics' '--with-sge' '--enable-mpi-cxx' '--without-verbs'
$ ibdev2netdev | grep Up
mlx5_0 port 1 ==> ib0 (Up)
mlx5_2 port 1 ==> ib2 (Up)
mlx5_3 port 1 ==> enp129s0f1np1 (Up)
mlx5_4 port 1 ==> ib3 (Up)
$ mpirun -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128
# OSU MPI Latency Test v5.8
# Size Latency (us)
0 0.89
1 0.89
2 0.89
4 0.89
8 0.88
16 0.89
32 0.91
64 1.03
128 1.07
$ mpirun -x UCX_NET_DEVICES=mlx5_0:1 -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128
# OSU MPI Latency Test v5.8
# Size Latency (us)
0 0.89
1 0.89
2 0.88
4 0.88
8 0.88
16 0.89
32 0.91
64 1.02
128 1.07
$ mpirun -x UCX_NET_DEVICES=mlx5_3:1 -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128
# OSU MPI Latency Test v5.8
# Size Latency (us)
0 1.33
1 1.34
2 1.34
4 1.34
8 1.34
16 1.34
32 1.38
64 1.60
128 1.67
$ mpirun -x UCX_NET_DEVICES=enp129s0f1np1 -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128
# OSU MPI Latency Test v5.8
# Size Latency (us)
0 55.89
1 56.11
2 56.15
4 56.29
8 56.09
16 56.12
32 56.14
64 56.62
128 56.86
$ mpirun -x UCX_NET_DEVICES=eno1 -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128
# OSU MPI Latency Test v5.8
# Size Latency (us)
0 60.95
1 61.04
2 61.11
4 61.12
8 61.05
16 61.10
32 61.16
64 61.43
128 61.69
@bertiethorpe can you please increase the verbosity of OpenMPI by adding -mca pml_ucx_verbose 99 after mpirun (along with -x UCX_NET_DEVICES=eth0), and post the resulting output?
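For example (combining those flags with the benchmark command from your batch script):

mpirun -mca pml_ucx_verbose 99 -x UCX_NET_DEVICES=eth0 IMB-MPI1 pingpong -iter_policy off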
Thanks!
@yosefe, @evgeny-leksikov, @jsquyres and @janjust: I am not sure if the following helps, but here it is.
I built the latest version of OpenMPI (OMPI) and an older one (5.0.5 and 5.0.2) to reproduce a non-working and a working setup. Something appears to be broken in the newer versions of OMPI; however, as presented below, I have a 5.0.2 build that works. I did not proceed to test more versions.
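For reference, the working 5.0.2 build can be reproduced roughly like this (a sketch reconstructed from the configure command line in the ompi_info output below; prefixes are as reported there):

# Build sketch for the working OMPI 5.0.2 (UCX already installed in /usr/local)
./configure --prefix=/usr/local --with-ucx=/usr/local \
            --enable-orterun-prefix-by-default \
            --enable-mca-no-build=btl-uct
make -j"$(nproc)" && make install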
ompi_info:
Open MPI: 5.0.2
Open MPI repo revision: v5.0.2
Open MPI release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 5.0.2
Prefix: /usr/local
Configured architecture: x86_64-pc-linux-gnu
Configured by: root
Configured on: Sun Jan 12 04:07:13 UTC 2025
Configure host: 744947182c1f
Configure command line: '--prefix=/usr/local' '--with-ucx=/usr/local'
'--enable-orterun-prefix-by-default'
'--enable-mca-no-build=btl-uct'
Built by:
Built on: Sun Jan 12 04:12:58 UTC 2025
Built host: 744947182c1f
C bindings: yes
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the gfortran compiler and/or Open
MPI, does not support the following: array
subsections, direct passthru (where possible) to
underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /bin/gcc
C compiler family name: GNU
C compiler version: 8.5.0
C++ compiler: g++
C++ compiler absolute: /bin/g++
Fort compiler: gfortran
Fort compiler abs: /bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI extensions: affinity, cuda, ftmpi, rocm
Fault Tolerance support: yes
FT MPI support: yes
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.2)
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.2)
MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.2)
MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.2)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.2)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
v5.0.2)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
v5.0.2)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.2)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
v5.0.2)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.2)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.2)
MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.2)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.2)
MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: hcoll (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
v5.0.2)
MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.2)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
v5.0.2)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
v5.0.2)
MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
v5.0.2)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.2)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.2)
MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
v5.0.2)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.2)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.0.2)
MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.2)
MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.2)
MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
v5.0.2)
MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.2)
MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.0.2)
MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.2)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
v5.0.2)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
v5.0.2)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.2)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.2)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
v5.0.2)
MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
v5.0.2)
- Runs on OMPI 5.0.2 (each block below shows the mpirun flags used, followed by the OSU output):
-mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_0:1
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.20
2 5.09
4 3.68
8 4.03
16 4.24
32 5.48
64 7.37
128 7.10
256 7.54
512 10.86
1024 13.46
2048 16.65
4096 26.47
8192 46.12
16384 80.96
32768 152.56
65536 310.15
131072 636.13
262144 1312.73
524288 2727.59
1048576 5604.98
-x UCX_NET_DEVICES=lo --mca coll ^hcoll --mca btl ^vader,self,tcp,openib,uct
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 92.76
2 94.03
4 71.23
8 71.11
16 75.24
32 71.16
64 80.31
128 89.17
256 140.97
512 109.10
1024 125.49
2048 176.34
4096 256.48
8192 393.21
16384 777.16
32768 1532.98
65536 3991.32
131072 7831.29
262144 15324.53
524288 30227.76
1048576 60535.43
--mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.51
2 4.42
4 9.16
8 9.14
16 9.48
32 10.74
64 10.11
128 10.99
256 12.66
512 8.53
1024 9.82
2048 17.10
4096 25.33
8192 43.60
16384 77.65
32768 146.68
65536 290.46
131072 620.92
262144 1299.81
524288 2719.89
1048576 5549.75
-x UCX_NET_DEVICES=mlx5_0:1 --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.54
2 4.48
4 9.20
8 9.11
16 9.43
32 9.92
64 10.30
128 11.05
256 12.90
512 8.68
1024 9.80
2048 14.37
4096 25.72
8192 44.33
16384 78.81
32768 148.41
65536 293.32
131072 621.30
262144 1301.58
524288 2713.73
1048576 5542.62
-x UCX_NET_DEVICES=all --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.48
2 4.49
4 9.23
8 9.31
16 9.51
32 9.83
64 10.35
128 11.10
256 12.70
512 8.65
1024 9.78
2048 14.21
4096 25.27
8192 45.32
16384 77.48
32768 146.71
65536 292.03
131072 619.11
262144 1297.05
524288 2728.60
1048576 5541.79
-x UCX_NET_DEVICES=eth0 --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 88.47
2 86.85
4 73.91
8 70.82
16 76.02
32 75.62
64 91.96
128 103.55
256 139.06
512 108.88
1024 125.95
2048 176.67
4096 256.12
8192 392.61
16384 776.01
32768 1523.09
65536 3982.52
131072 7862.55
262144 15454.59
524288 30260.65
1048576 60475.30
-x UCX_NET_DEVICES=lo --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 92.04
2 91.74
4 71.86
8 71.72
16 76.30
32 78.89
64 92.81
128 107.37
256 141.70
512 109.66
1024 126.11
2048 177.17
4096 255.36
8192 395.78
16384 785.76
32768 1557.99
65536 4035.93
131072 7849.01
262144 15691.52
524288 32492.75
1048576 60601.98
--mca btl_tcp_if_include eth0,lo --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 93.87
2 92.88
4 71.79
8 73.78
16 76.53
32 79.56
64 92.14
128 105.27
256 143.71
512 107.98
1024 125.89
2048 177.95
4096 258.96
8192 398.15
16384 804.27
32768 1524.64
65536 3975.23
131072 7806.18
262144 15415.16
524288 30361.39
1048576 64038.22
--mca btl tcp,self,vader --mca pml ^ucx --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 99.36
2 94.99
4 72.44
8 74.38
16 79.04
32 80.30
64 92.98
128 105.93
256 143.32
512 111.50
1024 128.55
2048 178.13
4096 259.11
8192 394.63
16384 789.31
32768 1541.04
65536 4032.19
131072 7845.18
262144 15384.49
524288 30367.15
1048576 60630.55
All this looks good to me. Allowing UCX to pick the communication device gives you IB (and a latency of 4us for the all-to-all) while enforcing a specific device (mostly TCP in these examples) works but gives a much higher latency. What exactly is the question we are trying to answer here ?
Thank you for your feedback. The main question we were trying to answer was why UCX_NET_DEVICES was not selecting the correct TCP interface. I found that upgrading the OMPI version and installing UCX from source resolved the issue.
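In case it helps others, building UCX from source follows the usual pattern (a sketch; the prefix is illustrative, and contrib/configure-release is UCX's release-mode configure wrapper):

# Illustrative UCX from-source build; OMPI is then configured --with-ucx=<prefix>
git clone https://github.com/openucx/ucx.git && cd ucx
./autogen.sh
./contrib/configure-release --prefix=/usr/local
make -j"$(nproc)" && make install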
Perhaps the original authors of the question could provide some more context.
@bertiethorpe , it seems like the issue is fixed in new OMPI versions, so can we close it?
It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.
I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!