Attempts communication over unroutable IP interface
Describe the bug
We are trying to use Open MPI + UCX to run a simple benchmark from osu-micro-benchmarks (osu_bcast). When two machines each have a running Docker container, the docker0 interface is up on both machines and UCX attempts to use this interface.
Steps to Reproduce
- UCX version: 1.14.0
$ srun -n2 -p histamine osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
[1695332443.361791] [histamine0:1224050:0] sock.c:323 UCX ERROR connect(fd=42, dest_addr=172.17.0.1:54351) failed: Connection refused
[histamine0:1224050] pml_ucx.c:424 Error: ucp_ep_create(proc=1) failed: Destination is unreachable
[histamine0:1224050] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 1
[histamine0:1224050] *** An error occurred in MPI_Bcast
[histamine0:1224050] *** reported by process [2135184123,0]
[histamine0:1224050] *** on communicator MPI_COMM_WORLD
[histamine0:1224050] *** MPI_ERR_OTHER: known error not in list
[histamine0:1224050] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[histamine0:1224050] *** and potentially your MPI job)
slurmstepd: error: *** STEP 1484.0 ON histamine0 CANCELLED AT 2023-09-21T21:40:43 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: histamine0: task 0: Exited with exit code 16
srun: error: histamine1: task 1: Killed
# Library version: 1.14.0
# Library path: /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucs.so.0
# API headers version: 1.14.0
# Git branch '', revision ae505b9
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg --without-go --disable-doxygen-doc --enable-numa --disable-assertions --enable-compiler-opt=3 --without-java --enable-shared --enable-static --disable-logging --enable-mt --with-openmp --enable-optimizations --disable-params-check --disable-gtest --with-pic --with-cuda=/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/cuda-11.8.0-vltbfy3o7lx4up3gryipectsmvy2fctc --enable-cma --without-dc --without-dm --with-gdrcopy=/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/gdrcopy-2.3-zm6nhbdg72dwuu7yd6ddskjwgrmef4zl --without-ib-hw-tm --without-knem --with-mlx5-dv --with-rc --with-ud --without-xpmem --without-fuse3 --without-bfd --without-rdmacm --with-verbs=/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/rdma-core-41.0-zlh7l5va5ia5pxhyqrilx7tftswu4kke --without-avx --without-sse41 --with-sse42 --without-rocm
Setup and versions
- OS version: Rocky Linux 9.2
Additional information (depending on the issue)
@G-Ragghianti
- Can you please try setting UCX_NET_DEVICES=eno1,eno2 as a workaround?
- Can you please post the output of "ip a s docker0" when Docker is running?
Thanks for looking into this.
A whitelist of network interfaces is less ideal than a blacklist, since the machines in our cluster have a variety of interface names. However, testing this on two machines with docker0 interfaces results in the following:
$ UCX_NET_DEVICES=enp68s0f0 srun -w histamine0,histamine1 -n 2 osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 21.15
2 12.16
4 3.84
8 21.99
16 15.63
32 22.68
64 17.78
128 20.80
256 21.12
512 28.57
1024 38.62
2048 149.20
4096 151.73
8192 163.03
16384 176.28
32768 229.11
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: histamine0
PID: 1820732
--------------------------------------------------------------------------
^Csrun: interrupt (one more within 1 sec to abort)
When executed on two machines without docker0 interfaces, the benchmark runs cleanly, but the performance is significantly lower than expected. This is likely because it is using the TCP interface for all communication instead of the InfiniBand interface for message passing.
Here is the ip config:
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:af:04:00:ef brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:afff:fe04:ef/64 scope link
valid_lft forever preferred_lft forever
I found some code in uct/tcp/tcp_iface.c which seems to support a configuration variable "PREFER_DEFAULT" that gives higher priority to interfaces with default routing. However, it isn't clear how to set it, and the default value already appears to prefer the default route. Since our docker0 interface isn't the default route, this option doesn't appear to be taking effect.
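If that table entry maps to an environment variable the way other UCX TCP settings usually do (an assumption on my part; the UCX_TCP_ prefix is inferred from the config table in tcp_iface.c rather than checked against the docs), it could be inspected and set roughly like this:
$ ucx_info -c -f | grep -i prefer_default   # show the current value and its description
$ UCX_TCP_PREFER_DEFAULT=y srun -n2 -p histamine osu_bcast   # assumes the srun invocation from the report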
@G-Ragghianti
- The error above seems to come from Open MPI. Can you please try setting OMPI_MCA_btl_tcp_if_include=enp68s0f0?
- I'm thinking of adding logic to UCX to ignore bridge interfaces (such as docker0) for the TCP transport; a rough sketch of the detection idea follows.
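For what it's worth, on Linux a bridge device such as docker0 exposes a bridge subdirectory under sysfs, so a minimal sketch of the idea (the interface name is only an example; the eventual UCX change may well identify bridges differently) is:
$ [ -d /sys/class/net/docker0/bridge ] && echo "docker0 is a Linux bridge"   # prints only when docker0 is a bridge device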
Yes, the error comes from OMPI in that case. I've been trying different combinations of OMPI_MCA_btl_tcp_if_exclude, OMPI_MCA_btl_tcp_if_include, and UCX_NET_DEVICES, mostly with confusing results. The only combination that works without errors is the following:
OMPI_MCA_btl_tcp_if_include=ibp193s0f0 UCX_NET_DEVICES=ibp193s0f0
However, this results in slower performance than the Infiniband network should provide.
$ OMPI_MCA_btl_tcp_if_exclude=docker0 srun -w histamine0,histamine1 -n 2 osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
[1695675466.657697] [histamine1:7221 :0] sock.c:323 UCX ERROR connect(fd=43, dest_addr=172.17.0.1:60657) failed: Connection refused
^C
$ OMPI_MCA_btl_tcp_if_exclude=docker0 UCX_NET_DEVICES=ibp193s0f0 srun -w histamine0,histamine1 -n 2 osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: histamine0
PID: 29507
--------------------------------------------------------------------------
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=1629.0 tasks 0-1: running
^Csrun: sending Ctrl-C to StepId=1629.0
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
$ OMPI_MCA_btl_tcp_if_include=ibp193s0f0 UCX_NET_DEVICES=ibp193s0f0 srun -w histamine0,histamine1 -n 2 osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 13.34
...
1048576 1947.68
# This is too slow
@abouteiller
Basically, the BTLs are used by the OB1 PML, while the UCX net devices are used by the UCX PML. Since the current OMPI default is the UCX PML (where supported), any change to OMPI_MCA_btl_tcp_* parameters will be ignored. Thus, the only setting that matters is UCX_NET_DEVICES=ibp193s0f0.
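To take PML selection out of the equation entirely, the UCX PML can be forced explicitly; a minimal sketch, assuming mpirun is the launcher and osu_bcast is on the PATH:
$ mpirun --mca pml ucx -x UCX_NET_DEVICES=ibp193s0f0 -np 2 osu_bcast   # aborts outright if the UCX PML cannot initialize, instead of silently falling back to OB1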
But when I use just UCX_NET_DEVICES=ibp193s0f0, I get the following error from OMPI:
$ UCX_NET_DEVICES=ibp193s0f0 srun -n2 -w histamine0,histamine1 osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: histamine0
PID: 34418
--------------------------------------------------------------------------
This is indeed a message from the TCP BTL, which seems to indicate that your run did not use the UCX PML but instead relied on the OB1 PML. The results you posted here indicate the use of a variety of PMLs on the same machines, something that should not happen between runs of the same OMPI version without additional MCA parameters.
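One way to confirm which PML a run actually selected is to raise the PML framework verbosity (a sketch assuming mpirun as the launcher; with srun the same parameter can presumably be exported as OMPI_MCA_pml_base_verbose=10):
$ mpirun --mca pml_base_verbose 10 -np 2 osu_bcast 2>&1 | grep -i pml   # the selected PML is reported in the verbose output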
As was said above, there were two things going on here:
- One is that when PML UCX fails to initialize, it falls back to PML OB1. That causes other errors that are not relevant to this use case. You can force PML UCX to rule that out (see below).
- You need to list all the required interfaces in UCX_NET_DEVICES, that is, both the TCP interface used by the CM and the MLX device used by ib/rc_mlx5.
The following setup works:
salloc -N 2 -p histamine mpiexec -x UCX_LOG_LEVEL=TRACE -x UCX_NET_DEVICES=mlx5_0:1,ibp193s0f0 --mca pml ucx -N 1 IMB-MPI1 pingpong
Note that when IPoIB is active, that TCP interface (rather than docker0) is used by default by rdma/cm, so plain mpiexec with no extra options works. It's only when the fallback TCP interface is not IPoIB that it picks docker0. I didn't check whether this is by chance or by design, though.
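If the launcher is srun rather than mpiexec, the same selection can presumably be expressed purely through environment variables, since Open MPI also reads MCA parameters from OMPI_MCA_* variables; a sketch reusing the device names above:
$ OMPI_MCA_pml=ucx UCX_NET_DEVICES=mlx5_0:1,ibp193s0f0 srun -n2 -w histamine0,histamine1 osu_bcast   # force the UCX PML and list both the IB device and the IPoIB interface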
Thanks @abouteiller, but the two machines that I'm testing this on have IPoIB configured (and verified), and UCX is still trying to use the docker0 interface if I don't tell it which network devices to use or not use.
$ srun -n2 -p histamine osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
[1695759709.384330] [histamine1:106025:0] sock.c:323 UCX ERROR connect(fd=43, dest_addr=172.17.0.1:33141) failed: Connection refused
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=1758.0 tasks 0-1: running
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=1758.0 tasks 0-1: running
OK, it does work differently when using mpirun/mpiexec as the launcher instead of srun.
So far, it looks like using UCX_NET_DEVICES=mlx5_0:1 is a successful workaround for the case where InfiniBand is connected, but the problem still remains in the case where we want machines to communicate over an Ethernet interface and ignore the docker0 interface. Is the only solution to enumerate all the different Ethernet interfaces and then explicitly list them?
You don't need to list them all; just list the ones that have a unique physical backend (assuming there is no QoS installed on the interfaces that would limit the performance).
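In practice that would mean listing roughly one device per physical network, for example (device names are the ones from this thread, and the sketch assumes both devices actually exist on the nodes):
$ UCX_NET_DEVICES=mlx5_0:1,enp68s0f0 srun -n2 -w histamine0,histamine1 osu_bcast   # one IB device plus one Ethernet interface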
@G-Ragghianti
- With IB: is UCX_NET_DEVICES=mlx5_0:1 really needed when InfiniBand is connected? I'd expect UCX to select the InfiniBand devices since they have higher bandwidth. Can you please run with UCX_LOG_LEVEL=info and post the output?
- Without IB: when you want to run over an Ethernet interface, the only solution for now is to list that Ethernet interface in UCX_NET_DEVICES. We can add logic to UCX to ignore bridge interfaces such as docker0 so that this will not be needed.
- Yes, I can confirm that it prefers the docker0 interface even when an InfiniBand interface is available. Here is the output you requested:
[ICL:methane ~]$ unset UCX_NET_DEVICES
[ICL:methane ~]$ srun -n 2 -w histamine0,histamine1 osu_bcast
[1697054513.763848] [histamine1:6446 :0] ucp_context.c:1969 UCX INFO Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054513.763364] [histamine0:9854 :0] ucp_context.c:1969 UCX INFO Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054513.899640] [histamine1:6446 :0] parser.c:1998 UCX INFO UCX_* env variable: UCX_LOG_LEVEL=info
[1697054513.900230] [histamine1:6446 :0] ucp_context.c:1969 UCX INFO Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054513.914643] [histamine1:6446 :0] ucp_worker.c:1783 UCX INFO ep_cfg[0]: tag(self/memory cma/memory rc_mlx5/mlx5_0:1)
[1697054513.892111] [histamine0:9854 :0] parser.c:1998 UCX INFO UCX_* env variable: UCX_LOG_LEVEL=info
[1697054513.900467] [histamine0:9854 :0] ucp_context.c:1969 UCX INFO Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054513.915254] [histamine0:9854 :0] ucp_worker.c:1783 UCX INFO ep_cfg[0]: tag(self/memory cma/memory rc_mlx5/mlx5_0:1)
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
[1697054513.928277] [histamine1:6446 :0] ucp_worker.c:1783 UCX INFO ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 tcp/enp68s0f0)
[1697054513.928928] [histamine1:6446 :0] sock.c:323 UCX ERROR connect(fd=43, dest_addr=172.17.0.1:46175) failed: Connection refused
With UCX_NET_DEVICES=mlx5_0:1
[ICL:methane ~]$ export UCX_NET_DEVICES=mlx5_0:1
[ICL:methane ~]$ srun -n 2 -w histamine0,histamine1 osu_bcast
[1697054570.396917] [histamine0:9896 :0] ucp_context.c:1969 UCX INFO Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054570.427446] [histamine1:6484 :0] ucp_context.c:1969 UCX INFO Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054570.543673] [histamine1:6484 :0] parser.c:1998 UCX INFO UCX_* env variables: UCX_NET_DEVICES=mlx5_0:1 UCX_LOG_LEVEL=info
[1697054570.544553] [histamine1:6484 :0] ucp_context.c:1969 UCX INFO Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054570.558667] [histamine1:6484 :0] ucp_worker.c:1783 UCX INFO ep_cfg[0]: tag(self/memory cma/memory rc_mlx5/mlx5_0:1)
[1697054570.516758] [histamine0:9896 :0] parser.c:1998 UCX INFO UCX_* env variables: UCX_NET_DEVICES=mlx5_0:1 UCX_LOG_LEVEL=info
[1697054570.544690] [histamine0:9896 :0] ucp_context.c:1969 UCX INFO Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054570.559267] [histamine0:9896 :0] ucp_worker.c:1783 UCX INFO ep_cfg[0]: tag(self/memory cma/memory rc_mlx5/mlx5_0:1)
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
[1697054570.571564] [histamine0:9896 :0] ucp_worker.c:1783 UCX INFO ep_cfg[1]: tag(rc_mlx5/mlx5_0:1)
1 1.29
2 1.29
4 1.30
8 1.29
16 1.30
32 1.39
64 1.81
128 1.49
256 1.90
512 1.96
1024 2.16
2048 2.45
4096 3.06
8192 3.91
16384 5.11
32768 7.95
65536 11.60
131072 19.45
262144 62.87
524288 117.69
1048576 227.14
[1697054570.572278] [histamine1:6484 :a] ucp_worker.c:1783 UCX INFO ep_cfg[1]: tag(rc_mlx5/mlx5_0:1)
Hi @G-Ragghianti, we merged PR https://github.com/openucx/ucx/pull/9475, which should fix the issue. We also pushed the changes to the v1.16.x branch (https://github.com/openucx/ucx/pull/9487).