
Attempts communication over unroutable IP interface

G-Ragghianti opened this issue 1 year ago · 16 comments

Describe the bug

We are trying to use Open MPI + UCX to run a simple benchmark from osu-micro-benchmarks (osu_bcast). When two machines each have a running Docker container, the docker0 interface is up on both, and UCX attempts to use this interface.
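
For reference, ucx_info -d (presumably how the attached ucx_info-d.log was produced) lists the devices UCX detects; a quick way to check whether docker0 is among them:

$ ucx_info -d | grep -B 1 -i docker0
# If docker0 shows up under the tcp transport, UCX treats it as a usable
# network device and may try to connect over it.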

Steps to Reproduce

  • UCX version: 1.14.0
$ srun -n2 -p histamine osu_bcast

# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
[1695332443.361791] [histamine0:1224050:0]            sock.c:323  UCX  ERROR   connect(fd=42, dest_addr=172.17.0.1:54351) failed: Connection refused
[histamine0:1224050] pml_ucx.c:424  Error: ucp_ep_create(proc=1) failed: Destination is unreachable
[histamine0:1224050] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 1
[histamine0:1224050] *** An error occurred in MPI_Bcast
[histamine0:1224050] *** reported by process [2135184123,0]
[histamine0:1224050] *** on communicator MPI_COMM_WORLD
[histamine0:1224050] *** MPI_ERR_OTHER: known error not in list
[histamine0:1224050] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[histamine0:1224050] ***    and potentially your MPI job)
slurmstepd: error: *** STEP 1484.0 ON histamine0 CANCELLED AT 2023-09-21T21:40:43 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: histamine0: task 0: Exited with exit code 16
srun: error: histamine1: task 1: Killed
# Library version: 1.14.0
# Library path: /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucs.so.0
# API headers version: 1.14.0
# Git branch '', revision ae505b9
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg --without-go --disable-doxygen-doc --enable-numa --disable-assertions --enable-compiler-opt=3 --without-java --enable-shared --enable-static --disable-logging --enable-mt --with-openmp --enable-optimizations --disable-params-check --disable-gtest --with-pic --with-cuda=/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/cuda-11.8.0-vltbfy3o7lx4up3gryipectsmvy2fctc --enable-cma --without-dc --without-dm --with-gdrcopy=/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/gdrcopy-2.3-zm6nhbdg72dwuu7yd6ddskjwgrmef4zl --without-ib-hw-tm --without-knem --with-mlx5-dv --with-rc --with-ud --without-xpmem --without-fuse3 --without-bfd --without-rdmacm --with-verbs=/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/rdma-core-41.0-zlh7l5va5ia5pxhyqrilx7tftswu4kke --without-avx --without-sse41 --with-sse42 --without-rocm

Setup and versions

  • OS version: Rocky Linux 9.2

Additional information (depending on the issue)

ucx_error.log ucx_info-d.log

G-Ragghianti · Sep 21 '23

@G-Ragghianti

  1. Can you pls try setting UCX_NET_DEVICES=eno1,eno2 as a workaround?
  2. Can you pls post the output of "ip a s docker0" when docker is running?

yosefe · Sep 22 '23

Thanks for looking into this.

A whitelist of network interfaces is less ideal than a blacklist, since the machines in our cluster have a variety of interface names. However, testing this on two machines with docker0 interfaces produces the following:

$ UCX_NET_DEVICES=enp68s0f0 srun -w histamine0,histamine1 -n 2 osu_bcast

# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      21.15
2                      12.16
4                       3.84
8                      21.99
16                     15.63
32                     22.68
64                     17.78
128                    20.80
256                    21.12
512                    28.57
1024                   38.62
2048                  149.20
4096                  151.73
8192                  163.03
16384                 176.28
32768                 229.11
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: histamine0
  PID:        1820732
--------------------------------------------------------------------------
^Csrun: interrupt (one more within 1 sec to abort)

When executed on two machines without docker0 interfaces, the benchmark runs cleanly, but the performance is significantly lower than expected. This is likely because it is using the TCP interface for all communication instead of the InfiniBand interface for message passing.

Here is the ip config:

4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:af:04:00:ef brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:afff:fe04:ef/64 scope link 
       valid_lft forever preferred_lft forever

I found some code in uct/tcp/tcp_iface.c that seems to define a configuration variable "PREFER_DEFAULT", which would prefer interfaces that have default routing. However, it isn't clear how to set this, and the default value already appears to prefer the default route. Our docker0 interface isn't the default route, so this setting doesn't seem to have any effect here.
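
If that config table follows UCX's usual per-transport naming, I assume it would be exposed as an environment variable along the lines of UCX_TCP_PREFER_DEFAULT (just a guess on my part, not confirmed); listing the runtime configuration should show whether such a variable exists:

$ ucx_info -c | grep -i prefer_default
# Hypothetical: if the variable exists under that name, it could be forced on
# for a run like this:
$ UCX_TCP_PREFER_DEFAULT=y srun -w histamine0,histamine1 -n 2 osu_bcast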

G-Ragghianti · Sep 22 '23

@G-Ragghianti

  1. The error above seems to come from Open MPI. Can you pls try to set OMPI_MCA_btl_tcp_if_include=enp68s0f0 (for example, as sketched below)?
  2. I'm thinking of adding logic to UCX to ignore bridge interfaces (such as docker0) for the TCP transport.
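
For example (using the interface name from your earlier output):

$ OMPI_MCA_btl_tcp_if_include=enp68s0f0 UCX_NET_DEVICES=enp68s0f0 srun -w histamine0,histamine1 -n 2 osu_bcast
# OMPI_MCA_btl_tcp_if_include restricts Open MPI's TCP BTL, while UCX_NET_DEVICES
# restricts UCX itself; both would need to avoid docker0.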

yosefe · Sep 24 '23

Yes, the error comes from OMPI in that case. I've been trying different combinations of OMPI_MCA_btl_tcp_if_exclude, OMPI_MCA_btl_tcp_if_include, and UCX_NET_DEVICES, with mostly confusing results. The only combination that works without errors is the following:

OMPI_MCA_btl_tcp_if_include=ibp193s0f0 UCX_NET_DEVICES=ibp193s0f0

However, this results in slower performance than the InfiniBand network should provide.

$ OMPI_MCA_btl_tcp_if_exclude=docker0 srun -w histamine0,histamine1 -n 2 osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
[1695675466.657697] [histamine1:7221 :0]            sock.c:323  UCX  ERROR   connect(fd=43, dest_addr=172.17.0.1:60657) failed: Connection refused
^C


$ OMPI_MCA_btl_tcp_if_exclude=docker0 UCX_NET_DEVICES=ibp193s0f0 srun -w histamine0,histamine1 -n 2 osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: histamine0
  PID:        29507
--------------------------------------------------------------------------
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=1629.0 tasks 0-1: running
^Csrun: sending Ctrl-C to StepId=1629.0
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.


$ OMPI_MCA_btl_tcp_if_include=ibp193s0f0 UCX_NET_DEVICES=ibp193s0f0 srun -w histamine0,histamine1 -n 2 osu_bcast
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      13.34
...
1048576              1947.68
# This is too slow

G-Ragghianti · Sep 25 '23

@abouteiller

G-Ragghianti · Sep 25 '23

Basically, the BTLs are used by the OB1 PML, while the UCX net devices are used by the UCX PML. Since OMPI's current default is UCX (where supported), any change to the OMPI_MCA_btl_tcp_* parameters will be ignored. Thus, the only setting that matters is UCX_NET_DEVICES=ibp193s0f0.
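
To make the PML choice explicit (and to fail fast if the UCX PML cannot initialize, rather than silently falling back to OB1), you can pin the PML through the usual MCA environment-variable mapping; a rough sketch with srun as the launcher:

$ OMPI_MCA_pml=ucx UCX_NET_DEVICES=ibp193s0f0 srun -n2 -w histamine0,histamine1 osu_bcast
# With mpirun/mpiexec the equivalent is adding --mca pml ucx to the command line.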

bosilca · Sep 25 '23

But when I use just UCX_NET_DEVICES=ibp193s0f0, I get the following error from OMPI:

$ UCX_NET_DEVICES=ibp193s0f0 srun -n2 -w histamine0,histamine1 osu_bcast

# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: histamine0
  PID:        34418
--------------------------------------------------------------------------


G-Ragghianti · Sep 25 '23

This is indeed a message from the TCP BTL, which seems to indicate that your run did not use the UCX PML but instead relied on the OB1 PML. The results you posted here indicate the use of a variety of PMLs on the same machines, something that should not happen between runs of the same OMPI version without additional MCA parameters.
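
If you want to confirm which PML each run actually selects, raising the PML framework verbosity should show it (a diagnostic sketch, assuming the standard pml_base_verbose MCA parameter):

$ OMPI_MCA_pml_base_verbose=10 srun -n2 -w histamine0,histamine1 osu_bcast 2>&1 | grep -i pml
# Each rank should log which PML component (ucx or ob1) was selected.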

bosilca · Sep 26 '23

As was said above, there are two things going on here:

  1. One is that when PML UCX fails to initialize, it falls back to using PML OB1. That causes other errors that are not relevant to this use case. You can force PML UCX to rule that out (see below).
  2. You need to list all the required interfaces in UCX_NET_DEVICES, that is, both the TCP interface used by the CM and the mlx5 device used by ib/rc_mlx5.

The following setup works:

salloc -N 2 -p histamine mpiexec -x UCX_LOG_LEVEL=TRACE -x UCX_NET_DEVICES=mlx5_0:1,ibp193s0f0 --mca pml ucx -N 1 IMB-MPI1 pingpong

Note that when IPoIB is active, that TCP interface is used by default by rdma/cm rather than docker0, so plain mpiexec with no extra options works. It's only when the fallback TCP interface is not IPoIB that it picks docker0. I didn't check whether this is by chance or by design, though.

abouteiller · Sep 26 '23

Thanks @abouteiller, but the two machines that I'm testing this on have IPoIB configured (and verified), and UCX is still trying to use the docker0 interface if I don't explicitly tell it which network devices to use or avoid.

$ srun -n2 -p histamine osu_bcast

# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
[1695759709.384330] [histamine1:106025:0]            sock.c:323  UCX  ERROR   connect(fd=43, dest_addr=172.17.0.1:33141) failed: Connection refused
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=1758.0 tasks 0-1: running
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=1758.0 tasks 0-1: running

G-Ragghianti · Sep 26 '23

OK, it does work differently when using mpirun/mpiexec as the launcher instead of srun.

G-Ragghianti · Sep 26 '23

So far, it looks like using UCX_NET_DEVICES=mlx5_0:1 is a successful workaround for the case where InfiniBand is connected, but the problem remains for the case where we want machines to communicate over an Ethernet interface and ignore the docker0 interface. Is the only solution to enumerate all the different Ethernet interfaces and explicitly list them?

G-Ragghianti · Sep 27 '23

You don't need to list them all, just the ones that have a unique physical backend (assuming there is no QoS configured on the interfaces that would limit performance).
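
For example, reusing the device names that already appeared in this thread (adjust per machine, this is only an illustration):

$ UCX_NET_DEVICES=mlx5_0:1,enp68s0f0 srun -n2 -w histamine0,histamine1 osu_bcast
# One entry per distinct physical network (the IB HCA port plus one Ethernet NIC);
# docker0 is simply not listed, so UCX never considers it.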

bosilca · Sep 27 '23

@G-Ragghianti

  1. With IB - Is UCX_NET_DEVICES=mlx5_0:1 really needed when InfiniBand is connected? I'd expect UCX to select the InfiniBand devices since they have higher bandwidth. Can you please run with UCX_LOG_LEVEL=info and post the output?
  2. Without IB - When you want to run over an Ethernet interface, the only solution for now is to list that Ethernet interface in UCX_NET_DEVICES. We can add logic to UCX to ignore bridge interfaces such as docker0 so this will not be needed.

yosefe · Oct 08 '23

  1. Yes, I can confirm that it prefers to use the docker0 interface even if an InfiniBand interface is available. Here is the output you requested:
[ICL:methane ~]$ unset UCX_NET_DEVICES
[ICL:methane ~]$ srun -n 2 -w histamine0,histamine1 osu_bcast
[1697054513.763848] [histamine1:6446 :0]     ucp_context.c:1969 UCX  INFO  Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054513.763364] [histamine0:9854 :0]     ucp_context.c:1969 UCX  INFO  Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054513.899640] [histamine1:6446 :0]          parser.c:1998 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
[1697054513.900230] [histamine1:6446 :0]     ucp_context.c:1969 UCX  INFO  Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054513.914643] [histamine1:6446 :0]      ucp_worker.c:1783 UCX  INFO    ep_cfg[0]: tag(self/memory cma/memory rc_mlx5/mlx5_0:1)
[1697054513.892111] [histamine0:9854 :0]          parser.c:1998 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
[1697054513.900467] [histamine0:9854 :0]     ucp_context.c:1969 UCX  INFO  Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054513.915254] [histamine0:9854 :0]      ucp_worker.c:1783 UCX  INFO    ep_cfg[0]: tag(self/memory cma/memory rc_mlx5/mlx5_0:1)

# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
[1697054513.928277] [histamine1:6446 :0]      ucp_worker.c:1783 UCX  INFO    ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 tcp/enp68s0f0)
[1697054513.928928] [histamine1:6446 :0]            sock.c:323  UCX  ERROR   connect(fd=43, dest_addr=172.17.0.1:46175) failed: Connection refused

With UCX_NET_DEVICES=mlx5_0:1

[ICL:methane ~]$ export UCX_NET_DEVICES=mlx5_0:1
[ICL:methane ~]$ srun -n 2 -w histamine0,histamine1 osu_bcast
[1697054570.396917] [histamine0:9896 :0]     ucp_context.c:1969 UCX  INFO  Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054570.427446] [histamine1:6484 :0]     ucp_context.c:1969 UCX  INFO  Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054570.543673] [histamine1:6484 :0]          parser.c:1998 UCX  INFO  UCX_* env variables: UCX_NET_DEVICES=mlx5_0:1 UCX_LOG_LEVEL=info
[1697054570.544553] [histamine1:6484 :0]     ucp_context.c:1969 UCX  INFO  Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054570.558667] [histamine1:6484 :0]      ucp_worker.c:1783 UCX  INFO    ep_cfg[0]: tag(self/memory cma/memory rc_mlx5/mlx5_0:1)
[1697054570.516758] [histamine0:9896 :0]          parser.c:1998 UCX  INFO  UCX_* env variables: UCX_NET_DEVICES=mlx5_0:1 UCX_LOG_LEVEL=info
[1697054570.544690] [histamine0:9896 :0]     ucp_context.c:1969 UCX  INFO  Version 1.14.0 (loaded from /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/ucx-1.14.0-6ffd5tfh7oi2syb6vbysamiw64ej4qyg/lib/libucp.so.0)
[1697054570.559267] [histamine0:9896 :0]      ucp_worker.c:1783 UCX  INFO    ep_cfg[0]: tag(self/memory cma/memory rc_mlx5/mlx5_0:1)

# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
[1697054570.571564] [histamine0:9896 :0]      ucp_worker.c:1783 UCX  INFO    ep_cfg[1]: tag(rc_mlx5/mlx5_0:1)
1                       1.29
2                       1.29
4                       1.30
8                       1.29
16                      1.30
32                      1.39
64                      1.81
128                     1.49
256                     1.90
512                     1.96
1024                    2.16
2048                    2.45
4096                    3.06
8192                    3.91
16384                   5.11
32768                   7.95
65536                  11.60
131072                 19.45
262144                 62.87
524288                117.69
1048576               227.14
[1697054570.572278] [histamine1:6484 :a]      ucp_worker.c:1783 UCX  INFO    ep_cfg[1]: tag(rc_mlx5/mlx5_0:1)


G-Ragghianti · Oct 11 '23

Hi @G-Ragghianti, we merged PR https://github.com/openucx/ucx/pull/9475, which should fix the issue. We also pushed the changes to the v1.16.x branch (https://github.com/openucx/ucx/pull/9487).

rakhmets · Dec 15 '23