Open UCX 1.15.0 with Open MPI 4.1.1 - osu_iallgather/osu_iallgatherv hangs when the message size reaches 65536
Describe the bug
We use Open UCX 1.15.0 with Open MPI 4.1.1 to run osu_iallgather/osu_iallgatherv. When the message size reaches 65536, the program hangs: we waited at least 30 minutes and nothing more was printed.
Things we have tried
- adding `-x UCX_RC_MLX5_RX_QUEUE_LEN=8191`: it works!
- adding `-x UCX_RNDV_THRESH=8192`: it also works!
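Because the hang only shows up as missing output, one way to compare the baseline with the two workarounds is to run each command under a time limit. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the launch node; `run_with_limit` and `LIMIT` are hypothetical names chosen for illustration and are not part of UCX or Open MPI.

```shell
# Sketch: report a hang instead of waiting indefinitely.
# LIMIT is an assumed value (seconds) matching the 30-minute wait described above.
LIMIT=${LIMIT:-1800}

# Run a command under GNU `timeout`; exit status 124 means the limit was hit.
run_with_limit() {
    rc=0
    timeout "$LIMIT" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HANG after ${LIMIT}s: $*"
    else
        echo "exit $rc: $*"
    fi
    return 0
}
```

It would be used by prefixing the full `mpirun` command line, for example `run_with_limit mpirun -x UCX_RNDV_THRESH=8192 ...` with the remaining arguments taken from the command under "Steps to Reproduce".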
Steps to Reproduce
- Command line
mpirun -x UCX_TLS=sm,rc_x -x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2
- UCX version used: 1.15.0
- UCX configure flags (can be checked by `ucx_info -v`)
    Library version: 1.15.0
    Library path: /lib/libucs.so.0
    API headers version: 1.15.0
    Git branch '', revision
    Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations --prefix=/openucx --enable-mt
- Any UCX environment variables used
- UCX_TLS=sm,rc_x
- UCX_NET_DEVICES=mlx5_1:1
Setup and versions
- OS version (e.g Linux distro)
  - Linux 6426-node125 4.19.90-2112.8.0.0131.oe1.aarch64 #1 SMP Fri Dec 31 19:53:20 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
- CPU architecture (x86_64/aarch64/ppc64le/...)
  - aarch64
- For RDMA/IB/RoCE related issues:
  - Driver version:
    - rdma-core-54mlnx1-1.54303.aarch64
    - MLNX_OFED_LINUX-5.4-3.0.3.0
  - HW information from `ibstat` or `ibv_devinfo -vv` command
      CA 'mlx5_1'
        CA type: MT4121
        Number of ports: 1
        Firmware version: 16.31.2006
        Hardware version: 0
        Node GUID: 0x98039b030071f6e9
        System image GUID: 0x98039b030071f6e8
        Port 1:
          State: Active
          Physical state: LinkUp
          Rate: 100
          Base lid: 0
          LMC: 0
          SM lid: 0
          Capability mask: 0x00010000
          Port GUID: 0x9a039bfffe71f6e9
          Link layer: Ethernet
Additional information (depending on the issue)
- OpenMPI version
  - Open MPI 4.1.1
- OSU version
  - osu-micro-benchmarks-7.1-1
- Output log
  - osu_iallgatherv, adding `-x UCX_RC_MLX5_RX_QUEUE_LEN=8191`
  - osu_iallgatherv, adding `-x UCX_RNDV_THRESH=8192`
  - osu_iallgather, adding `-x UCX_RC_MLX5_RX_QUEUE_LEN=8191`
  - osu_iallgather, adding `-x UCX_RNDV_THRESH=8192`
Hi,
I noticed that when you set UCX_RNDV_THRESH=8192, you didn't set UCX_TLS=sm,rc_x. My guess is that in the UCX_RNDV_THRESH=8192 case the run succeeded because UCX selected a different transport.
Does the program hang if the command line contains UCX_TLS=sm,rc_x along with UCX_RNDV_THRESH=8192?
mpirun -x UCX_RNDV_THRESH=8192 -x UCX_TLS=sm,rc_x -x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2
Does the program hang if the command line doesn't contain UCX_TLS=sm,rc_x?
mpirun -x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2
Does the program hang if the command line contains UCX_TLS=sm,rc_x,dc?
mpirun -x UCX_TLS=sm,rc_x,dc -x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2
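The three command lines above differ only in the leading `-x` options, so they can be generated from one template to avoid copy-paste mistakes. This is a small sketch; `build_cmd` and `HOSTFILE` are hypothetical names, and `hostfile_path` is the same placeholder used in the commands above.

```shell
# Sketch: print the three mpirun command lines, which differ only in the extra -x options.
HOSTFILE=${HOSTFILE:-hostfile_path}   # placeholder path, as in the commands above

# Emit one command line; the arguments are the extra "-x VAR=value" options (may be none).
build_cmd() {
    extra=$*
    echo "mpirun${extra:+ $extra} -x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128" \
         "--hostfile $HOSTFILE -mca pml ucx -mca btl ^vader,tcp,openib,uct" \
         "osu_iallgather -i 2"
}

build_cmd -x UCX_RNDV_THRESH=8192 -x UCX_TLS=sm,rc_x   # rc_x with the lower rendezvous threshold
build_cmd                                              # no UCX_TLS, UCX picks the transports
build_cmd -x UCX_TLS=sm,rc_x,dc                        # dc allowed in addition to rc_x
```

The script only prints the commands; each printed line can then be run on the cluster as-is.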
Thanks for your reply! The following screenshots show the results of the runs I tried.
- contains UCX_TLS=sm,rc_x along with UCX_RNDV_THRESH=8192
- doesn't contain UCX_TLS=sm,rc_x
- contains UCX_TLS=sm,rc_x,dc