RDMA Read Error on DGX B200 systems
We are running an application on a single node with CUDA 12.8, using Open MPI 5.0.7 with UCX 1.8.0, gdrcopy 1.5.0, and nv_peer_mem v1.3.
We are getting the error below:

```
ib_mlx5_log.c:179  Remote operation error on mlx5_4:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
ib_mlx5_log.c:179  RC QP 0x53 wqe[0]: RDMA_READ s-- [rva 0x7fe7d1a00000 rkey 0x180ac9] [va 0x7f15bfe00000 len 10020 lkey 0x182fee] [rqpn 0x5f dlid=65 sl=0 port=1 src_path_bits=0]
==== backtrace (tid:   6594) ====
 0 /shared_data/third_party/openmpi-5.0.7/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f18ebadf934]
 1 /shared_data/third_party/openmpi-5.0.7/lib/libucs.so.0(ucs_fatal_error_message+0xc2) [0x7f18ebadc9c2]
 2 /shared_data/third_party/openmpi-5.0.7/lib/libucs.so.0(ucs_log_default_handler+0xf7e) [0x7f18ebae15ee]
 3 /shared_data/third_party/openmpi-5.0.7/lib/libucs.so.0(ucs_log_dispatch+0xe4) [0x7f18ebae1a14]
 4 /shared_data/third_party/openmpi-5.0.7/lib/ucx/libuct_ib_mlx5.so.0(uct_ib_mlx5_completion_with_err+0x60d) [0x7f160210036d]
 5 /shared_data/third_party/openmpi-5.0.7/lib/ucx/libuct_ib_mlx5.so.0(uct_rc_mlx5_iface_handle_failure+0x134) [0x7f1602115f24]
 6 /shared_data/third_party/openmpi-5.0.7/lib/ucx/libuct_ib_mlx5.so.0(uct_ib_mlx5_check_completion+0x3d) [0x7f160210153d]
 7 /shared_data/third_party/openmpi-5.0.7/lib/ucx/libuct_ib_mlx5.so.0(+0x2b2f7) [0x7f16021172f7]
 8 /shared_data/third_party/openmpi-5.0.7/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7f18ebb6c05a]
 9 /shared_data/third_party/openmpi-5.0.7/lib/libopen-pal.so.80(opal_progress+0x34) [0x7f18ebc2e734]
10 /shared_data/third_party/openmpi-5.0.7/lib/libmpi.so.40(ompi_request_default_wait+0x140) [0x7f1904112020]
11 /shared_data/third_party/openmpi-5.0.7/lib/libmpi.so.40(MPI_Wait+0x54) [0x7f190415de84]
```
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Open MPI 5.0.7, UCX 1.8.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Open MPI was compiled from source.
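For reference, a typical configure sequence for a CUDA-aware Open MPI + UCX build of this kind is sketched below; the prefixes and paths are placeholders rather than the exact ones used on this system:

```bash
# UCX: enable CUDA and gdrcopy support (placeholder paths)
./contrib/configure-release --prefix=/opt/ucx \
    --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local
make -j && make install

# Open MPI: build against CUDA and the UCX installation above (placeholder paths)
./configure --prefix=/opt/openmpi-5.0.7 \
    --with-cuda=/usr/local/cuda --with-ucx=/opt/ucx
make -j && make install
```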
Please describe the system on which you are running
- Operating system/version: Ubuntu 22.04.4 LTS
- Computer hardware: NVIDIA DGX B200
- Network type: Infiniband NDR
Details of the problem
The command we are issuing is:
```bash
/third_party/openmpi-5.0.7/bin/mpirun -np 8 -hostfile ./hostfile \
    --report-bindings --bind-to core --map-by ppr:8:node:PE=14 \
    --mca pml ucx --mca btl ^openib \
    -x UCX_TLS=self,sm,cma,cuda_copy,gdr_copy,rc_v \
    -x UCX_IB_GPU_DIRECT_RDMA=1 \
    ./mpi_rail_mapping_b200.sh /install_path/openmpi507-25.6.0/bin/OurExecutable
```
and the contents of mpi_rail_mapping.sh are:
```bash
#!/bin/bash
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
IB_DEVS=(4 7 8 9 10 13 14 15)
CUDA_DEV=$LOCAL_RANK
IB_DEV=${IB_DEVS[$LOCAL_RANK]}
export UCX_NET_DEVICES=mlx5_$IB_DEV:1
echo "local rank $CUDA_DEV: using hca $IB_DEV"
exec $*
```
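One small note on the launcher's last line: `exec $*` re-splits the forwarded arguments on whitespace, so an argument containing spaces would be broken apart. The quoted form below behaves identically for the command shown here but preserves argument boundaries in general:

```bash
# Forward each original argument intact instead of re-splitting on whitespace
exec "$@"
```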
The IB_DEVS array is set to map each GPU to its nearest IB card, according to the output of nvidia-smi topo -m shown below:
```
nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  NIC9  NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0    X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  PXB   NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   0-55          0              N/A
GPU1    NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  NODE  NODE  NODE  PXB   NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   0-55          0              N/A
GPU2    NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  PXB   NODE  SYS   SYS   SYS   SYS   SYS   SYS   0-55          0              N/A
GPU3    NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  PXB   SYS   SYS   SYS   SYS   SYS   SYS   0-55          0              N/A
GPU4    NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   PXB   NODE  NODE  NODE  NODE  NODE  56-111        1              N/A
GPU5    NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  PXB   NODE  NODE  56-111        1              N/A
GPU6    NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  PXB   NODE  56-111        1              N/A
GPU7    NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE  PXB   56-111        1              N/A
NIC0    NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     PIX   PIX   PIX   NODE  NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS
NIC1    NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   PIX   X     PIX   PIX   NODE  NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS
NIC2    NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   PIX   PIX   X     PIX   NODE  NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS
NIC3    NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   PIX   PIX   PIX   X     NODE  NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS
NIC4    PXB   NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  X     NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS
NIC5    NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE  X     PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS
NIC6    NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE  PIX   X     NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS
NIC7    NODE  PXB   NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE  NODE  NODE  X     NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS
NIC8    NODE  NODE  PXB   NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  X     NODE  SYS   SYS   SYS   SYS   SYS   SYS
NIC9    NODE  NODE  NODE  PXB   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  X     SYS   SYS   SYS   SYS   SYS   SYS
NIC10   SYS   SYS   SYS   SYS   PXB   NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  NODE  NODE
NIC11   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  X     PIX   NODE  NODE  NODE
NIC12   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  PIX   X     NODE  NODE  NODE
NIC13   SYS   SYS   SYS   SYS   NODE  PXB   NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     NODE  NODE
NIC14   SYS   SYS   SYS   SYS   NODE  NODE  PXB   NODE  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  X     NODE
NIC15   SYS   SYS   SYS   SYS   NODE  NODE  NODE  PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE  X
```
Legend:
```
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
```
NIC Legend:
```
NIC0:  mlx5_0
NIC1:  mlx5_1
NIC2:  mlx5_2
NIC3:  mlx5_3
NIC4:  mlx5_4
NIC5:  mlx5_5
NIC6:  mlx5_6
NIC7:  mlx5_7
NIC8:  mlx5_8
NIC9:  mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
NIC12: mlx5_12
NIC13: mlx5_13
NIC14: mlx5_14
NIC15: mlx5_15
```
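Putting the topology matrix, the NIC legend, and the IB_DEVS array together, the intent is one PXB-attached HCA per GPU (assuming each local rank r uses GPU r, as the script implies):

```
local rank 0 -> GPU0 -> mlx5_4   (NIC4,  PXB)
local rank 1 -> GPU1 -> mlx5_7   (NIC7,  PXB)
local rank 2 -> GPU2 -> mlx5_8   (NIC8,  PXB)
local rank 3 -> GPU3 -> mlx5_9   (NIC9,  PXB)
local rank 4 -> GPU4 -> mlx5_10  (NIC10, PXB)
local rank 5 -> GPU5 -> mlx5_13  (NIC13, PXB)
local rank 6 -> GPU6 -> mlx5_14  (NIC14, PXB)
local rank 7 -> GPU7 -> mlx5_15  (NIC15, PXB)
```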
If we permute the IB_DEVS mapping to, say, (4, 8, 9, 10, 13, 14, 15, 7), the run completes, but I assume that is only because the messages then cross the NUMA interconnect and GPUDirect RDMA is no longer being used.
Also, if I add cuda_ipc to UCX_TLS it works, but that is because the GPU-to-GPU traffic then goes over NVLink instead of the IB cards.
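For completeness, here is a quick sanity check of the GPUDirect RDMA stack on the node. This is a hedged sketch: the relevant kernel module is nv_peer_mem on older driver stacks or the in-tree nvidia_peermem on newer ones, and ucx_info is assumed to be the one from the UCX build used by this Open MPI install:

```bash
# Kernel modules needed for GPUDirect RDMA and gdrcopy
lsmod | grep -E 'nv_peer_mem|nvidia_peermem|gdrdrv'

# Confirm the UCX build exposes the CUDA-related transports (cuda_copy, cuda_ipc, gdr_copy)
ucx_info -d | grep -i -E 'cuda|gdr'
```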
Any advice would be appreciated.
@Akshay-Venkatesh ^ any advice?