When testing ROCm D2D transfers with UCX_TLS=rc, how does setting UCX_IB_GPU_DIRECT_RDMA=0 affect the osu_bw test results?
When using UCX_TLS=rc to test ROCm D2D transfers, setting UCX_IB_GPU_DIRECT_RDMA=0 doesn't affect the osu_bw test results. Is this because rc doesn't use GPUDirect RDMA technology, or is it because GPUDirect RDMA is enabled by default when using rc?
The rc transports can use the GPUDirect RDMA feature.
The default value of UCX_IB_GPU_DIRECT_RDMA is 'try'. This means that GPUDirect RDMA will be used if UCX finds the necessary driver on the target system, which in the ROCm case is the ROCm KFD driver.
Please try setting UCX_IB_GPU_DIRECT_RDMA=1. You will see an error message if the driver cannot be found on your system.
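As a quick sanity check (a sketch, assuming the ucx_info utility from your UCX installation is in PATH), you can confirm the configured value and then force the feature on so that any missing-driver error shows up at startup:

# Show the configured value of the GPUDirect RDMA option (default should be 'try')
ucx_info -c | grep GPU_DIRECT_RDMA
# Force it on; UCX will complain at startup if the GPUDirect RDMA path is unavailable
mpirun -np 2 -H a:1,b:1 -mca pml ucx -x UCX_TLS=rc -x UCX_IB_GPU_DIRECT_RDMA=1 -x LD_LIBRARY_PATH osu_bw -d rocm D D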
You’re right, but what puzzles me is that when I set UCX_IB_GPU_DIRECT_RDMA=0, my test results are the same as when UCX_IB_GPU_DIRECT_RDMA=1. Do you know why this happens?
mpirun -np 2 -H a:1,b:1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_TLS=rc -x UCX_IB_GPU_DIRECT_RDMA=0 -x LD_LIBRARY_PATH osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.79
2 1.57
4 3.14
8 6.29
16 6.69
32 7.57
64 8.24
128 8.39
256 8.45
512 8.56
1024 8.59
2048 8.62
4096 8.63
8192 8.63
16384 5958.54
32768 3811.03
65536 3251.07
131072 3263.04
262144 3273.16
524288 3272.21
1048576 3277.51
2097152 3277.63
4194304 3275.10
mpirun -np 2 -H a:1,b:1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_TLS=rc -x UCX_IB_GPU_DIRECT_RDMA=1 -x LD_LIBRARY_PATH osu_bw -d rocm D D
1 0.78
2 1.57
4 3.14
8 6.28
16 7.09
32 7.57
64 8.25
128 8.39
256 8.43
512 8.53
1024 8.59
2048 8.62
4096 8.63
8192 8.64
16384 5924.55
32768 3822.05
65536 3252.99
131072 3269.29
262144 3269.66
524288 3274.54
1048576 3278.45
2097152 3276.73
4194304 3276.40
I would not set UCX_TLS=rc; you are basically excluding the rocm components. At the bare minimum, UCX will not be able to detect/recognize the rocm memory types, i.e. it will not be able to tell that it is dealing with GPU memory, and I am not 100% sure what the impact of that is. I would recommend at least setting UCX_TLS=rocm,rc.
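For example, keeping your command line but adding the rocm components (a sketch based on your original invocation):

mpirun -np 2 -H a:1,b:1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_TLS=rocm,rc -x LD_LIBRARY_PATH osu_bw -d rocm D D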
I am not entirely sure what generation of IB hardware you are using, but the bandwidth values you show are very low; most likely data is being funneled through CPU memory in your case. I would recommend a) first trying only one HCA at a time (ideally the one closest to the GPU you are using), and b) double-checking that ACS is disabled on your system, since it might prevent direct GPU-to-HCA communication. You should not have to worry about the UCX_IB_GPU_DIRECT_RDMA setting; we usually don't set that value in order to achieve full line-rate bandwidth.
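Something along these lines (a sketch; mlx5_0 is just a placeholder, pick the HCA that sits on the same PCIe root complex as the GPU you are benchmarking):

# Run with a single HCA, ideally the one closest to the GPU
mpirun -np 2 -H a:1,b:1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rocm,rc -x LD_LIBRARY_PATH osu_bw -d rocm D D
# Check ACS on the PCIe bridges; the ACSCtl flags should all show '-' (disabled)
sudo lspci -vvv | grep -i acsctl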
Also, are you using a Mellanox OFED driver on your system, or the standard Linux RDMA packages? I would recommend MOFED for easier interaction with the GPUs.
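If you are unsure which stack is installed, this should tell you (assuming the MOFED utilities are present; on a plain inbox RDMA stack the command will not exist):

# Prints the MLNX_OFED version string if the Mellanox OFED stack is installed
ofed_info -s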