
When testing ROCm D2D transfers with UCX_TLS=rc, how does setting UCX_IB_GPU_DIRECT_RDMA=0 affect the osu_bw test results?

Open shuiYizero opened this issue 1 year ago • 5 comments

When using UCX_TLS=rc to test ROCm D2D transfers, setting UCX_IB_GPU_DIRECT_RDMA=0 doesn't affect the osu_bw test results. Is this because rc doesn't use GPUDirect RDMA technology, or is it because GPUDirect RDMA is enabled by default when using rc?

shuiYizero avatar Aug 20 '24 21:08 shuiYizero

rc transports can use the GPUDirect RDMA feature. The default value of UCX_IB_GPU_DIRECT_RDMA is 'try', which means GPUDirect RDMA will be used if UCX finds the necessary driver on the target system; in the case of ROCm, that is the ROCm KFD driver. Please try setting UCX_IB_GPU_DIRECT_RDMA=1: you will see an error message if the driver cannot be found on your system.
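
A minimal sketch of that check, based on the command line from this thread (hostnames and device names are the reporter's; UCX_LOG_LEVEL is a standard UCX variable added here to surface transport selection, and is not part of the original command):

```shell
# With the default 'try', a missing KFD driver falls back silently to
# staging through host memory; with '1', UCX reports an error instead.
mpirun -np 2 -H a:1,b:1 -mca pml ucx \
    -x UCX_TLS=rc \
    -x UCX_IB_GPU_DIRECT_RDMA=1 \
    -x UCX_LOG_LEVEL=info \
    -x LD_LIBRARY_PATH \
    osu_bw -d rocm D D
```

With UCX_LOG_LEVEL=info, UCX also logs which transports and memory types it selects, which makes it easier to see whether GPU memory is being recognized at all.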

rakhmets avatar Aug 21 '24 10:08 rakhmets

You’re right, but what puzzles me is that when I set UCX_IB_GPU_DIRECT_RDMA=0, my test results are the same as when UCX_IB_GPU_DIRECT_RDMA=1. Do you know why this happens?

mpirun -np 2 -H a:1,b:1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1  -x UCX_TLS=rc -x  UCX_IB_GPU_DIRECT_RDMA=0 -x LD_LIBRARY_PATH osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1                       0.79
2                       1.57
4                       3.14
8                       6.29
16                      6.69
32                      7.57
64                      8.24
128                     8.39
256                     8.45
512                     8.56
1024                    8.59
2048                    8.62
4096                    8.63
8192                    8.63
16384                5958.54
32768                3811.03
65536                3251.07
131072               3263.04
262144               3273.16
524288               3272.21
1048576              3277.51
2097152              3277.63
4194304              3275.10
mpirun -np 2 -H a:1,b:1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_TLS=rc -x UCX_IB_GPU_DIRECT_RDMA=1 -x LD_LIBRARY_PATH osu_bw -d rocm D D

1                       0.78
2                       1.57
4                       3.14
8                       6.28
16                      7.09
32                      7.57
64                      8.25
128                     8.39
256                     8.43
512                     8.53
1024                    8.59
2048                    8.62
4096                    8.63
8192                    8.64
16384                5924.55
32768                3822.05
65536                3252.99
131072               3269.29
262144               3269.66
524288               3274.54
1048576              3278.45
2097152              3276.73
4194304              3276.40

shuiYizero avatar Aug 21 '24 10:08 shuiYizero

I would not set UCX_TLS=rc; you are basically excluding the ROCm components. At the bare minimum, UCX will not be able to detect/recognize the ROCm memory types, i.e. it will not be able to tell that it is dealing with GPU memory, and I am not 100% sure what the impact of that is. I would recommend setting at least UCX_TLS=rocm,rc.
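
Applied to the command line from the earlier comment, that suggestion would look like this (a sketch; hostnames, device list, and benchmark invocation are taken unchanged from the original run):

```shell
# Include the rocm component so UCX can recognize ROCm (GPU) memory,
# alongside rc for the InfiniBand transport.
mpirun -np 2 -H a:1,b:1 -mca pml ucx \
    -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 \
    -x UCX_TLS=rocm,rc \
    -x LD_LIBRARY_PATH \
    osu_bw -d rocm D D
```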

edgargabriel avatar Aug 21 '24 13:08 edgargabriel

I am not entirely sure what generation of IB hardware you are using, but the bandwidth values that you show are very low; most likely data is being funneled through CPU memory in your case. I would recommend a) first trying only one HCA at a time (ideally the one closest to the GPU that you are using), and b) double-checking that ACS is disabled on your system, since it can prevent direct GPU-to-HCA communication. You should not have to worry about the UCX_IB_GPU_DIRECT_RDMA setting; we usually do not need to set that value to achieve full line BW.
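
The two checks can be sketched as follows (assumptions: mlx5_0 is the HCA closest to the GPU under test, which may not match your topology, and lspci needs root to dump the ACS capability fields):

```shell
# a) Restrict UCX to a single HCA close to the GPU under test.
mpirun -np 2 -H a:1,b:1 -mca pml ucx \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    -x UCX_TLS=rocm,rc \
    -x LD_LIBRARY_PATH \
    osu_bw -d rocm D D

# b) Look for PCIe ACS being enabled anywhere on the system; an ACSCtl
#    line showing SrcValid+ means ACS is redirecting peer-to-peer
#    traffic through the root complex, defeating direct GPU-to-HCA DMA.
sudo lspci -vvv | grep -i acsctl
```

The single-HCA run also rules out path-selection effects; if bandwidth jumps when using only the closest HCA, the original numbers were limited by cross-socket or cross-switch PCIe traffic.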

edgargabriel avatar Aug 21 '24 13:08 edgargabriel

Also, are you using a Mellanox OFED driver on your system, or the standard Linux RDMA packages? I would recommend MOFED for easier interaction with the GPUs.
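
A quick way to tell which stack is installed (assumption: the `ofed_info` utility ships with MOFED and is absent with the inbox distro RDMA packages):

```shell
# Prints the MLNX_OFED version string if the Mellanox stack is installed;
# on inbox RDMA packages the command typically does not exist.
ofed_info -s 2>/dev/null || echo "ofed_info not found: likely inbox RDMA packages"
```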

edgargabriel avatar Aug 21 '24 13:08 edgargabriel