ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 <IP> - Couldn't allocate MR
Hi,
We are trying to test GPUDirect RDMA.
The test pods were deployed from https://github.com/Mellanox/k8s-images
We deployed two pods:
Server pod:
root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0
- Waiting for client to connect... *
Client pod:
root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 192.168.111.1
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 02:00

Picking device No. 0
[pid = 56, dev = 0] device name = [NVIDIA A30-8C]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 262144 bytes GPU buffer
allocated GPU buffer address at 0000010013000000 pointer=0x10013000000
Couldn't allocate MR
failed to create mr
Failed to create MR
Failed to initialize RDMA contexts. ERRNO: Bad address.
Failed to handle RDMA CM event. ERRNO: Bad address.
Failed to connect RDMA CM events. ERRNO: Bad address.
Segmentation fault (core dumped)
What does "Couldn't allocate MR" mean?
Thanks in advance.
Sorry, to provide some more context: I am testing the GPU Operator + Network Operator. nv-peermem has been enabled in the GPU Operator deployment.
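Even when the operator is configured to enable it, it is worth confirming the peer-memory module is actually loaded on the worker node. A sketch of that check, assuming the default gpu-operator namespace and the standard driver daemonset label from a default GPU Operator install (<driver-pod> is a placeholder to substitute):

```shell
# List the driver pods and find the one on the RDMA node under test
# (label assumed from a default GPU Operator deployment)
kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset -o wide

# <driver-pod> is a placeholder; substitute the pod running on your node.
# grep -i catches both nvidia_peermem and the older nv_peer_mem name.
kubectl -n gpu-operator exec <driver-pod> -- sh -c 'lsmod | grep -i peermem'
```

An empty result from the second command means the module is not loaded on that node, which matches the MR registration failure above.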
Hi! Have you solved this problem yet? I have encountered the same issue and would like to know how you solved it. Thanks in advance.
Try running sudo modprobe nvidia-peermem on the host.
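A minimal sketch of loading and verifying the module on the worker node (the nvidia-peermem name assumes a recent NVIDIA driver; older Mellanox stacks ship the module as nv_peer_mem instead):

```shell
# Load the peer-memory kernel module (ships with recent NVIDIA drivers)
sudo modprobe nvidia-peermem

# Verify it is loaded; no output here means ibv_reg_mr() on a CUDA
# buffer will fail, which is what perftest reports as "Couldn't allocate MR"
lsmod | grep nvidia_peermem

# On older driver/OFED stacks, check for the legacy module name instead
lsmod | grep nv_peer_mem
```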
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-nic-communication