perftest

ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 <IP> - Couldn't allocate MR

Open francisguillier opened this issue 4 years ago • 3 comments

Hi,

We tried to test GPUDirect RDMA.

Test pod deployed from https://github.com/Mellanox/k8s-images

We deployed two pods:

Server pod:

root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0


* Waiting for client to connect... *

Client pod:

root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 192.168.111.1
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 02:00

Picking device No. 0
[pid = 56, dev = 0] device name = [NVIDIA A30-8C]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 262144 bytes GPU buffer
allocated GPU buffer address at 0000010013000000 pointer=0x10013000000
Couldn't allocate MR
failed to create mr
Failed to create MR
Failed to initialize RDMA contexts. ERRNO: Bad address.
Failed to handle RDMA CM event. ERRNO: Bad address.
Failed to connect RDMA CM events. ERRNO: Bad address.
Segmentation fault (core dumped)

What does "Couldn't allocate MR" mean?

Thanks in advance.

francisguillier avatar Oct 04 '21 16:10 francisguillier

Sorry, to provide some more context: I am testing the GPU Operator together with the Network Operator. nv-peermem has been enabled as part of the GPU Operator deployment.

francisguillier avatar Oct 04 '21 16:10 francisguillier

Hi! Have you solved this problem yet? I have also encountered it and would like to ask how you solved it. Thanks in advance.

wangku0 avatar May 24 '23 07:05 wangku0

Try `sudo modprobe nvidia-peermem`. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-nic-communication
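For what it's worth, "Couldn't allocate MR" here means memory registration (`ibv_reg_mr`) failed on the `cuMemAlloc`'d GPU buffer, which is what happens when no peer-memory client is registered with the RDMA stack. A minimal sketch of a pre-flight check (assuming a Linux node; the module name `nvidia_peermem` is how the module appears in `/proc/modules`):

```shell
#!/bin/sh
# Sketch: check whether the nvidia-peermem kernel module is loaded.
# Without it, ibv_reg_mr() on GPU memory fails and perftest prints
# "Couldn't allocate MR".
if grep -q '^nvidia_peermem' /proc/modules; then
    echo "nvidia_peermem loaded"
else
    echo "nvidia_peermem NOT loaded - try: sudo modprobe nvidia-peermem"
fi
```

If the module is missing even after `modprobe`, check `dmesg` for load errors; note that when running inside a pod, the module must be loaded on the host, not in the container.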

zpkhor avatar Jan 09 '24 11:01 zpkhor