ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 <IP> - Couldn't allocate MR
Hi,
We are trying to test GPUDirect RDMA.
The test pods were deployed from https://github.com/Mellanox/k8s-images
We deployed two pods:
Server pod:
root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0
- Waiting for client to connect... *
Client pod:
root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 192.168.111.1
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 02:00

Picking device No. 0
[pid = 56, dev = 0] device name = [NVIDIA A30-8C]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 262144 bytes GPU buffer
allocated GPU buffer address at 0000010013000000 pointer=0x10013000000
Couldn't allocate MR
failed to create mr
Failed to create MR
Failed to initialize RDMA contexts. ERRNO: Bad address.
Failed to handle RDMA CM event. ERRNO: Bad address.
Failed to connect RDMA CM events. ERRNO: Bad address.
Segmentation fault (core dumped)
What does "Couldn't allocate MR" mean?
Thanks in advance.
Sorry, to provide some more context: I am testing the GPU Operator + Network Operator. nv-peermem has been enabled in the GPU Operator deployment.
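Even when the operator is configured to enable it, it is worth confirming the peer-memory module is actually loaded on the worker node. A sketch of that check, assuming the default gpu-operator namespace and the standard driver daemonset label from a default GPU Operator install (<driver-pod> is a placeholder to substitute):

```shell
# List the driver pods and find the one on the RDMA node under test
# (label assumed from a default GPU Operator deployment)
kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset -o wide

# <driver-pod> is a placeholder; substitute the pod running on your node.
# grep -i catches both nvidia_peermem and the older nv_peer_mem name.
kubectl -n gpu-operator exec <driver-pod> -- sh -c 'lsmod | grep -i peermem'
```

An empty result from the second command means the module is not loaded on that node, which matches the MR registration failure above.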
Hi! Have you solved this problem yet? I have encountered the same issue and would like to know how you solved it. Thanks in advance.
Try running sudo modprobe nvidia-peermem on the host.
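A minimal sketch of loading and verifying the module on the worker node (the nvidia-peermem name assumes a recent NVIDIA driver; older Mellanox stacks ship the module as nv_peer_mem instead):

```shell
# Load the peer-memory kernel module (ships with recent NVIDIA drivers)
sudo modprobe nvidia-peermem

# Verify it is loaded; no output here means ibv_reg_mr() on a CUDA
# buffer will fail, which is what perftest reports as "Couldn't allocate MR"
lsmod | grep nvidia_peermem

# On older driver/OFED stacks, check for the legacy module name instead
lsmod | grep nv_peer_mem
```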
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-nic-communication