Is setting IBGDA necessary for test_internode.py?
I noticed the following steps in the guide:
Enable IBGDA by modifying /etc/modprobe.d/nvidia.conf:
options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"
Update kernel configuration:
sudo update-initramfs -u
sudo reboot
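For reference, a hedged way to confirm that the driver picked up these options after the reboot, assuming your NVIDIA driver exposes its module parameters under /proc/driver/nvidia/params (output formatting may differ across driver versions):

```bash
# Hedged check: verify the NVIDIA driver picked up the IBGDA-related options after reboot.
# Assumes /proc/driver/nvidia/params exists; formatting may vary by driver version.
grep -E "EnableStreamMemOPs|RegistryDwords" /proc/driver/nvidia/params
# Expected (approximately):
#   EnableStreamMemOPs: 1
#   RegistryDwords: "PeerMappingOverride=1;"
```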
Due to some environment permission issues, I can't do this step for now. Is it possible to run test_internode.py without doing this step?
Yes, you can. The normal kernels use IBRC instead of IBGDA. But we plan to support AR later, which always requires IBGDA.
I ran test_internode.py on 2 H20 nodes without IBGDA and encountered the following error; I'm not sure whether it's related to IBGDA. @LyricZhao
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 40 bytes instead of 1
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 127 bytes instead of 8
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/host/topo/topo.cpp:477: non-zero status: -3 allgather of ipc handles failed
nvshmem_src/src/host/init/init.cu:992: non-zero status: 7 building transport map failed
- Based on your logs, it appears that the system is unable to retrieve information from other ranks during bootstrap. We recommend checking your network connectivity settings (see the sketch below), including:
  - Proper IP and network interface configuration (NVSHMEM_HCA_LIST)
  - For RoCE, ensure correct settings for:
    - NVSHMEM_IB_GID_INDEX
    - NVSHMEM_IB_TRAFFIC_CLASS
- We strongly recommend properly enabling IBGDA to prevent potential unknown issues.
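As a sketch only, a RoCE-oriented environment setup might look like the following. The HCA names, GID index, and traffic class are placeholders, not recommended values; query your own fabric (e.g. with show_gids) for the correct ones:

```bash
# Illustrative placeholders only -- check your own fabric before using any of these values.
export NVSHMEM_HCA_LIST=mlx5_0:1,mlx5_1:1   # which NICs/ports NVSHMEM may use
export NVSHMEM_IB_GID_INDEX=3               # GID index of your RoCEv2 address
export NVSHMEM_IB_TRAFFIC_CLASS=106         # traffic class matching your fabric's QoS policy
```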
I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA?
Many thanks.
I used Megatron-LM to test two H100 nodes (16 GPUs) on a RoCE network with ep=16. I also encountered the above bootstrap_net_recv:99: Message truncated: received 40 bytes instead of 8. I set IBGDA, but it prints: WARN: init failed for remote transport: ibrc.
> I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA? Many thanks.
You can set the environment variables
NVSHMEM_IB_ENABLE_IBGDA=1
NVSHMEM_IBGDA_NIC_HANDLER=gpu
to enable IBGDA.
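For example, a minimal sketch of a two-node launch with these variables set (MASTER_ADDR is a placeholder, and the WORLD_SIZE/RANK launch pattern follows the test script quoted later in this thread):

```bash
# Sketch: enable IBGDA and launch the test, one command per node.
# MASTER_ADDR is a placeholder for your rank-0 node's address.
export NVSHMEM_IB_ENABLE_IBGDA=1
export NVSHMEM_IBGDA_NIC_HANDLER=gpu
MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py   # on node 0
MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py   # on node 1
```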
> I used Megatron-LM to test two H100 nodes (16 GPUs) on a RoCE network with ep=16. I also encountered the above bootstrap_net_recv:99: Message truncated: received 40 bytes instead of 8. I set IBGDA, but it prints: WARN: init failed for remote transport: ibrc.
This appears to be an error during NVSHMEM bootstrap. Please verify that your network configuration is correct. We recommend running the NVSHMEM perftest benchmarks first to validate your network setup.
Note that even when IBGDA is enabled, NVSHMEM will still create IBRC connections, so seeing this warning message makes sense. For more details, please refer to the NVSHMEM documentation.
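As an illustration only, one of the point-to-point perftest binaries built from the NVSHMEM source tree can be used for this sanity check. The binary path, hostnames, and MPI launcher below are assumptions that depend on how your NVSHMEM and MPI are set up:

```bash
# Hedged sketch: run an NVSHMEM point-to-point bandwidth test across the two nodes
# before running DeepEP. The binary path assumes the perftest suite was built from
# the NVSHMEM source tree; node0/node1 are placeholder hostnames.
mpirun -np 2 -H node0,node1 \
  -x NVSHMEM_IB_ENABLE_IBGDA=1 -x NVSHMEM_IBGDA_NIC_HANDLER=gpu \
  ./perftest/device/pt-to-pt/shmem_put_bw
```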
My script:
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py
RANK=0 result:
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 40 bytes instead of 8
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/host/topo/topo.cpp:477: non-zero status: -3 allgather of ipc handles failed
...
nvshmem_src/src/host/init/init.cu:992: non-zero status: 7 building transport map failed
nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
nvshmem_src/src/host/init/init.cu:992: non-zero status: 7 building transport map failed
RANK=1 result:
nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1850: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.
nvshmem_src/src/host/topo/topo.cpp:469: [GPU 0] Peer GPU 1 is not accessible, exiting ...
nvshmem_src/src/host/init/init.cu:992: non-zero status: 3 building transport map failed
...
WARN: init failed for remote transport: ibrc
@Baibaifan The message neither nv_peer_mem nor nvidia_peermem detected indicates that your system environment does not currently support GPUDirect RDMA. To resolve this, please try loading the GDR kernel module by running one of the following commands:
modprobe nv_peer_mem
# or
modprobe nvidia_peermem
This should enable GPUDirect RDMA functionality.
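For example, a quick hedged check of the current module state before (or after) loading one of them:

```bash
# Check whether a GPUDirect RDMA peer-memory module is already loaded.
lsmod | grep -E "nv_peer_mem|nvidia_peermem"
# If neither shows up, load one of them (nvidia_peermem ships with recent NVIDIA drivers):
sudo modprobe nvidia_peermem
```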
@haswelliris After I successfully ran modprobe nv_peer_mem and repeated the above command, the following appears:
rank0: There is no error output for rank0.
rank1:
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11000008)
==== backtrace (tid: 84663) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000010a13 process_recv() :0
2 0x00000000000112e5 progress_recv() :0
3 0x00000000000113dc nvshmemt_ibrc_progress() :0
4 0x000000000020256c progress_transports() ???:0
5 0x0000000000202c52 nvshmemi_proxy_progress() ???:0
6 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
7 0x0000000000126850 __xmknodat() ???:0
=================================
> I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA? Many thanks.
>
> You can set the environment variables NVSHMEM_IB_ENABLE_IBGDA=1 and NVSHMEM_IBGDA_NIC_HANDLER=gpu to enable IBGDA.
Many thanks, will try it.
Is there any way to determine whether IBGDA is correctly enabled? The performance seems to show no difference whether I set NVSHMEM_IB_ENABLE_IBGDA=1 or 0, and the results look OK. Also, is it necessary to compile libgdsync (https://github.com/gpudirect/libgdsync) before compiling NVSHMEM?
Many thanks for your kind response. It does not seem to work: I added a log message to ibrc.cxx:progress_send(...) and confirmed that this function is still being called to transfer the data. Is there any other configuration missing to enable IBGDA?
While debugging this issue, I found that ibgda.cc:nvshmemt_init(...) fails with the following error message:
WARN: device mlx5_1 cannot allocate buffer on the specified memory type. Skipping...
The problem is caused by a mlx5dv_devx_umem_reg(...) failure.
Any suggestions would be appreciated.
Thanks.
We encountered the same problem as above: the mlx5dv_devx_umem_reg function fails with both dma-buf and nv_peer_mem. Could someone share a kernel version on which IBGDA works, as well as the corresponding OFED driver version for the Mellanox NIC? Many thanks.
@NDk9856 We use linux kernel version 5.15 and OFED version 5.8.1.0.1 with nvidia_peermem.
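For comparison on your own machines, the corresponding versions can be checked with standard tools (ofed_info is only available when MLNX_OFED is installed):

```bash
# Check the local kernel and MLNX_OFED versions to compare against the working setup above.
uname -r        # kernel version
ofed_info -s    # MLNX_OFED version string (only present if MOFED is installed)
```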
Thank you so much @sphish. We fixed the mlx5dv_devx_umem_reg failure by disabling the nv_peer_mem module and then enabling the nvidia_peermem module manually. We are on Ubuntu 22.04 LTS with kernel version 5.15.0-140 and MOFED 5.8-6. Hope this is helpful for someone else.
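For anyone hitting the same issue, a hedged sketch of the module swap described above (module and package state vary by system; nvidia_peermem is bundled with recent NVIDIA driver releases):

```bash
# Swap the legacy peer-memory module for the driver-bundled one, as described above.
sudo rmmod nv_peer_mem          # unload nv_peer_mem if it is currently loaded
sudo modprobe nvidia_peermem    # load the NVIDIA-driver-provided module
lsmod | grep nvidia_peermem     # confirm it is now active
```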