Is setting IBGDA necessary for test_internode.py?
I noticed the following steps in the guide:
Enable IBGDA by modifying /etc/modprobe.d/nvidia.conf:
options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"
Update kernel configuration:
sudo update-initramfs -u
sudo reboot
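For reference, a hedged way to confirm that the driver picked up these options after the reboot, assuming your NVIDIA driver exposes its module parameters under /proc/driver/nvidia/params (output formatting may differ across driver versions):

```bash
# Hedged check: verify the NVIDIA driver picked up the IBGDA-related options after reboot.
# Assumes /proc/driver/nvidia/params exists; formatting may vary by driver version.
grep -E "EnableStreamMemOPs|RegistryDwords" /proc/driver/nvidia/params
# Expected (approximately):
#   EnableStreamMemOPs: 1
#   RegistryDwords: "PeerMappingOverride=1;"
```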
Due to some environment permission issues, I can't do this step for now. Is it possible to run test_internode.py without doing this step?
Yes, you can. The normal kernels use IBRC instead of IBGDA. But we plan to support AR later, which always requires IBGDA.
I ran test_internode.py on 2 H20 nodes without IBGDA and encountered the following error; I'm not sure whether it's related to IBGDA. @LyricZhao
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 40 bytes instead of 1
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 127 bytes instead of 8
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/host/topo/topo.cpp:477: non-zero status: -3 allgather of ipc handles failed
nvshmem_src/src/host/init/init.cu:992: non-zero status: 7 building transport map failed
- Based on your logs, it appears that the system is unable to retrieve information from other ranks during bootstrap. We recommend checking your network connectivity settings (see the sketch below), including:
  - Proper IP and network interface configuration (NVSHMEM_HCA_LIST)
  - For RoCE, ensure correct settings for:
    - NVSHMEM_IB_GID_INDEX
    - NVSHMEM_IB_TRAFFIC_CLASS
- We strongly recommend properly enabling IBGDA to prevent potential unknown issues.
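As a sketch only, a RoCE-oriented environment setup might look like the following. The HCA names, GID index, and traffic class are placeholders, not recommended values; query your own fabric (e.g. with show_gids) for the correct ones:

```bash
# Illustrative placeholders only -- check your own fabric before using any of these values.
export NVSHMEM_HCA_LIST=mlx5_0:1,mlx5_1:1   # which NICs/ports NVSHMEM may use
export NVSHMEM_IB_GID_INDEX=3               # GID index of your RoCEv2 address
export NVSHMEM_IB_TRAFFIC_CLASS=106         # traffic class matching your fabric's QoS policy
```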
I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA?
Many thanks.
I used Megatron-LM to test two H100 nodes (16 GPUs) on a RoCE network with ep=16. I also encountered the above bootstrap_net_recv:99: Message truncated: received 40 bytes instead of 8. I set IBGDA, but it prints: WARN: init failed for remote transport: ibrc.
> I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA? Many thanks.
You can set the environment variables
NVSHMEM_IB_ENABLE_IBGDA=1
NVSHMEM_IBGDA_NIC_HANDLER=gpu
to enable IBGDA.
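For example, a minimal sketch of a two-node launch with these variables set (MASTER_ADDR is a placeholder, and the WORLD_SIZE/RANK launch pattern follows the test script quoted later in this thread):

```bash
# Sketch: enable IBGDA and launch the test, one command per node.
# MASTER_ADDR is a placeholder for your rank-0 node's address.
export NVSHMEM_IB_ENABLE_IBGDA=1
export NVSHMEM_IBGDA_NIC_HANDLER=gpu
MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py   # on node 0
MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py   # on node 1
```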
> I used Megatron-LM to test two H100 nodes (16 GPUs) on a RoCE network with ep=16. I also encountered the above bootstrap_net_recv:99: Message truncated: received 40 bytes instead of 8. I set IBGDA, but it prints: WARN: init failed for remote transport: ibrc.
This appears to be an error during NVSHMEM bootstrap. Please verify that your network configuration is correct. We recommend running the NVSHMEM perftest benchmarks first to validate your network setup.
Note that even when IBGDA is enabled, NVSHMEM will still create IBRC connections, so seeing this warning message makes sense. For more details, please refer to the NVSHMEM documentation.
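As an illustration only, one of the point-to-point perftest binaries built from the NVSHMEM source tree can be used for this sanity check. The binary path, hostnames, and MPI launcher below are assumptions that depend on how your NVSHMEM and MPI are set up:

```bash
# Hedged sketch: run an NVSHMEM point-to-point bandwidth test across the two nodes
# before running DeepEP. The binary path assumes the perftest suite was built from
# the NVSHMEM source tree; node0/node1 are placeholder hostnames.
mpirun -np 2 -H node0,node1 \
  -x NVSHMEM_IB_ENABLE_IBGDA=1 -x NVSHMEM_IBGDA_NIC_HANDLER=gpu \
  ./perftest/device/pt-to-pt/shmem_put_bw
```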
My script:
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py
RANK=0 result:
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 40 bytes instead of 8
nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/host/topo/topo.cpp:477: non-zero status: -3 allgather of ipc handles failed
...
nvshmem_src/src/host/init/init.cu:992: non-zero status: 7 building transport map failed
nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
nvshmem_src/src/host/init/init.cu:992: non-zero status: 7 building transport map failed
RANK=1 result:
nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1850: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.
nvshmem_src/src/host/topo/topo.cpp:469: [GPU 0] Peer GPU 1 is not accessible, exiting ...
nvshmem_src/src/host/init/init.cu:992: non-zero status: 3 building transport map failed
...
WARN: init failed for remote transport: ibrc
@Baibaifan The message neither nv_peer_mem nor nvidia_peermem detected indicates that your system environment does not currently support GPUDirect RDMA. To resolve this, please try loading the GDR kernel module by running one of the following commands:
modprobe nv_peer_mem
# or
modprobe nvidia_peermem
This should enable GPUDirect RDMA functionality.
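For example, a quick hedged check of the current module state before (or after) loading one of them:

```bash
# Check whether a GPUDirect RDMA peer-memory module is already loaded.
lsmod | grep -E "nv_peer_mem|nvidia_peermem"
# If neither shows up, load one of them (nvidia_peermem ships with recent NVIDIA drivers):
sudo modprobe nvidia_peermem
```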
@haswelliris After I successfully ran modprobe nv_peer_mem and repeated the above command, the following appears:
rank0: There is no error output for rank0.
rank1:
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11000008)
==== backtrace (tid: 84663) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000010a13 process_recv() :0
2 0x00000000000112e5 progress_recv() :0
3 0x00000000000113dc nvshmemt_ibrc_progress() :0
4 0x000000000020256c progress_transports() ???:0
5 0x0000000000202c52 nvshmemi_proxy_progress() ???:0
6 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
7 0x0000000000126850 __xmknodat() ???:0
=================================
> I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA? Many thanks.
>
> You can set the environment variables NVSHMEM_IB_ENABLE_IBGDA=1 and NVSHMEM_IBGDA_NIC_HANDLER=gpu to enable IBGDA.
Many thanks, will try it.
Is there any way to determine whether IBGDA is correctly enabled? The performance seems to show no difference whether I set NVSHMEM_IB_ENABLE_IBGDA=1 or 0, and the results look OK. Also, is it necessary to compile libgdsync (https://github.com/gpudirect/libgdsync) before compiling NVSHMEM?
Many thanks for your kind response. It does not seem to work: I added a log message to ibrc.cxx:progress_send(...) and confirmed that this function is still being called to transfer the data. Is there any other configuration missing to enable IBGDA?
While debugging this issue, I found that ibgda.cc:nvshmemt_init(...) fails with the following error message:
WARN: device mlx5_1 cannot allocate buffer on the specified memory type. Skipping...
The problem is caused by a mlx5dv_devx_umem_reg(...) failure.
Any suggestions would be appreciated.
Thanks.
We encountered the same problem as above: the mlx5dv_devx_umem_reg function fails with both dma-buf and nv_peer_mem. Could someone share a kernel version on which IBGDA works, as well as the corresponding OFED driver version for the Mellanox NIC? Many thanks.
@NDk9856 We use linux kernel version 5.15 and OFED version 5.8.1.0.1 with nvidia_peermem.
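For comparison on your own machines, the corresponding versions can be checked with standard tools (ofed_info is only available when MLNX_OFED is installed):

```bash
# Check the local kernel and MLNX_OFED versions to compare against the working setup above.
uname -r        # kernel version
ofed_info -s    # MLNX_OFED version string (only present if MOFED is installed)
```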
Thank you so much @sphish. We fixed the mlx5dv_devx_umem_reg failure by disabling the nv_peer_mem module and then enabling the nvidia_peermem module manually. We are on Ubuntu 22.04 LTS with kernel version 5.15.0-140 and MOFED 5.8-6. Hope this is helpful for someone else.
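For anyone hitting the same issue, a hedged sketch of the module swap described above (module and package state vary by system; nvidia_peermem is bundled with recent NVIDIA driver releases):

```bash
# Swap the legacy peer-memory module for the driver-bundled one, as described above.
sudo rmmod nv_peer_mem          # unload nv_peer_mem if it is currently loaded
sudo modprobe nvidia_peermem    # load the NVIDIA-driver-provided module
lsmod | grep nvidia_peermem     # confirm it is now active
```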