test_low_latency failed
I am experiencing an issue with NVSHMEM failing to initialize due to transport errors. The error message indicates that NVSHMEM is unable to detect the system topology and cannot initialize any transport layers. However, test_intranode.py passed successfully... I would like to know how to resolve this problem.
System Information
GPU Model: H100 (8 GPUs, single node)
OS: Ubuntu 22.04
CUDA Version: 12.5
NVSHMEM Version: 3.2.5
Error Log
WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
(each rank prints the messages above, several times for the ibrc warning, so the full log is heavily interleaved and duplicated)
W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22985 via signal SIGTERM
W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22987 via signal SIGTERM
What is your network hardware configuration? Could you please run nvidia-smi topo -mp and ibv_devinfo and share the results?
I'm seeing a similar issue:
root@22f186c3783d:/workspace# nvidia-smi topo -mp
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB PHB PHB SYS SYS SYS SYS NODE PHB PHB PHB PHB SYS SYS SYS SYS 0-87 0 N/A
GPU1 PHB X PHB PHB SYS SYS SYS SYS NODE PHB PHB PHB PHB SYS SYS SYS SYS 0-87 0 N/A
GPU2 PHB PHB X PHB SYS SYS SYS SYS NODE PHB PHB PHB PHB SYS SYS SYS SYS 0-87 0 N/A
GPU3 PHB PHB PHB X SYS SYS SYS SYS NODE PHB PHB PHB PHB SYS SYS SYS SYS 0-87 0 N/A
GPU4 SYS SYS SYS SYS X PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB PHB PHB 88-175 1 N/A
GPU5 SYS SYS SYS SYS PHB X PHB PHB SYS SYS SYS SYS SYS PHB PHB PHB PHB 88-175 1 N/A
GPU6 SYS SYS SYS SYS PHB PHB X PHB SYS SYS SYS SYS SYS PHB PHB PHB PHB 88-175 1 N/A
GPU7 SYS SYS SYS SYS PHB PHB PHB X SYS SYS SYS SYS SYS PHB PHB PHB PHB 88-175 1 N/A
NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE SYS SYS SYS SYS
NIC1 PHB PHB PHB PHB SYS SYS SYS SYS NODE X PHB PHB PHB SYS SYS SYS SYS
NIC2 PHB PHB PHB PHB SYS SYS SYS SYS NODE PHB X PHB PHB SYS SYS SYS SYS
NIC3 PHB PHB PHB PHB SYS SYS SYS SYS NODE PHB PHB X PHB SYS SYS SYS SYS
NIC4 PHB PHB PHB PHB SYS SYS SYS SYS NODE PHB PHB PHB X SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB
NIC6 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB
NIC7 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB X PHB
NIC8 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
root@22f186c3783d:/workspace# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.38.1002
node_guid: 3eea:72ff:fe24:32af
sys_image_guid: 58a2:e103:0048:66de
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MT_0000001108
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 28.37.1700
node_guid: d4fb:b330:a54f:0277
sys_image_guid: 946d:ae03:00f0:0b4e
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MT_0000000970
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1689
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 28.37.1700
node_guid: a879:2436:7090:e75b
sys_image_guid: 946d:ae03:00f0:063e
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MT_0000000970
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1691
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_3
transport: InfiniBand (0)
fw_ver: 28.37.1700
node_guid: 2dc3:190f:3d85:1cb6
sys_image_guid: 946d:ae03:00f0:0b6a
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MT_0000000970
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1690
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_4
transport: InfiniBand (0)
fw_ver: 28.37.1700
node_guid: e70f:f6b9:f338:c9b6
sys_image_guid: 946d:ae03:00f0:0302
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MT_0000000970
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1692
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_5
transport: InfiniBand (0)
fw_ver: 28.37.1700
node_guid: 4ea0:6489:d37a:7cf7
sys_image_guid: 946d:ae03:00fc:eaf6
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MT_0000000970
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1693
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_6
transport: InfiniBand (0)
fw_ver: 28.37.1700
node_guid: ac9a:fa6f:97fa:a093
sys_image_guid: 946d:ae03:00fc:ec8c
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MT_0000000970
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1694
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_7
transport: InfiniBand (0)
fw_ver: 28.37.1700
node_guid: fef9:7fce:e85c:939f
sys_image_guid: 946d:ae03:00f0:0b68
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MT_0000000970
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1695
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_8
transport: InfiniBand (0)
fw_ver: 28.37.1700
node_guid: ae8f:1005:af4b:5ea7
sys_image_guid: 946d:ae03:00f0:0b46
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MT_0000000970
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1696
port_lmc: 0x00
link_layer: InfiniBand
@BigValen It appears that nvshmem cannot initialize ibrc transport, which is typically related to network configuration issues. However, the ibv_devinfo and nvidia-smi outputs you provided look normal. Could you try running ib_write_bw and nvshmem's shmem_put_bw to see if they work properly? This will help us determine if the issue is specific to nvshmem or if there might be a more general RDMA connectivity problem.
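For reference, a typical way to exercise both on a single node might look like the following (the device name, server address placeholder, and install path are illustrative assumptions; adjust to your setup):
ib_write_bw -d mlx5_1                  # terminal 1: perftest server
ib_write_bw -d mlx5_1 <server_ip>      # terminal 2: perftest client
nvshmrun -n 2 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw   # shmem_put_bw needs exactly two PEs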
@sphish Same issue. Any help?
@liusy58 Can you run NVSHMEM's shmem_put_bw test and see whether you hit the same issue?
@sphish emmm, some features are not supported on my machine, I will try to fix it. Thank you a lot~~
@sphish Hi, the output of shmem_put_bw is shown below. I cannot resolve this; could you please give me some guidance?
/opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
Runtime options after parsing command line arguments
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H20 bus id: 8
/home/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1851: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.
/home/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:nvshmemt_init:3626: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.
/home/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
This test requires exactly two processes
Segmentation fault (core dumped)
nvidia-smi topo -mp
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE NODE SYS SYS SYS SYS NODE NODE SYS SYS 0-47,96-143 0 N/A
GPU1 NODE X NODE NODE SYS SYS SYS SYS PIX NODE SYS SYS 0-47,96-143 0 N/A
GPU2 NODE NODE X NODE SYS SYS SYS SYS NODE NODE SYS SYS 0-47,96-143 0 N/A
GPU3 NODE NODE NODE X SYS SYS SYS SYS NODE PIX SYS SYS 0-47,96-143 0 N/A
GPU4 SYS SYS SYS SYS X NODE NODE NODE SYS SYS PIX NODE 48-95,144-191 1 N/A
GPU5 SYS SYS SYS SYS NODE X NODE NODE SYS SYS NODE NODE 48-95,144-191 1 N/A
GPU6 SYS SYS SYS SYS NODE NODE X NODE SYS SYS NODE PIX 48-95,144-191 1 N/A
GPU7 SYS SYS SYS SYS NODE NODE NODE X SYS SYS NODE NODE 48-95,144-191 1 N/A
NIC0 NODE PIX NODE NODE SYS SYS SYS SYS X NODE SYS SYS
NIC1 NODE NODE NODE PIX SYS SYS SYS SYS NODE X SYS SYS
NIC2 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS X NODE
NIC3 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
ibv_devinfo
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 32.39.3804
node_guid: 5c25:7303:00f0:052a
sys_image_guid: 5c25:7303:00f0:052a
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000001093
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_1
transport: InfiniBand (0)
fw_ver: 32.39.3804
node_guid: 5c25:7303:00f0:07ea
sys_image_guid: 5c25:7303:00f0:07ea
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000001093
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_2
transport: InfiniBand (0)
fw_ver: 32.39.3804
node_guid: 5c25:7303:00f0:0800
sys_image_guid: 5c25:7303:00f0:0800
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000001093
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_3
transport: InfiniBand (0)
fw_ver: 32.39.3804
node_guid: 5c25:7303:00f0:0556
sys_image_guid: 5c25:7303:00f0:0556
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000001093
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
@liusy58 You need to load the nvidia_peermem kernel module.
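For example (standard host commands, shown as a sketch; loading the module requires root):
sudo modprobe nvidia_peermem
lsmod | grep nvidia_peermem   # should list nvidia_peermem once the module is loaded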
Thank you~
@sphish Same issue. Any help?
@liusy58 Can you run NVSHMEM's shmem_put_bw test and see whether you hit the same issue?
After running the command shmem_put_bw, I encountered the following error. Could you give me further guidance? Thanks a lot.
/opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
Runtime options after parsing command line arguments
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H800 bus id: 25
This test requires exactly two processes
[/xxx/nvshmem_src/perftest/common/utils.cu:408] cuda failed with invalid argument
I suspect this is related to the CUDA driver version.
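(For reference, the installed driver version can be checked with the generic nvidia-smi query below; nothing DeepEP-specific:)
nvidia-smi --query-gpu=driver_version --format=csv,noheader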
@sphish
Hi, I ran into a similar issue. When testing ./shmem_put_bw, I got the error below.
Runtime options after parsing command line arguments
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H100 80GB HBM3 bus id: 10
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
WARN: GPU cannot map UAR of device mlx5_0. Skipping...
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
WARN: GPU cannot map UAR of device mlx5_1. Skipping...
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
WARN: GPU cannot map UAR of device mlx5_2. Skipping...
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
WARN: GPU cannot map UAR of device mlx5_3. Skipping...
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
WARN: GPU cannot map UAR of device mlx5_4. Skipping...
/home/dpsk_a2a/deepep-nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
@koanho Can you check if the nvidia-peermem module is correctly installed and loaded?
Thank you for the reply @sphish. I think nvidia-peermem is correctly installed and loaded.
Singularity> modinfo nvidia-peermem
filename: /lib/modules/5.14.0-284.11.1.el9_2.x86_64/extra/nvidia-peermem.ko
version: 550.54.15
license: Dual BSD/GPL
description: NVIDIA GPU memory plug-in
author: Yishai Hadas
rhelversion: 9.2
srcversion: B13C9DFD8CD4E8BE2B5D362
depends: nvidia,ib_core
retpoline: Y
name: nvidia_peermem
vermagic: 5.14.0-284.11.1.el9_2.x86_64 SMP preempt mod_unload modversions
parm: peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
Singularity> lsmod | grep nvidia_peermem
nvidia_peermem 20480 0
ib_core 491520 25 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia 8626176 1106 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
@koanho Have you modified the driver config? https://github.com/deepseek-ai/DeepEP/tree/main/third-party#4-configure-nvidia-driver
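For reference, as far as I recall that step amounts to enabling peer mapping in the NVIDIA kernel module and rebooting; please double-check the exact options against the linked README, since driver packaging differs between distributions:
# /etc/modprobe.d/nvidia.conf (assumed location)
options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"
sudo update-initramfs -u   # regenerate the initramfs (Ubuntu/Debian), then reboot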
Thank you @sphish. I couldn't modify the driver configuration because I don't have root permissions on my training cluster 😞 It seems the error may have occurred because IBGDA is not properly enabled. IBGDA is necessary to use DeepEP, right?
IBGDA is necessary to use DeepEP, right?
@koanho If you want to use low-latency mode, yes. If you only want to use the normal mode for training, you can use an older version of DeepEP, which uses the IBRC transport.
If you only want to use the normal mode for training, you can use an older version of DeepEP, which uses the IBRC transport.
@sphish which commit uses IBRC transport? Thanks
I have the same error; it is as follows:
/sgl-workspace/nvshmem/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/sgl-workspace/nvshmem/src/modules/transport/ibgda/ibgda.cpp:nvshmemt_init:3626: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.
/sgl-workspace/nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp 1652 Enumerated IB devices in the system - device id=7 (of 10), name=mlx5_7, num_ports=1
/sgl-workspace/nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/sgl-workspace/nvshmem/src/host/init/init.cu:995: non-zero status: 7 nvshmem detect topo failed
modinfo nvidia-peermem
filename: /lib/modules/6.8.0-54-generic/updates/dkms/nvidia-peermem.ko
version: 570.86.10
license: Linux-OpenIB
description: NVIDIA GPU memory plug-in
author: Yishai Hadas
srcversion: 3FE468926DDE98F050252DF
depends: nvidia,ib_uverbs
retpoline: Y
name: nvidia_peermem
vermagic: 6.8.0-54-generic SMP preempt mod_unload modversions
parm: peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
parm: persistent_api_support:Set level of support for persistent APIs, 0 [legacy] or 1 [default] (int)
lsmod | grep nvidia_peermem
nvidia_peermem 16384 0
nvidia 89829376 97 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
ib_uverbs 200704 3 nvidia_peermem,rdma_ucm,mlx5_ib
I've been working on this for a long time; is there any good solution?
I've found a "good old version" that works with "IBGDA disabled" machines, which is https://github.com/deepseek-ai/DeepEP/commit/a84a24808fb0ea732f49b874cc456a69dde69076
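In case it helps, checking out and rebuilding DeepEP at that commit might look roughly like this (the NVSHMEM_DIR variable and paths are assumptions based on the DeepEP README; adjust to your environment):
git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP
git checkout a84a24808fb0ea732f49b874cc456a69dde69076
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install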
Thank you. I just looked at the modified patch and found that it is the same as what I am using now, so I don't know what went wrong.
In fact, I saw the sglang issue mentioning the Dockerfile for creating this environment, but the problems above still occurred.
@vinjn I've tested commit a84a248, but encountered a runtime error during DeepEP setup. It seems that DeepEP cannot be set up properly when NVSHMEM is installed with IBGDA disabled. Could you help me resolve this issue?
47.98 nvlink error : Undefined reference to 'nvshmemi_ibgda_device_state_d' in '/usr/src/DeepEP/build/temp.linux-x86_64-cpython-312/csrc/kernels/internode_ll.o'
47.98 ninja: build stopped: subcommand failed.
47.99 Traceback (most recent call last):
47.99 File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2220, in _run_ninja_build
47.99 subprocess.run(
47.99 File "/usr/lib/python3.12/subprocess.py", line 571, in run
47.99 raise CalledProcessError(retcode, process.args,
47.99 subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
I found that executing shmem_put_bw reports a segmentation fault, and even after loading the module with modprobe nvidia_peermem the command still does not work.
./shmem_put_bw
Runtime options after parsing command line arguments
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H20-3e bus id: 42
/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1851: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.
This test requires exactly two processes
Segmentation fault
@koanho Have you modified the driver config? https://github.com/deepseek-ai/DeepEP/tree/main/third-party#4-configure-nvidia-driver
@sphish I used the above method to configure the NVIDIA driver, but I still get a segmentation fault. Is there any other solution?
Have you resolved this? I've been stuck here for days. @ch-tiger1
me too
@sphish Hi, excuse me: how can I determine whether my IB NIC or driver supports the IBGDA feature? I am using ConnectX-7 with OFED-23.10-2.1.3. According to the NVSHMEM code, if the GPU supports dmabuf then nvidia-peermem is not required; is that correct? Does using DeepEP have any requirements on the Linux kernel version? Is GDRCopy necessary for IBGDA, and what is its purpose?
// Query whether the GPU supports dmabuf.
status = CUPFN(ibgda_cuda_syms,
               cuDeviceGetAttribute(&flag,
                                    (CUdevice_attribute)CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED,
                                    gpu_device_id));
if (status != CUDA_SUCCESS) {
    // Attribute query failed: clear the CUDA error and assume no dmabuf support.
    status = 0;
    cudaGetLastError();
    ibgda_state->cuda_support_dmabuf = false;
} else {
    ibgda_state->cuda_support_dmabuf = (flag == 1);
}

// dmabuf use for data buffers can be explicitly disabled via the IB_DISABLE_DMABUF option.
ibgda_state->dmabuf_support_for_data_buffers = ibgda_state->cuda_support_dmabuf;
if (options->IB_DISABLE_DMABUF) {
    ibgda_state->dmabuf_support_for_data_buffers = false;
}

// Without dmabuf, nv_peer_mem / nvidia_peermem is required; otherwise the transport is skipped.
if (ibgda_state->dmabuf_support_for_data_buffers == false) {
    if (nvshmemt_ib_common_nv_peer_mem_available() != NVSHMEMX_SUCCESS) {
        NVSHMEMI_ERROR_PRINT(
            "neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.\n");
        status = NVSHMEMX_ERROR_INTERNAL;
        goto out;
    }
}
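A minimal standalone check of the same attribute (a sketch using the plain CUDA driver API, independent of NVSHMEM; the file name and build line are just placeholders) would look something like this:
// check_dmabuf.cu -- query CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED for device 0
// build: nvcc check_dmabuf.cu -o check_dmabuf -lcuda
#include <cuda.h>
#include <cstdio>

int main() {
    CUdevice dev;
    int flag = 0;
    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGet(&dev, 0) != CUDA_SUCCESS) {
        std::printf("failed to initialize the CUDA driver API\n");
        return 1;
    }
    if (cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, dev) != CUDA_SUCCESS) {
        std::printf("attribute query failed (driver/toolkit may be too old)\n");
        return 1;
    }
    std::printf("CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED = %d\n", flag);
    return 0;
}
A value of 1 means the driver reports dmabuf support, in which case the snippet above skips the nv_peer_mem/nvidia_peermem requirement unless dmabuf is explicitly disabled via IB_DISABLE_DMABUF.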
Thank you! I look forward to your response.
This test requires exactly two processes
[/xxx/nvshmem_src/perftest/common/utils.cu:408] cuda failed with invalid argument
You need to run this example with exactly two PEs, so you can use nvshmrun to launch it, e.g.: nvshmrun -n 2 ./shmem_put_bw. BTW, nvshmrun is a process launcher; you can install it by following install_hydra.sh in the nvshmem project. Good luck!
@kwu130 For information about which devices and driver versions support IBGDA, it’s best to consult NVIDIA directly. We are not entirely sure about the exact compatibility details either. You can refer to our environment setup here: https://github.com/deepseek-ai/DeepEP/issues/36#issuecomment-2892652482. I think that gdrcopy isn't required for IBGDA. However, the last time I tried to build without gdrcopy, I encountered some errors. I haven’t looked into this issue in detail yet.
copy that, thx very much.