test_low_latency failed

Open hyesung84 opened this issue 9 months ago • 31 comments

I am experiencing an issue with NVSHMEM failing to initialize due to transport errors. The error message indicates that NVSHMEM is unable to detect the system topology and cannot initialize any transport layers. However, test_intranode.py passed successfully... I would like to know how to resolve this problem.

System Information
- GPU Model: H100 (8 GPUs, single node)
- OS: Ubuntu 22.04
- CUDA Version: 12.5
- NVSHMEM Version: 3.2.5

Error Log

WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting

(the same messages are repeated, interleaved, by every rank)

W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22985 via signal SIGTERM
W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22987 via signal SIGTERM

hyesung84 avatar Mar 07 '25 07:03 hyesung84

What is your network hardware configuration? Could you please run nvidia-smi topo -mp and ibv_devinfo and share the results?

sphish avatar Mar 09 '25 02:03 sphish
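
For reference, a minimal first-pass check on a node hitting this failure (the same commands requested above, plus a peer-memory module check; device names are examples) could be:

# show GPU/NIC PCIe topology
nvidia-smi topo -mp
# list RDMA devices, ports and link layer (InfiniBand vs. Ethernet/RoCE)
ibv_devinfo
# confirm the GPUDirect RDMA kernel module is loaded
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'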

I'm seeing a similar issue:

root@22f186c3783d:/workspace#
root@22f186c3783d:/workspace# nvidia-smi topo -mp
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity        GPU NUMA ID
GPU0     X      PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-87    0           N/A
GPU1    PHB      X      PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-87    0           N/A
GPU2    PHB     PHB      X      PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-87    0           N/A
GPU3    PHB     PHB     PHB      X      SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-87    0           N/A
GPU4    SYS     SYS     SYS     SYS      X      PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     88-175  1           N/A
GPU5    SYS     SYS     SYS     SYS     PHB      X      PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     88-175  1           N/A
GPU6    SYS     SYS     SYS     SYS     PHB     PHB      X      PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     88-175  1           N/A
GPU7    SYS     SYS     SYS     SYS     PHB     PHB     PHB      X      SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     88-175  1           N/A
NIC0    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC1    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE     X      PHB     PHB     PHB     SYS     SYS     SYS     SYS
NIC2    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB      X      PHB     PHB     SYS     SYS     SYS     SYS
NIC3    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB      X      PHB     SYS     SYS     SYS     SYS
NIC4    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB      X      SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS      X      PHB     PHB     PHB
NIC6    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB      X      PHB     PHB
NIC7    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB      X      PHB
NIC8    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8

root@22f186c3783d:/workspace# ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.38.1002
        node_guid:                      3eea:72ff:fe24:32af
        sys_image_guid:                 58a2:e103:0048:66de
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000001108
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      d4fb:b330:a54f:0277
        sys_image_guid:                 946d:ae03:00f0:0b4e
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1689
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_2
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      a879:2436:7090:e75b
        sys_image_guid:                 946d:ae03:00f0:063e
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1691
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_3
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      2dc3:190f:3d85:1cb6
        sys_image_guid:                 946d:ae03:00f0:0b6a
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1690
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_4
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      e70f:f6b9:f338:c9b6
        sys_image_guid:                 946d:ae03:00f0:0302
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1692
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_5
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      4ea0:6489:d37a:7cf7
        sys_image_guid:                 946d:ae03:00fc:eaf6
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1693
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_6
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      ac9a:fa6f:97fa:a093
        sys_image_guid:                 946d:ae03:00fc:ec8c
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1694
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_7
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      fef9:7fce:e85c:939f
        sys_image_guid:                 946d:ae03:00f0:0b68
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1695
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_8
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      ae8f:1005:af4b:5ea7
        sys_image_guid:                 946d:ae03:00f0:0b46
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1696
                        port_lmc:               0x00
                        link_layer:             InfiniBand

BigValen avatar Mar 24 '25 13:03 BigValen

@BigValen It appears that nvshmem cannot initialize ibrc transport, which is typically related to network configuration issues. However, the ibv_devinfo and nvidia-smi outputs you provided look normal. Could you try running ib_write_bw and nvshmem's shmem_put_bw to see if they work properly? This will help us determine if the issue is specific to nvshmem or if there might be a more general RDMA connectivity problem.

sphish avatar Mar 25 '25 01:03 sphish
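
For anyone trying the ib_write_bw check suggested above: it is typically run as a server/client pair from the perftest package. The HCA name and hostname below are placeholders for your own setup:

# on the first node (or terminal): start the server on a chosen HCA
ib_write_bw -d mlx5_1
# on the second node (or terminal): connect to it as a client
ib_write_bw -d mlx5_1 <server_hostname>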

@sphish Same issue. Any help?

liusy58 avatar Mar 31 '25 15:03 liusy58

@sphish Same issue. Any help?

@liusy58 Can you run NVSHMEM's shmem_put_bw test and check whether you hit the same issue?

sphish avatar Apr 01 '25 01:04 sphish

@sphish emmm, some features are not supported on my machine, I will try to fix it. Thank you a lot~~

liusy58 avatar Apr 01 '25 02:04 liusy58

@sphish Hi, the output of shmem_put_bw is shown below. I cannot resolve this; could you please give me some guidance?

/opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw 
Runtime options after parsing command line arguments 
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H20 bus id: 8 
/home/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1851: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/home/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:nvshmemt_init:3626: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/home/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
This test requires exactly two processes 
Segmentation fault (core dumped)
nvidia-smi topo -mp
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     0-47,96-143  0               N/A
GPU1    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    SYS     SYS     0-47,96-143  0               N/A
GPU2    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     0-47,96-143  0               N/A
GPU3    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     NODE    PIX     SYS     SYS     0-47,96-143  0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     PIX     NODE    48-95,144-191        1               N/A
GPU5    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     NODE    NODE    48-95,144-191        1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     NODE    PIX     48-95,144-191        1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     NODE    NODE    48-95,144-191        1               N/A
NIC0    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    SYS     SYS
NIC1    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS      X      NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
ibv_devinfo
hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3804
        node_guid:                      5c25:7303:00f0:052a
        sys_image_guid:                 5c25:7303:00f0:052a
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_1
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3804
        node_guid:                      5c25:7303:00f0:07ea
        sys_image_guid:                 5c25:7303:00f0:07ea
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_2
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3804
        node_guid:                      5c25:7303:00f0:0800
        sys_image_guid:                 5c25:7303:00f0:0800
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_3
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3804
        node_guid:                      5c25:7303:00f0:0556
        sys_image_guid:                 5c25:7303:00f0:0556
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

liusy58 avatar Apr 02 '25 09:04 liusy58

@liusy58 You need to load the nvidia_peermem kernel module.

sphish avatar Apr 02 '25 15:04 sphish
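
A minimal sketch of loading and verifying the module (requires root; the modules-load.d path is an assumption, adjust for your distribution):

# load the GPUDirect RDMA peer-memory module
sudo modprobe nvidia_peermem
# verify it is loaded
lsmod | grep nvidia_peermem
# optionally make it persistent across reboots (file name is an assumption)
echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf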

Thank you~

liusy58 avatar Apr 03 '25 02:04 liusy58

@sphish Same issue. Any help?

@liusy58 Can you run NVSHMEM's shmem_put_bw test and check whether you hit the same issue?

After running the command shmem_put_bw, I encountered the following error. Could you give me further guidance? Thanks a lot.

/opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw 
Runtime options after parsing command line arguments 
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H800 bus id: 25 
This test requires exactly two processes 
[/xxx/nvshmem_src/perftest/common/utils.cu:408] cuda failed with invalid argument

Cydia2018 avatar Apr 24 '25 09:04 Cydia2018

I suspect this is related to the CUDA driver version.

sphish avatar Apr 25 '25 02:04 sphish
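
A quick way to compare the installed kernel driver with the CUDA toolkit used to build the binaries, in case of a version mismatch:

# kernel driver version reported by the driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# CUDA toolkit version
nvcc --version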

@sphish Hi, I got a similar issue. When testing ./shmem_put_bw, got an error below.

Runtime options after parsing command line arguments 
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H100 80GB HBM3 bus id: 10 
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_0. Skipping...

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_1. Skipping...

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_2. Skipping...

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_3. Skipping...

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_4. Skipping...

/home/dpsk_a2a/deepep-nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA

koanho avatar Apr 27 '25 05:04 koanho

@koanho Can you check if the nvidia-peermem module is correctly installed and loaded?

sphish avatar Apr 27 '25 06:04 sphish

Thank you for reply @sphish. I think nvidia-peermem is correctly installed and loaded.

Singularity> modinfo nvidia-peermem
filename:       /lib/modules/5.14.0-284.11.1.el9_2.x86_64/extra/nvidia-peermem.ko
version:        550.54.15
license:        Dual BSD/GPL
description:    NVIDIA GPU memory plug-in
author:         Yishai Hadas
rhelversion:    9.2
srcversion:     B13C9DFD8CD4E8BE2B5D362
depends:        nvidia,ib_core
retpoline:      Y
name:           nvidia_peermem
vermagic:       5.14.0-284.11.1.el9_2.x86_64 SMP preempt mod_unload modversions 
parm:           peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
Singularity> lsmod | grep nvidia_peermem
nvidia_peermem         20480  0
ib_core               491520  25 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia               8626176  1106 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset

koanho avatar Apr 27 '25 06:04 koanho

@koanho Have you modified the driver config? https://github.com/deepseek-ai/DeepEP/tree/main/third-party#4-configure-nvidia-driver

sphish avatar Apr 27 '25 07:04 sphish
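
For reference, the driver change described in the linked DeepEP third-party README is roughly the following; verify the exact option values against that README, and note that it requires root and a reboot:

# add to /etc/modprobe.d/nvidia.conf to enable IBGDA
options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"
# then regenerate the initramfs and reboot
sudo update-initramfs -u
sudo reboot
# afterwards, the active driver parameters can be inspected with
cat /proc/driver/nvidia/params | grep -E 'RegistryDwords|EnableStreamMemOPs'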

Thank you @sphish. I couldn't modify the driver configuration because I don't have root permissions on my training cluster 😞 It seems the error occurred because IBGDA is not properly enabled. IBGDA is necessary to use DeepEP, right?

koanho avatar Apr 27 '25 08:04 koanho

IBGDA is necessary to use DeepEP, right?

@koanho If you want to use low-latency mode, yes. If you only want to use the normal mode for training, you can use an older version of DeepEP, which uses the IBRC transport.

sphish avatar Apr 27 '25 08:04 sphish

If you only want to use the normal mode for training, you can use an older version of DeepEP, which uses the IBRC transport.

@sphish which commit uses IBRC transport? Thanks

vinjn avatar May 08 '25 03:05 vinjn

I have the same error; the log is as follows:

/sgl-workspace/nvshmem/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/sgl-workspace/nvshmem/src/modules/transport/ibgda/ibgda.cpp:nvshmemt_init:3626: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/sgl-workspace/nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp 1652 Enumerated IB devices in the system - device id=7 (of 10), name=mlx5_7, num_ports=1
/sgl-workspace/nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/sgl-workspace/nvshmem/src/host/init/init.cu:995: non-zero status: 7 nvshmem detect topo failed

modinfo nvidia-peermem

filename:       /lib/modules/6.8.0-54-generic/updates/dkms/nvidia-peermem.ko
version:        570.86.10
license:        Linux-OpenIB
description:    NVIDIA GPU memory plug-in
author:         Yishai Hadas
srcversion:     3FE468926DDE98F050252DF
depends:        nvidia,ib_uverbs
retpoline:      Y
name:           nvidia_peermem
vermagic:       6.8.0-54-generic SMP preempt mod_unload modversions
parm:           peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
parm:           persistent_api_support:Set level of support for persistent APIs, 0 [legacy] or 1 [default] (int)

lsmod | grep nvidia_peermem

nvidia_peermem         16384  0
nvidia              89829376  97 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
ib_uverbs             200704  3 nvidia_peermem,rdma_ucm,mlx5_ib

I've been working on this for a long time. Is there any good solution?

ch-tiger1 avatar May 09 '25 03:05 ch-tiger1

I've found a "good old version" that works with "IBGDA disabled" machines, which is https://github.com/deepseek-ai/DeepEP/commit/a84a24808fb0ea732f49b874cc456a69dde69076

vinjn avatar May 09 '25 03:05 vinjn
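
A sketch of pinning DeepEP to that commit, assuming NVSHMEM is already installed; the install path is a placeholder and the setup.py invocation follows the DeepEP README convention:

git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP
# check out the revision that still uses the IBRC transport for normal mode
git checkout a84a24808fb0ea732f49b874cc456a69dde69076
# build and install against your NVSHMEM installation (path is a placeholder)
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install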

I've found a "good old version" that works with "IBGDA disabled" machines, which is a84a248

Thank you. I just looked at the modified patch and found that it is the same as what I am using now. I don't know what went wrong.

(screenshot of the patch)

In fact, I created this environment from the Dockerfile mentioned in the sglang issue, but the problems above still occurred.

ch-tiger1 avatar May 09 '25 03:05 ch-tiger1

I've found a "good old version" that works with "IBGDA disabled" machines, which is a84a248

@vinjn I've tested commit a84a248, but encountered a build error during DeepEP setup. It seems that DeepEP cannot be built properly when NVSHMEM is installed with IBGDA disabled. Could you help me resolve this issue?

47.98 nvlink error   : Undefined reference to 'nvshmemi_ibgda_device_state_d' in '/usr/src/DeepEP/build/temp.linux-x86_64-cpython-312/csrc/kernels/internode_ll.o'
47.98 ninja: build stopped: subcommand failed.
47.99 Traceback (most recent call last):
47.99   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2220, in _run_ninja_build
47.99     subprocess.run(
47.99   File "/usr/lib/python3.12/subprocess.py", line 571, in run
47.99     raise CalledProcessError(retcode, process.args,
47.99 subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

koanho avatar May 09 '25 06:05 koanho
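
The undefined reference to nvshmemi_ibgda_device_state_d suggests the NVSHMEM that DeepEP links against was built without IBGDA support. A hedged sketch of rebuilding NVSHMEM with IBGDA enabled, with flag names taken from the DeepEP third-party build instructions and paths as placeholders:

cd /path/to/nvshmem_src
# configure with the IBGDA transport enabled
NVSHMEM_IBGDA_SUPPORT=1 NVSHMEM_IBRC_SUPPORT=1 \
cmake -S . -B build -DCMAKE_INSTALL_PREFIX=/path/to/nvshmem_install
cmake --build build -j
cmake --install build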

I found that executing shmem_put_bw reports a segmentation fault, and loading the module with modprobe nvidia_peermem did not work either.

./shmem_put_bw


Runtime options after parsing command line arguments
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H20-3e bus id: 42
/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1851: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

This test requires exactly two processes
Segmentation fault

ch-tiger1 avatar May 09 '25 09:05 ch-tiger1

@koanho Have you modified the driver config? https://github.com/deepseek-ai/DeepEP/tree/main/third-party#4-configure-nvidia-driver

@sphish I used the above method to configure the NVIDIA driver, but still got a segmentation fault. Is there any other solution?

ch-tiger1 avatar May 09 '25 10:05 ch-tiger1

Have you resolved this? I've been stuck here for days. @ch-tiger1

Kevin-XiongC avatar May 15 '25 08:05 Kevin-XiongC

me too

wwj-2017-1117 avatar May 21 '25 09:05 wwj-2017-1117

@sphish Hi, excuse me: how can I determine whether my IB NIC or driver supports the IBGDA feature? I am using ConnectX-7 with OFED-23.10-2.1.3. According to the NVSHMEM code, if the GPU supports dmabuf, then nvidia-peermem is not required; is that correct? Does using DeepEP have any requirements on the Linux kernel version? Is GDRCopy necessary for IBGDA, and what is its purpose?

   status =
        CUPFN(ibgda_cuda_syms,
              cuDeviceGetAttribute(&flag, (CUdevice_attribute)CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED,
                                   gpu_device_id));
    if (status != CUDA_SUCCESS) {
        status = 0;
        cudaGetLastError();
        ibgda_state->cuda_support_dmabuf = false;
    } else {
        ibgda_state->cuda_support_dmabuf = (flag == 1);
    }

    ibgda_state->dmabuf_support_for_data_buffers = ibgda_state->cuda_support_dmabuf;
    if (options->IB_DISABLE_DMABUF) {
        ibgda_state->dmabuf_support_for_data_buffers = false;
    }

    if (ibgda_state->dmabuf_support_for_data_buffers == false) {
        if (nvshmemt_ib_common_nv_peer_mem_available() != NVSHMEMX_SUCCESS) {
            NVSHMEMI_ERROR_PRINT(
                "neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.\n");
            status = NVSHMEMX_ERROR_INTERNAL;
            goto out;
        }
    }

Thank you! I look forward to your response.

kwu130 avatar Jun 07 '25 04:06 kwu130

@Cydia2018 You need to run shmem_put_bw with exactly two PEs, so launch it with nvshmrun, e.g. nvshmrun -n 2 ./shmem_put_bw. By the way, nvshmrun is a process launcher; you can install it by following install_hydra.sh in the NVSHMEM project. Good luck!

kwu130 avatar Jun 07 '25 06:06 kwu130
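
Spelled out, the suggestion above might look like the following; paths are placeholders, and the hydra launcher comes from the NVSHMEM source tree (see install_hydra.sh there for its exact arguments):

# make the installed hydra launcher (which provides nvshmrun) visible
export PATH=/path/to/hydra-install/bin:$PATH
# run the bandwidth test with exactly two PEs; a correct run prints a
# bandwidth table instead of the "exactly two processes" error
nvshmrun -n 2 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw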

@kwu130 For information about which devices and driver versions support IBGDA, it’s best to consult NVIDIA directly. We are not entirely sure about the exact compatibility details either. You can refer to our environment setup here: https://github.com/deepseek-ai/DeepEP/issues/36#issuecomment-2892652482. I think that gdrcopy isn't required for IBGDA. However, the last time I tried to build without gdrcopy, I encountered some errors. I haven’t looked into this issue in detail yet.

sphish avatar Jun 09 '25 01:06 sphish

copy that, thx very much.

kwu130 avatar Jun 12 '25 08:06 kwu130