
Test test_low_latency.py failed on H100 with RoCE

ImbaPlayer opened this issue 10 months ago • 38 comments

Issue Description

I'm working on an H100 GPU cluster with RoCE drivers properly installed on the network interface cards. While the test_intranode.py script runs successfully and produces the expected results, test_low_latency.py consistently fails.

Technical details:

  • NVSHMEM version installed: 3.1.7-1 (following the README instructions)
  • Suspected cause: a compatibility mismatch between this NVSHMEM version and the RoCE configuration

I would greatly appreciate any assistance or insights to resolve this. Below are the specific error messages for reference:

Actual Result

(base) root@ubuntu:/work/DeepEP-main# python tests/test_low_latency.py
local_rank:1, ip:127.0.0.1 port:3004
world_size:2, rank:1
local_rank:0, ip:127.0.0.1 port:3004
world_size:2, rank:0
setting....
setting....
setted...
rank:0, num_ranks:2
Allocating buffer size: 2116.292096 MB ...

/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 ep_connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 transport create connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 nvshmem setup connections failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
(identical messages from the second rank and interleaved duplicates trimmed)

W0303 20:33:45.904000 12712 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 12777 via signal SIGTERM
Traceback (most recent call last):
  File "/work/DeepEP-main/tests/test_low_latency.py", line 164, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 255

Env

Ubuntu 22.04

Linux ubuntu 5.15.0-25-generic #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:18:00.0 Off |                    0 |
| N/A   22C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:2A:00.0 Off |                    0 |
| N/A   25C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:3A:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5D:00.0 Off |                    0 |
| N/A   22C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9A:00.0 Off |                    0 |
| N/A   24C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:AB:00.0 Off |                    0 |
| N/A   25C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:BA:00.0 Off |                    0 |
| N/A   23C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:DB:00.0 Off |                    0 |
| N/A   22C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

ibv_devinfo

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:0059:910c
        sys_image_guid:                 a088:c203:0059:910c
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:0050:a72c
        sys_image_guid:                 a088:c203:0050:a72c
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_2
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:007e:1dba
        sys_image_guid:                 a088:c203:007e:1dba
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_3
        transport:                      InfiniBand (0)
        fw_ver:                         16.35.4030
        node_guid:                      e8eb:d303:0055:750a
        sys_image_guid:                 e8eb:d303:0055:750a
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000425
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_4
        transport:                      InfiniBand (0)
        fw_ver:                         16.35.4030
        node_guid:                      e8eb:d303:0055:750b
        sys_image_guid:                 e8eb:d303:0055:750a
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000425
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_5
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:0060:27e6
        sys_image_guid:                 a088:c203:0060:27e6
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_6
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:007e:1c3a
        sys_image_guid:                 a088:c203:007e:1c3a
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_7
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:0060:2b1e
        sys_image_guid:                 a088:c203:0060:2b1e
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_8
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:007d:ab62
        sys_image_guid:                 a088:c203:007d:ab62
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_9
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:007d:ab9a
        sys_image_guid:                 a088:c203:007d:ab9a
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

ImbaPlayer avatar Mar 03 '25 12:03 ImbaPlayer

This is a bug in NVSHMEM 3.1.7 and can be resolved by using NVSHMEM 3.2.5. https://github.com/deepseek-ai/DeepEP/issues/17#issuecomment-2684327121
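For reference, the rebuild flow looks roughly like this (a sketch only; the tarball name, paths, and install prefix are placeholders, and it assumes the patch shipped in the DeepEP repo under third_party/):

# unpack the NVSHMEM 3.2.5 source (from the NVIDIA developer site)
tar xf nvshmem_src_3.2.5-1.txz && cd nvshmem_src
# apply the DeepEP patch
git apply /path/to/DeepEP/third_party/nvshmem.patch
# configure and build with the same flags used for 3.1.7 (full list in the README)
NVSHMEM_IBGDA_SUPPORT=1 NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/path/to/install
cmake --build build/ -j && cmake --install build/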

sphish avatar Mar 04 '25 01:03 sphish

This is a bug in NVSHMEM 3.1.7 and can be resolved by using NVSHMEM 3.2.5. #17 (comment)

How should we deal with conflicts when applying the DeepEP patch?

Baibaifan avatar Mar 04 '25 02:03 Baibaifan

This is a bug in NVSHMEM 3.1.7 and can be resolved by using NVSHMEM 3.2.5. #17 (comment)

How should we deal with conflicts when applying the DeepEP patch?

For the current patch, the conflict is caused by a commit modifying the CMake file. You can skip this commit. I will later upload a new patch compatible with NVSHMEM 3.2.5.
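If the patch is in git-am (mailbox) format, one way to skip the conflicting commit is the following (a sketch; alternatively, the CMakeLists.txt hunks can be deleted from the patch file by hand or with patchutils' filterdiff):

cd nvshmem_src
# git am needs a repository, so make the pristine source a baseline commit
git init . && git add -A && git commit -m "nvshmem 3.2.5 baseline"
# apply the series; it will stop at the commit touching the CMake file
git am /path/to/DeepEP/third_party/nvshmem.patch
# drop that commit and continue with the rest of the series
git am --skip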

sphish avatar Mar 04 '25 03:03 sphish

This is a bug in NVSHMEM 3.1.7 and can be resolved by using NVSHMEM 3.2.5. #17 (comment)

How should we deal with conflicts when applying the DeepEP patch?

For the current patch, the conflict is caused by a commit modifying the CMake file. You can skip this commit. I will later upload a new patch compatible with NVSHMEM 3.2.5.

I tried to modify the CMakeLists.txt file, but CMakeLists.txt has been refactored, and I don't know how to adapt the original patch at lines 140 and 165. I hope you can give me some suggestions; I'll wait for you to update the patch. @sphish

Baibaifan avatar Mar 04 '25 08:03 Baibaifan

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

sphish avatar Mar 05 '25 08:03 sphish

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

OK, I will try it. Does the CMake configuration below need to be modified? @sphish

CUDA_HOME=/path/to/cuda \
GDRCOPY_HOME=/path/to/gdrcopy \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install

Baibaifan avatar Mar 05 '25 08:03 Baibaifan

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

OK, I will try it. Does the CMake configuration below need to be modified? @sphish

CUDA_HOME=/path/to/cuda \
GDRCOPY_HOME=/path/to/gdrcopy \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install

Nope.

sphish avatar Mar 05 '25 08:03 sphish

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

I have the same problem with test_low_latency.py on a single 8×H100 cluster node, after updating NVSHMEM to version 3.2.5 with the newest patch. It still doesn't work and fails with the same logs, i.e., /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed. However, I have run the ib_write_bw command successfully. Any suggestions for this problem?

kunfupanda-hw avatar Mar 05 '25 13:03 kunfupanda-hw

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

I have the same problem with test_low_latency.py on a single 8×H100 cluster node, after updating NVSHMEM to version 3.2.5 with the newest patch. It still doesn't work and fails with the same logs, i.e., /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed. However, I have run the ib_write_bw command successfully. Any suggestions for this problem?

Could you provide the logs from your NVSHMEM 3.2.5 run? It appears you're still using NVSHMEM 3.1.7.
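One quick way to confirm which NVSHMEM a DeepEP build actually links (a sketch; the extension's filename and install paths are illustrative):

# the assert paths in the error output name the source tree the library was
# built from; you can also grep the compiled extension for them
strings deep_ep/deep_ep_cpp*.so | grep -o 'nvshmem_src_[0-9.]*-[0-9]*' | sort -u
# then rebuild DeepEP against the new install (NVSHMEM_DIR per the DeepEP README)
NVSHMEM_DIR=/path/to/nvshmem-3.2.5-install python setup.py install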

sphish avatar Mar 06 '25 02:03 sphish

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

OK, I will try it. Does the CMake configuration below need to be modified? @sphish

CUDA_HOME=/path/to/cuda \
GDRCOPY_HOME=/path/to/gdrcopy \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install

Nope.

Hi @sphish, the process works, but the performance does not seem to meet expectations.

env:

  1. 8 × H100 80GB HBM3 per node
  2. 4 × CX7 NICs (400 Gb/s) per node
  3. RoCE

rank0 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py

rank1 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The gap between the speeds above and the official numbers is larger than the difference in network cards can explain. How can I test the performance with perftest? Is there any reference?
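For reference, a raw point-to-point baseline with perftest would look something like this (a sketch; the device name, GID index, and message size are placeholders — the RoCE v2 GID index can be read from show_gids):

# server side
ib_write_bw -d mlx5_0 -x 3 --report_gbits -s 1048576 -q 4
# client side, pointing at the server's IP
ib_write_bw -d mlx5_0 -x 3 --report_gbits -s 1048576 -q 4 <server_ip>
# with a CUDA-enabled perftest build, adding --use_cuda=<gpu index> measures
# GPUDirect RDMA instead of host-memory bandwidth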

Baibaifan avatar Mar 06 '25 02:03 Baibaifan

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

I have the same problem with test_low_latency.py on a single 8×H100 cluster node, after updating NVSHMEM to version 3.2.5 with the newest patch. It still doesn't work and fails with the same logs, i.e., /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed. However, I have run the ib_write_bw command successfully. Any suggestions for this problem?

Could you provide the logs from your NVSHMEM 3.2.5 run? It appears you're still using NVSHMEM 3.1.7.

Sorry, those were old logs. The newest logs are below, with some of my own debug prints added.

(base) root@ubuntu:/work/DeepEP-main# python tests/test_low_latency.py 
local_rank:5, ip:127.0.0.1 port:3004
world_size:8, rank:5
local_rank:4, ip:127.0.0.1 port:3004
world_size:8, rank:4
local_rank:6, ip:127.0.0.1 port:3004
world_size:8, rank:6
local_rank:7, ip:127.0.0.1 port:3004
world_size:8, rank:7
local_rank:0, ip:127.0.0.1 port:3004
world_size:8, rank:0
local_rank:1, ip:127.0.0.1 port:3004
world_size:8, rank:1
local_rank:2, ip:127.0.0.1 port:3004
world_size:8, rank:2
local_rank:3, ip:127.0.0.1 port:3004
world_size:8, rank:3
setting....
setting....
setting....
setting....
setting....
setting....
setting....
setting....
setted...
rank:2, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
setted...
rank:0, num_ranks:8
Allocating buffer size: 2116.2912 MB ...
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
setted...
rank:1, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
setted...
rank:5, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
setted...
rank:4, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
after cpp.Buffer
setted...
rank:7, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
setted...
rank:3, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
after cpp.Buffer
setted...
rank:6, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem initialization failed, exiting
(identical messages from the other ranks and interleaved duplicates trimmed)

W0305 11:55:57.242000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19119 via signal SIGTERM
W0305 11:55:57.244000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19120 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19121 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19122 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19124 via signal SIGTERM
W0305 11:55:57.246000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19125 via signal SIGTERM
W0305 11:55:57.246000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19126 via signal SIGTERM
Traceback (most recent call last):
  File "/work/DeepEP-main/tests/test_low_latency.py", line 164, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 4 terminated with exit code 255

By following the logs, we found the call trace was test_loop() --> buffer = deep_ep.Buffer() --> __init__ --> self.runtime.sync(device_ids, ipc_handles, root_unique_id) in DeepEP, and then nvshmemt_ibrc_connect_endpoints --> nvshmemt_ibrc_ep_connect --> ep_connect --> status = ftable.modify_qp(ep->qp, &attr, flags); in nvshmem_src_3.2.5-1. Status 110 is usually ETIMEDOUT.
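Since a timeout while moving the QP to RTR on RoCE often points at address resolution (e.g. the wrong GID index being selected), it may be worth pinning the HCA and the RoCE v2 GID explicitly before rerunning (a sketch; the exact variable names depend on the NVSHMEM version, so check its environment-variable docs):

# find the RoCE v2 GID index of the NIC in use (often 3 on Mellanox setups)
show_gids | grep mlx5_0
# pin the device and GID index for NVSHMEM, then rerun the test
export NVSHMEM_HCA_LIST=mlx5_0
export NVSHMEM_IB_GID_INDEX=3
python tests/test_low_latency.py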

kunfupanda-hw avatar Mar 06 '25 03:03 kunfupanda-hw

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

(quote of the rest of the exchange above, including the full NVSHMEM 3.2.5 log and the call-trace analysis, trimmed)

I also encountered this problem. tests/test_internode.py does run, but the performance is worse. Could it be related to the fact that I have 4 NICs per machine? @sphish
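For reference, my understanding is that with multiple NICs per node, NVSHMEM's NIC-to-PE mapping may need to be set explicitly, something like the following (a sketch; variable names as documented for NVSHMEM, and the device list is illustrative — use the active ports on your node):

# round-robin the ranks across the listed HCAs/ports
export NVSHMEM_ENABLE_NIC_PE_MAPPING=1
export NVSHMEM_HCA_LIST=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_5:1
python tests/test_internode.py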

Baibaifan avatar Mar 06 '25 06:03 Baibaifan

Hi @sphish, The process works, but the performance does not seem to meet expectations.

env:

  1. H100 80GB HBM3 *8/HPC
  2. 4 * CX7 NICs 400gb/s/HPC
  3. RoCE

rank0 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py

rank1 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The difference between the above speed and the official data is greater than the difference between the network cards. How can I test the performance with perftest? Is there any reference?

Hi, @Baibaifan, are the MASTER_ADDRs of rank0/1 the same? Are they the IPs of the CX7 cards or the host IP?

kunfupanda-hw avatar Mar 06 '25 09:03 kunfupanda-hw

Hi @sphish, The process works, but the performance does not seem to meet expectations.

env:

  1. H100 80GB HBM3 *8/HPC
  2. 4 * CX7 NICs 400gb/s/HPC
  3. RoCE

rank0 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
rank1 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The difference between the above speed and the official data is greater than the difference between the network cards. How can I test the performance with perftest? Is there any reference?

Hi, @Baibaifan, are the MASTER_ADDRs of rank0/1 the same? Are they the IPs of the CX7 cards or the host IP?

The MASTER_ADDRs are the same; both are the host IP. It is only used to build the bootstrap.
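
(For context, a minimal sketch of the bootstrap that MASTER_ADDR feeds — assuming the standard torch.distributed env:// rendezvous and that RANK/WORLD_SIZE are exported as in the commands above; the address is a placeholder:)

import os
import torch.distributed as dist

# MASTER_ADDR/MASTER_PORT are only used for the rendezvous, so the host IP
# is fine here even when the data path later runs over the CX7 cards.
os.environ.setdefault('MASTER_ADDR', '10.0.0.1')  # placeholder host IP
os.environ.setdefault('MASTER_PORT', '3004')
dist.init_process_group(backend='nccl',
                        rank=int(os.environ['RANK']),
                        world_size=int(os.environ['WORLD_SIZE']))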

Baibaifan avatar Mar 06 '25 12:03 Baibaifan

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

I have the same problem on one 8*H100 cluster node for test_low_latency.py, and updated NVSHMEM to version 3.2.5 with the newest patch. However, it doesn't work; it fails with the same logs, i.e., /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed. However, I have tried the ib_write_bw cmd successfully. Any kind suggestion for this problem?

Could you provide the logs from your NVSHMEM 3.2.5 run? It appears you're still using NVSHMEM 3.1.7.

Sorry, those were old logs. The newest logs are below, with some of my own comment logs.

(base) root@ubuntu:/work/DeepEP-main# python tests/test_low_latency.py 
local_rank:5, ip:127.0.0.1 port:3004
world_size:8, rank:5
local_rank:4, ip:127.0.0.1 port:3004
world_size:8, rank:4
local_rank:6, ip:127.0.0.1 port:3004
world_size:8, rank:6
local_rank:7, ip:127.0.0.1 port:3004
world_size:8, rank:7
local_rank:0, ip:127.0.0.1 port:3004
world_size:8, rank:0
local_rank:1, ip:127.0.0.1 port:3004
world_size:8, rank:1
local_rank:2, ip:127.0.0.1 port:3004
world_size:8, rank:2
local_rank:3, ip:127.0.0.1 port:3004
world_size:8, rank:3
setting....
setting....
setting....
setting....
setting....
setting....
setting....
setting....
setted...
rank:2, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
setted...
rank:0, num_ranks:8
Allocating buffer size: 2116.2912 MB ...
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
setted...
rank:1, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
setted...
rank:5, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
setted...
rank:4, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
after cpp.Buffer
setted...
rank:7, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
setted...
rank:3, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
after cpp.Buffer
setted...
rank:6, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
[The same failure sequence is printed, interleaved, by each of the 8 ranks; deduplicated:]
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110
ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7
ep_connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7
transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7
connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080:
nvshmem initialization failed, exiting

W0305 11:55:57.242000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19119 via signal SIGTERM
W0305 11:55:57.244000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19120 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19121 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19122 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19124 via signal SIGTERM
W0305 11:55:57.246000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19125 via signal SIGTERM
W0305 11:55:57.246000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19126 via signal SIGTERM
Traceback (most recent call last):
  File "/work/DeepEP-main/tests/test_low_latency.py", line 164, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 4 terminated with exit code 255

By following the logs, we found the call trace was test_loop() --> buffer = deep_ep.Buffer() --> __init__ --> self.runtime.sync(device_ids, ipc_handles, root_unique_id) in DeepEP, and then nvshmemt_ibrc_connect_endpoints --> nvshmemt_ibrc_ep_connect --> ep_connect --> status = ftable.modify_qp(ep->qp, &attr, flags); in nvshmem_src_3.2.5-1. Status 110 is usually ETIMEDOUT.

Could you add the environment variable NVSHMEM_DEBUG=TRACE, run it again, and provide the log?

sphish avatar Mar 06 '25 12:03 sphish

I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support

OK, I will try it. Does the cmake configuration need to be modified? @sphish

CUDA_HOME=/path/to/cuda \
GDRCOPY_HOME=/path/to/gdrcopy \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install

Nope.

Hi @sphish, The process works, but the performance does not seem to meet expectations.

env:

  1. H100 80GB HBM3 *8/HPC
  2. 4 * CX7 NICs 400gb/s/HPC
  3. RoCE

rank0 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py

rank1 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The difference between the above speed and the official data is greater than the difference between the network cards. How can I test the performance with perftest? Is there any reference?

We haven't tested on a machine with 4 NICs, but I don't think the performance should be this low. Could you try running alltoall_perf from nccl-tests?
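
(A hypothetical single-node invocation, assuming nccl-tests is already built under ./nccl-tests/build — sketched via Python for consistency with the tests:)

import subprocess

# Sweep all-to-all from 8 bytes to 1 GiB, doubling the size each step,
# with one process driving all 8 local GPUs.
subprocess.run(['./nccl-tests/build/alltoall_perf',
                '-b', '8', '-e', '1G', '-f', '2', '-g', '8'],
               check=True)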

sphish avatar Mar 06 '25 12:03 sphish

NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py

After modifying the wrong OOB configuration, the current speed of the 4 network NICs is:

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 24, RDMA chunk 32: 41.14 GB/s (RDMA), 134.29 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 8, RDMA chunk 28: 45.36 GB/s (RDMA), 148.06 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 12: 45.20 GB/s (RDMA), 147.55 GB/s (NVL)

Is this reasonable, since I have 4 network NICs? @sphish

Baibaifan avatar Mar 07 '25 04:03 Baibaifan

By following the logs, we found the call trace was test_loop() --> buffer = deep_ep.Buffer() --> __init__ --> self.runtime.sync(device_ids, ipc_handles, root_unique_id) in DeepEP, and then nvshmemt_ibrc_connect_endpoints --> nvshmemt_ibrc_ep_connect --> ep_connect --> status = ftable.modify_qp(ep->qp, &attr, flags); in nvshmem_src_3.2.5-1. Status 110 is usually ETIMEDOUT.

Could you add the environment variable NVSHMEM_DEBUG=TRACE, run it again, and provide the log?

Hi, @sphish,

In my setting, I run test_internode.py on two nodes and test_intranode.py on one node successfully, but test_low_latency.py still cannot pass: ibv_modify_qp failed. Then I ran the perftest commands and made some observations:

  • A TCP-based QP connection always fails at the RTR transition between differently indexed CX cards, e.g., mlx5_0 to mlx5_1, while it succeeds between same-indexed cards, e.g., mlx5_0 to mlx5_0 (see the sketch after this list).
  • A CM-based QP connection always succeeds and the perftest cmd passes.
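
(A hypothetical sweep that reproduces the pair matrix above with ib_write_bw over loopback; device names and the port are placeholders for your own setup:)

import itertools
import subprocess
import time

devices = ['mlx5_0', 'mlx5_1', 'mlx5_2', 'mlx5_3']  # adjust to your HCAs
for dev_a, dev_b in itertools.product(devices, repeat=2):
    # Server side listens on dev_a; client connects from dev_b via loopback.
    server = subprocess.Popen(['ib_write_bw', '-d', dev_a, '-p', '18515'],
                              stdout=subprocess.DEVNULL)
    time.sleep(1)  # give the server time to start listening
    client = subprocess.run(['ib_write_bw', '-d', dev_b, '-p', '18515',
                             '127.0.0.1'], stdout=subprocess.DEVNULL)
    server.wait()
    print(f'{dev_a} -> {dev_b}:', 'ok' if client.returncode == 0 else 'FAILED')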

I also set the environment variable NVSHMEM_DEBUG=TRACE and ran it again. Please see part of the logs below. They show that the NVSHMEM driver tries to build the EP connection multiple times, and finally fails.

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3867 Begin - Ordered list of devices for assignment (after processing user provdied env vars (if any))  - 
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=0 (of 9), device id=0, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=1 (of 9), device id=1, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=2 (of 9), device id=2, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=3 (of 9), device id=3, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=4 (of 9), device id=5, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=5 (of 9), device id=6, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=6 (of 9), device id=7, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=7 (of 9), device id=8, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=8 (of 9), device id=9, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3880 End - Ordered list of devices for assignment (after processing user provdied env vars (if any))
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3905 NIC buffer will be on GPU memory.
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3916 NIC handler will be GPU.

/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 3 nranks 8 size 32
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 2 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 1 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 7 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 6 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 0 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 5 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 4 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 2 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 1 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 3 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 6 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 5 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 0 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 7 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 4 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 3 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 2 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 1 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 6 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 0 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 5 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 7 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 4 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 3 nranks 8 size 16 - DONE
glusterfs-06:1692:1692 [0] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 0
glusterfs-06:1697:1697 [5] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 6
glusterfs-06:1693:1693 [1] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 1
glusterfs-06:1696:1696 [4] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 5
glusterfs-06:1698:1698 [6] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 7
glusterfs-06:1694:1694 [2] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 2
glusterfs-06:1699:1699 [7] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 8
glusterfs-06:1695:1695 [3] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 4
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 6 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 0 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 5 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 1 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 4 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 2 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 7 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 3 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 5 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 6 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 4 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 3 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 2 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 1 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 0 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 7 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
[Each rank again prints the same failure sequence, interleaved; deduplicated:]
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110
ibv_modify_qp failed
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7
ep_connect failed
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7
transport create connect failed
/work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7
connect EPS failed
/work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
nvshmem setup connections failed
/work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080:
nvshmem initialization failed, exiting

W0307 15:04:39.317000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1692 via signal SIGTERM
W0307 15:04:39.318000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1693 via signal SIGTERM
W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1694 via signal SIGTERM
W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1695 via signal SIGTERM
W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1696 via signal SIGTERM
W0307 15:04:39.320000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1697 via signal SIGTERM
W0307 15:04:39.320000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1699 via signal SIGTERM
Traceback (most recent call last):
  File "/work/DeepEP-roce-support/tests/test_low_latency.py", line 160, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with exit code 255

kunfupanda-hw avatar Mar 07 '25 12:03 kunfupanda-hw

NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py

After modifying the wrong OOB configuration, the current speed of the 4 network NICs is:

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 24, RDMA chunk 32: 41.14 GB/s (RDMA), 134.29 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 8, RDMA chunk 28: 45.36 GB/s (RDMA), 148.06 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 12: 45.20 GB/s (RDMA), 147.55 GB/s (NVL)

Is this reasonable, since I have 4 network NICs? @sphish

@Baibaifan What is OOB configuration? Regarding the performance issue, I agree. It appears that the bandwidth is limited by the NICs.

sphish avatar Mar 09 '25 02:03 sphish

perftest

Could you please send me the command to run perftest for reference? @kunfupanda-hw

Baibaifan avatar Mar 10 '25 02:03 Baibaifan

@Baibaifan What is OOB configuration? Regarding the performance issue, I agree. It appears that the bandwidth is limited by the NICs.

The OOB configuration is NCCL_SOCKET_IFNAME; there was something wrong with the previous test setup. I was running the test_low_latency.py test and encountered the same problem as @kunfupanda-hw. @sphish
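
(For anyone else hitting this, a sketch of pinning the OOB and RDMA devices before the test spawns — interface and HCA names are placeholders for your own setup:)

import os

# Set these before torch.distributed / DeepEP initialize.
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'                  # OOB/bootstrap interface
os.environ['NCCL_IB_HCA'] = 'mlx5_0,mlx5_1,mlx5_2,mlx5_3'  # RDMA-capable HCAs
os.environ['NCCL_DEBUG'] = 'INFO'                          # verify what NCCL selects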

Baibaifan avatar Mar 10 '25 02:03 Baibaifan

Hi @sphish, The process works, but the performance does not seem to meet expectations.

env:

  1. H100 80GB HBM3 *8/HPC
  2. 4 * CX7 NICs 400gb/s/HPC
  3. RoCE

rank0 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
rank1 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The difference between the above speed and the official data is greater than the difference between the network cards. How can I test the performance with perftest? Is there any reference?

Hi, @Baibaifan, are the MASTER_ADDRs of rank0/1 the same? Are they the IPs of the CX7 cards or the host IP?

The MASTER_ADDRs are the same; both are the host IP. It is only used to build the bootstrap.

Hi @Baibaifan, what configuration did you change? I have 4 IB NICs per node, and there is also a large difference between my self-tested data (4.47GB/s RDMA) and the official data.

MinhuiWan avatar Mar 10 '25 02:03 MinhuiWan

perftest

Could you please send me the command to run perftest for reference? @kunfupanda-hw

server: ib_write_bw -d your_card_name -a --report_gbits -n 1000 -q 16 --CPU-freq -p 1
client: ib_write_bw -d your_card_name -a --report_gbits -n 1000 -q 16 your_ip --CPU-freq -p 1

kunfupanda-hw avatar Mar 10 '25 02:03 kunfupanda-hw

ib_write_bw

I mean the tests in perftest/perftest_install, in the nvshmem_src directory. @kunfupanda-hw

Baibaifan avatar Mar 10 '25 03:03 Baibaifan

Hi @sphish, The process works, but the performance does not seem to meet expectations.

env:

  1. H100 80GB HBM3 *8/HPC
  2. 4 * CX7 NICs 400gb/s/HPC
  3. RoCE

rank0 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
rank1 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The difference between the above speed and the official data is greater than the difference between the network cards. How can I test the performance with perftest? Is there any reference?

Hi, @Baibaifan, are the MASTER_ADDRs of rank0/1 the same? Are they the IPs of the CX7 cards or the host IP?

The MASTER_ADDRs are the same; both are the host IP. It is only used to build the bootstrap.

Hi @Baibaifan, what configuration did you change? I have 4 IB NICs per node, and there is also a large difference between my self-tested data (4.47GB/s RDMA) and the official data.

Please check the nccl network configuration. For example: NCCL_IB_HCA and NCCL_SOCKET_IFNAME. @MinhuiWan

Baibaifan avatar Mar 10 '25 03:03 Baibaifan

Hi @sphish, The process works, but the performance does not seem to meet expectations.

env:

  1. H100 80GB HBM3 *8/HPC
  2. 4 * CX7 NICs 400gb/s/HPC
  3. RoCE

rank0 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
rank1 NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The difference between the above speed and the official data is greater than the difference between the network cards. How can I test the performance with perftest? Is there any reference?

Hi, @Baibaifan, are the MASTER_ADDRs of rank0/1 the same? Are they the IPs of the CX7 cards or the host IP?

The MASTER_ADDRs are the same; both are the host IP. It is only used to build the bootstrap.

Hi @Baibaifan, what configuration did you change? I have 4 IB NICs per node, and there is also a large difference between my self-tested data (4.47GB/s RDMA) and the official data.

Please check the nccl network configuration. For example: NCCL_IB_HCA and NCCL_SOCKET_IFNAME. @MinhuiWan

@Baibaifan If you are looking for some performance references for nvshmem perftest, there are some official P2P benchmark results in this link. As for collective communication, we haven't tested it either.

sphish avatar Mar 10 '25 03:03 sphish

By following the logs, we found the call trace was test_loop() --> buffer = deep_ep.Buffer() --> __init__ --> self.runtime.sync(device_ids, ipc_handles, root_unique_id) in DeepEP, and then nvshmemt_ibrc_connect_endpoints --> nvshmemt_ibrc_ep_connect --> ep_connect --> status = ftable.modify_qp(ep->qp, &attr, flags); in nvshmem_src_3.2.5-1. Status 110 is usually ETIMEDOUT.

Could you add the environment variable NVSHMEM_DEBUG=TRACE, run it again, and provide the log?

Hi, @sphish,

In my setting, I run test_internode.py on two nodes and test_intranode.py on one node successfully, but test_low_latency.py still cannot pass: ibv_modify_qp failed. Then I ran the perftest commands and made some observations:

  • A TCP-based QP connection always fails at the RTR transition between differently indexed CX cards, e.g., mlx5_0 to mlx5_1, while it succeeds between same-indexed cards, e.g., mlx5_0 to mlx5_0.
  • A CM-based QP connection always succeeds and the perftest cmd passes.

I also set the environment variable NVSHMEM_DEBUG=TRACE and ran it again. Please see part of the logs below. They show that the NVSHMEM driver tries to build the EP connection multiple times, and finally fails.

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3867 Begin - Ordered list of devices for assignment (after processing user provdied env vars (if any))  - 

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=0 (of 9), device id=0, port_num=1

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=1 (of 9), device id=1, port_num=1

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=2 (of 9), device id=2, port_num=1

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=3 (of 9), device id=3, port_num=1

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=4 (of 9), device id=5, port_num=1

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=5 (of 9), device id=6, port_num=1

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=6 (of 9), device id=7, port_num=1

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=7 (of 9), device id=8, port_num=1

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=8 (of 9), device id=9, port_num=1

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3880 End - Ordered list of devices for assignment (after processing user provdied env vars (if any))

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3905 NIC buffer will be on GPU memory.

/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3916 NIC handler will be GPU.

/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 3 nranks 8 size 32
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 2 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 1 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 7 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 6 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 0 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 5 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 4 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 2 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 1 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 3 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 6 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 5 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 0 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 7 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 4 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 3 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 2 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 1 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 6 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 0 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 5 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 7 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 4 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 3 nranks 8 size 16 - DONE
glusterfs-06:1692:1692 [0] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 0
glusterfs-06:1697:1697 [5] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 6
glusterfs-06:1693:1693 [1] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 1
glusterfs-06:1696:1696 [4] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 5
glusterfs-06:1698:1698 [6] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 7
glusterfs-06:1694:1694 [2] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 2
glusterfs-06:1699:1699 [7] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 8
glusterfs-06:1695:1695 [3] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 4
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 6 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 0 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 5 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 1 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 4 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 2 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 7 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 3 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 5 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 6 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 4 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 3 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 2 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 1 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 0 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 7 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
/work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed
/work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem initialization failed, exiting
(the same failure sequence is printed by each of the 8 ranks, interleaved)
W0307 15:04:39.317000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1692 via signal SIGTERM
W0307 15:04:39.318000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1693 via signal SIGTERM
W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1694 via signal SIGTERM
W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1695 via signal SIGTERM
W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1696 via signal SIGTERM
W0307 15:04:39.320000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1697 via signal SIGTERM
W0307 15:04:39.320000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1699 via signal SIGTERM
Traceback (most recent call last):
  File "/work/DeepEP-roce-support/tests/test_low_latency.py", line 160, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with exit code 255

@kunfupanda-hw I'm sorry, I don't have any ideas. Are you using a RoCE network? You can try running the shmem_put_bw benchmark that ships with nvshmem. If that doesn't work, it might be an issue with your IP configuration.
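
For example, with the hydra launcher (nvshmrun) that ships with nvshmem; this is only a sketch, and the host IPs, GID index, and perftest install path are placeholders you'd adjust for your cluster:

export NVSHMEM_REMOTE_TRANSPORT=ibrc
export NVSHMEM_IB_GID_INDEX=3   # RoCE v2 GID index on your NICs; check with show_gids
nvshmrun -n 2 -ppn 1 -hosts 10.0.0.1,10.0.0.2 \
    /work/nvshmem_src/perftest_install/device/pt-to-pt/shmem_put_bw

If shmem_put_bw reports reasonable bandwidth, nvshmem and the RoCE fabric are working, and the problem is more likely in the DeepEP-side or IP configuration.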

sphish commented Mar 10 '25

Hi @sphish, the process works, but the performance does not seem to meet expectations. env:

  1. H100 80GB HBM3 *8 / HPC
  2. 4 * CX7 NICs, 400 Gb/s / HPC
  3. RoCE

rank 0: NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
rank 1: NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The gap between the above numbers and the official data is larger than the difference in NIC bandwidth alone. How can I test the performance with perftest? Is there any reference?

Hi @Baibaifan, are the MASTER_ADDRs of rank 0 and rank 1 the same? Are they the IPs of the CX7 cards or the host IP?

The MASTER_ADDRs are the same; both are the host IP. It is only used to build the bootstrap.

Hi @Baibaifan, what configuration did you change? I have 4 IB NICs per node, and there is also a large gap between my self-tested data (4.47 GB/s RDMA) and the official data.

Please check the NCCL network configuration, for example NCCL_IB_HCA and NCCL_SOCKET_IFNAME. @MinhuiWan
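
Something like this, where the interface and device names are only examples and should be replaced with whatever actually carries your RoCE traffic:

export NCCL_SOCKET_IFNAME=eth0                    # host TCP interface used for bootstrap
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3    # the RDMA devices NCCL may pick
export NCCL_IB_GID_INDEX=3                        # RoCE v2 GID index, commonly 3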


@Baibaifan If you are looking for some performance references for nvshmem perftest, there are some official P2P benchmark results in this link. As for collective communication, we haven't tested it either.

I am using RoCE with 4 NICs. I have a problem running test_low_latency.py. I want to use perftest/perftest_install in the nvshmem_src directory, for example shmem_put_bw. Is there a reference procedure for how to use perftest? @sphish
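
Concretely, assuming the perftests were built with MPI support, would something like this be the right way to launch it? (The hosts and the install path here are just my guesses.)

export NVSHMEM_BOOTSTRAP=MPI
mpirun -np 2 -host nodeA,nodeB \
    /work/nvshmem_src/perftest_install/device/pt-to-pt/shmem_put_bw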

Baibaifan commented Mar 10 '25


Thanks for your reply @Baibaifan. I will set NCCL_IB_HCA and NCCL_SOCKET_IFNAME and test. But I have a question: DeepEP does not rely on NCCL, so why can NCCL environment variables be set to control it?

MinhuiWan commented Mar 10 '25


The unit tests use NCCL to create the communication group, so those variables apply to that bootstrap phase, and I found that there was a problem with my previous settings. @MinhuiWan

Baibaifan commented Mar 10 '25