Test test_low_latency.py fails on H100 with RoCE
Issue Description
I'm working on an H100 GPU cluster with RoCE drivers properly installed on the network interface cards.
While tests/test_intranode.py runs successfully and produces the expected results, tests/test_low_latency.py consistently fails with the errors below.
Technical Details:
- NVSHMEM version installed: 3.1.7-1 (following the README instructions)
- Suspected compatibility issue: Potential mismatch between NVSHMEM version and RoCE configuration
I would greatly appreciate any assistance or insights to resolve this. Below are the specific error messages for reference:
Actual Result
(base) root@ubuntu:/work/DeepEP-main# python tests/test_low_latency.py
local_rank:1, ip:127.0.0.1 port:3004
world_size:2, rank:1
local_rank:0, ip:127.0.0.1 port:3004
world_size:2, rank:0
setting....
setting....
setted...
rank:0, num_ranks:2
Allocating buffer size: 2116.292096 MB ...
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed
ibv_modify_qp failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 ep_connect failed
ep_connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 transport create connect failed
transport create connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
connect EPS failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 nvshmem setup connections failed
nvshmem setup connections failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 ep_connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 transport create connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110
ibv_modify_qp failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7
nvshmem setup connections failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7
ep_connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074:
nvshmem initialization failed, exiting
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7
transport create connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 nvshmem setup connections failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
W0303 20:33:45.904000 12712 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 12777 via signal SIGTERM
Traceback (most recent call last):
File "/work/DeepEP-main/tests/test_low_latency.py", line 164, in <module>
torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 255
Env
Ubuntu2204
Linux ubuntu 5.15.0-25-generic #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 Off | 00000000:18:00.0 Off | 0 |
| N/A 22C P0 68W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 Off | 00000000:2A:00.0 Off | 0 |
| N/A 25C P0 71W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 Off | 00000000:3A:00.0 Off | 0 |
| N/A 24C P0 69W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 Off | 00000000:5D:00.0 Off | 0 |
| N/A 22C P0 70W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 Off | 00000000:9A:00.0 Off | 0 |
| N/A 24C P0 71W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 Off | 00000000:AB:00.0 Off | 0 |
| N/A 25C P0 72W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 Off | 00000000:BA:00.0 Off | 0 |
| N/A 23C P0 70W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 Off | 00000000:DB:00.0 Off | 0 |
| N/A 22C P0 69W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:0059:910c
sys_image_guid: a088:c203:0059:910c
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:0050:a72c
sys_image_guid: a088:c203:0050:a72c
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:007e:1dba
sys_image_guid: a088:c203:007e:1dba
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_3
transport: InfiniBand (0)
fw_ver: 16.35.4030
node_guid: e8eb:d303:0055:750a
sys_image_guid: e8eb:d303:0055:750a
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000425
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_4
transport: InfiniBand (0)
fw_ver: 16.35.4030
node_guid: e8eb:d303:0055:750b
sys_image_guid: e8eb:d303:0055:750a
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000425
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_5
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:0060:27e6
sys_image_guid: a088:c203:0060:27e6
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_6
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:007e:1c3a
sys_image_guid: a088:c203:007e:1c3a
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_7
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:0060:2b1e
sys_image_guid: a088:c203:0060:2b1e
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_8
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:007d:ab62
sys_image_guid: a088:c203:007d:ab62
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_9
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:007d:ab9a
sys_image_guid: a088:c203:007d:ab9a
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
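As a side note on the environment above: RoCE uses the Ethernet link_layer ports listed by ibv_devinfo, and each port's GID table shows which entries are RoCE v2. A quick way to inspect it is via the standard mlx5 sysfs layout (mlx5_0 and port 1 below are examples; substitute your device and port):

```shell
# Each populated GID entry has a type file containing "IB/RoCE v1" or "RoCE v2";
# unused entries return an error on read, hence the 2>/dev/null.
for t in /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/*; do
  printf '%s: %s\n' "$(basename "$t")" "$(cat "$t" 2>/dev/null)"
done
```

The index of the RoCE v2 entry is what GID-index settings (e.g. the -x flag of perftest tools) refer to.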
This is a bug in NVSHMEM 3.1.7 and can be resolved by using NVSHMEM 3.2.5. https://github.com/deepseek-ai/DeepEP/issues/17#issuecomment-2684327121
How do I deal with conflicts in the DeepEP patch packages?
For the current patch, the conflict is caused by a commit modifying the CMake file. You can skip that commit; I will upload a new patch compatible with NVSHMEM 3.2.5 later.
I tried to modify the CMakeLists.txt file, but CMakeLists.txt has been refactored and I don't know how to adapt the original patch at 140 and 165. I hope you can give me some suggestions; meanwhile I'll wait for you to update the patch. @sphish
I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support
OK, I will try it. Does the CMake configuration need to be modified? @sphish
CUDA_HOME=/path/to/cuda \
GDRCOPY_HOME=/path/to/gdrcopy \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install
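For reference, after the configure step above, the remaining steps would presumably be the standard CMake build and install (the install path is a placeholder, and NVSHMEM_DIR is the variable the DeepEP README uses to locate the install):

```shell
# Build and install NVSHMEM from the build tree configured above
cmake --build build/ -j"$(nproc)"
cmake --install build/

# Make the install visible to the DeepEP build (placeholder path)
export NVSHMEM_DIR=/path/to/your/dir/to/install
```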
Nope.
I have the same problem on an 8*H100 cluster node for test_low_latency.py, and I updated NVSHMEM to version 3.2.5 with the newest patch. However, it still fails with the same logs, i.e., /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed. However, I have run the ib_write_bw command successfully. Any suggestions for this problem?
Could you provide the logs from your NVSHMEM 3.2.5 run? It appears you're still using NVSHMEM 3.1.7.
Hi @sphish, the process works now, but the performance does not seem to meet expectations.
env:
- H100 80GB HBM3 *8/HPC
- 4 * CX7 NICs, 400 Gb/s each/HPC
- RoCE
rank0
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
rank1
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)
The gap between the speeds above and the official numbers is larger than the difference between the network cards would explain. How can I test the performance with perftest? Is there any reference?
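For a raw RDMA bandwidth baseline with perftest, one common pattern is a pairwise ib_write_bw run between the two nodes; the device name and the RoCE v2 GID index (-x) below are environment-specific assumptions, not values from this thread:

```shell
# Server side (node A): listen on one CX7 device; -x selects the RoCE v2 GID index
ib_write_bw -d mlx5_0 -x 3 --report_gbits -s 1048576

# Client side (node B): same device/GID index, connecting to node A's IP
ib_write_bw -d mlx5_0 -x 3 --report_gbits -s 1048576 <nodeA_ip>
```

This measures a single NIC's line rate in isolation, so it is a sanity check on the fabric rather than a direct comparison with the DeepEP tuning numbers, which aggregate traffic across NICs and NVLink.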
Sorry, those were old logs. The newest logs are below, with some of my own debug log lines added.
(base) root@ubuntu:/work/DeepEP-main# python tests/test_low_latency.py
local_rank:5, ip:127.0.0.1 port:3004
world_size:8, rank:5
local_rank:4, ip:127.0.0.1 port:3004
world_size:8, rank:4
local_rank:6, ip:127.0.0.1 port:3004
world_size:8, rank:6
local_rank:7, ip:127.0.0.1 port:3004
world_size:8, rank:7
local_rank:0, ip:127.0.0.1 port:3004
world_size:8, rank:0
local_rank:1, ip:127.0.0.1 port:3004
world_size:8, rank:1
local_rank:2, ip:127.0.0.1 port:3004
world_size:8, rank:2
local_rank:3, ip:127.0.0.1 port:3004
world_size:8, rank:3
setting....
setting....
setting....
setting....
setting....
setting....
setting....
setting....
setted...
rank:2, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
setted...
rank:0, num_ranks:8
Allocating buffer size: 2116.2912 MB ...
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
setted...
rank:1, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
setted...
rank:5, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
setted...
rank:4, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
after cpp.Buffer
setted...
rank:7, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
setted...
rank:3, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
after cpp.Buffer
setted...
rank:6, num_ranks:8
before init buffer
buffer.__init__
before cpp.Buffer
after cpp.Buffer
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
finish all_gather object
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
before runtime.sync
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110
ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ep_connect failed
transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7
connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed
ibv_modify_qp failed
ibv_modify_qp failed
ibv_modify_qp failed
ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed
ep_connect failed
ep_connect failed
ep_connect failed
ep_connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
transport create connect failed
transport create connect failed
transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
connect EPS failed
connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed
nvshmem setup connections failed
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed
ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed
ep_connect failed
ep_connect failed
ep_connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ep_connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ibv_modify_qp failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 ibv_modify_qp failed
ibv_modify_qp failed
connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 transport create connect failed
connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7
connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
ep_connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 ep_connect failed
ep_connect failed
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
nvshmem setup connections failed
connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: transport create connect failed
transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 transport create connect failed
nvshmem initialization failed, exiting
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080:
nvshmem initialization failed, exiting
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 nvshmem initialization failed, exiting
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 nvshmem initialization failed, exiting
connect EPS failed
connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
nvshmem initialization failed, exiting
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7
nvshmem setup connections failed
nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080:
nvshmem initialization failed, exiting
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080:
nvshmem initialization failed, exiting
nvshmem initialization failed, exiting
W0305 11:55:57.242000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19119 via signal SIGTERM
W0305 11:55:57.244000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19120 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19121 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19122 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19124 via signal SIGTERM
W0305 11:55:57.246000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19125 via signal SIGTERM
W0305 11:55:57.246000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19126 via signal SIGTERM
Traceback (most recent call last):
File "/work/DeepEP-main/tests/test_low_latency.py", line 164, in <module>
torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 4 terminated with exit code 255
By following the logs, we found the call trace was test_loop() --> buffer = deep_ep.Buffer() --> __init__ --> self.runtime.sync(device_ids, ipc_handles, root_unique_id) in DeepEP, and then nvshmemt_ibrc_connect_endpoints --> nvshmemt_ibrc_ep_connect --> ep_connect --> status = ftable.modify_qp(ep->qp, &attr, flags); in nvshmem_src_3.2.5-1. Status 110 is usually ETIMEDOUT.
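The non-zero status 110 surfaced by ibv_modify_qp maps directly to the standard Linux errno table, which can be confirmed from Python:

```python
import errno
import os

# ibv_modify_qp reported "non-zero status: 110" -- look up the errno it denotes
status = 110
print(errno.errorcode[status])  # ETIMEDOUT
print(os.strerror(status))      # Connection timed out
```

A timeout at the QP transition to RTR/RTS typically means the CM/RDMA traffic between the endpoints never completed, which points at addressing (GID index, routable RoCE v2 entry) or NIC selection rather than the NVSHMEM build itself.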
I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support
I have the same problem on one 8*H100 cluster node for test_low_latency.py, and I updated NVSHMEM to version 3.2.5 with the newest patch. However, it still doesn't work and fails with the same logs, i.e., `/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed`. However, I have run the `ib_write_bw` command successfully. Any suggestions for this problem?

Could you provide the logs from your NVSHMEM 3.2.5 run? It appears you're still using NVSHMEM 3.1.7.
Sorry, those were old logs. The newest logs are below, with some of my own debug prints added.
(base) root@ubuntu:/work/DeepEP-main# python tests/test_low_latency.py local_rank:5, ip:127.0.0.1 port:3004 world_size:8, rank:5 local_rank:4, ip:127.0.0.1 port:3004 world_size:8, rank:4 local_rank:6, ip:127.0.0.1 port:3004 world_size:8, rank:6 local_rank:7, ip:127.0.0.1 port:3004 world_size:8, rank:7 local_rank:0, ip:127.0.0.1 port:3004 world_size:8, rank:0 local_rank:1, ip:127.0.0.1 port:3004 world_size:8, rank:1 local_rank:2, ip:127.0.0.1 port:3004 world_size:8, rank:2 local_rank:3, ip:127.0.0.1 port:3004 world_size:8, rank:3 setting.... setting.... setting.... setting.... setting.... setting.... setting.... setting.... setted... rank:2, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer setted... rank:0, num_ranks:8 Allocating buffer size: 2116.2912 MB ... before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer setted... rank:1, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer setted... rank:5, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer setted... rank:4, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer after cpp.Buffer setted... rank:7, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer setted... rank:3, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer after cpp.Buffer setted... 
rank:6, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer finish all_gather object finish all_gather object finish all_gather object finish all_gather object finish all_gather object finish all_gather object finish all_gather object finish all_gather object before runtime.sync before runtime.sync before runtime.sync before runtime.sync before runtime.sync before runtime.sync before runtime.sync before runtime.sync /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ep_connect failed transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed ibv_modify_qp failed ibv_modify_qp failed ibv_modify_qp failed ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed ep_connect failed ep_connect failed ep_connect failed ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed transport create connect failed transport create connect failed transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero 
status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed connect EPS failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed nvshmem setup connections failed nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed ep_connect failed ep_connect failed ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 ibv_modify_qp failed ibv_modify_qp failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 transport create connect failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed 
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 ep_connect failed ep_connect failed nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: transport create connect failed transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 transport create connect failed nvshmem initialization failed, exiting /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem initialization failed, exiting nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: 
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 nvshmem initialization failed, exiting /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 nvshmem initialization failed, exiting connect EPS failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem initialization failed, exiting nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem initialization failed, exiting /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem initialization failed, exiting nvshmem initialization failed, exiting W0305 11:55:57.242000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19119 via signal SIGTERM W0305 11:55:57.244000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19120 via signal SIGTERM W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19121 via signal SIGTERM W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19122 via signal SIGTERM W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19124 via signal SIGTERM W0305 11:55:57.246000 19054 
site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19125 via signal SIGTERM
W0305 11:55:57.246000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19126 via signal SIGTERM
Traceback (most recent call last):
  File "/work/DeepEP-main/tests/test_low_latency.py", line 164, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 4 terminated with exit code 255

By following the logs, we found the call trace was `test_loop() --> buffer = deep_ep.Buffer() --> __init__ --> self.runtime.sync(device_ids, ipc_handles, root_unique_id)` in DeepEP, and then `nvshmemt_ibrc_connect_endpoints --> nvshmemt_ibrc_ep_connect --> ep_connect --> status = ftable.modify_qp(ep->qp, &attr, flags);` in nvshmem_src_3.2.5-1. Status 110 is usually ETIMEDOUT.
I also encountered this problem. tests/test_internode.py can run, but the performance is worse. Is it related to the fact that I have 4 NICs per machine? @sphish
Hi @sphish, The process works, but the performance does not seem to meet expectations.
env:
- H100 80GB HBM3 *8/HPC
- 4 * CX7 NICs 400gb/s/HPC
- RoCE
rank0:
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py

rank1:
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The gap between the above speeds and the official numbers is larger than what the difference in network cards alone would explain. How can I test the performance with perftest? Is there any reference?
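Since `ib_write_bw` (from the perftest suite) was already reported to work earlier in this thread, a hedged way to get a per-NIC baseline is to run it pairwise between the two nodes with the same device and RoCEv2 GID index that NVSHMEM would use. The device name `mlx5_0`, GID index `3`, and `<nodeA_ip>` below are placeholders for your setup:

```shell
# Server side (node A): pick one CX7 device and a RoCEv2 GID index
# (use `show_gids` to find the right -x value for your NIC/VLAN).
ib_write_bw -d mlx5_0 -x 3 -s 1048576 --report_gbits

# Client side (node B): same device and GID index, plus the server's IP.
ib_write_bw -d mlx5_0 -x 3 -s 1048576 --report_gbits <nodeA_ip>
```

With 400 Gb/s CX7 NICs, each device should report far more than the ~4 GB/s RDMA bandwidth shown in the tuning output; if perftest is also slow, the bottleneck is likely fabric/QoS configuration (PFC/ECN, MTU, routing) rather than DeepEP itself.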
Hi @Baibaifan, are the MASTER_ADDRs of rank0/rank1 the same? Are they the IPs of the CX7 cards or the host IP?
The MASTER_ADDRs are the same; both are the host IP. It is only used to build the bootstrap.
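The bootstrap role of MASTER_ADDR can be illustrated with a minimal sketch: every rank connects to the same host IP/port, and rank 0 hands out a root unique ID before any NVSHMEM/RDMA connection is made (so the host IP is fine here; only the later RDMA traffic goes over the CX7 NICs). The function names below are illustrative, not DeepEP APIs:

```python
import socket
import threading

def serve_unique_id(srv, unique_id, num_clients):
    # Rank 0: hand the same root unique ID to every connecting rank.
    for _ in range(num_clients):
        conn, _ = srv.accept()
        with conn:
            conn.sendall(unique_id)

def fetch_unique_id(port):
    # Other ranks: connect to MASTER_ADDR and read the ID.
    with socket.create_connection(("127.0.0.1", port)) as cli:
        return cli.recv(128)

# The tests use a fixed port (3004); an ephemeral port is used here
# so the sketch runs anywhere.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(2)
port = srv.getsockname()[1]

uid = b"root-unique-id"
t = threading.Thread(target=serve_unique_id, args=(srv, uid, 2))
t.start()
got = [fetch_unique_id(port) for _ in range(2)]
t.join()
srv.close()
print(got)
```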
I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support我已经创建了一个新分支,更新了 NVSHMEM 到版本 3.2.5。然而,我没有 RoCE 环境进行验证。你能测试一下吗?@Baibaifan 分支:https://github.com/deepseek-ai/DeepEP/tree/roce-support
I have the same problem on one 8*H100 cluster node for
test_low_latency.py, and update NVSHMEM to version 3.2.5 with the newest patch. However, it doesn't work, it lacks for, with the same logs, i.e.,/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed ibv_modify_qp failed. However, I have tried theib_write_bwcmd successfully. Any kind suggestion for this problem?我在一个 8 台 H100 集群节点(test_low_latency.py)上遇到了相同的问题,并将 NVSHMEM 更新到版本 3.2.5 并应用了最新的补丁。然而这并没有解决问题,日志依然相同,即/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed ibv_modify_qp failed。不过,ib_write_bw命令我已经成功尝试过了。对于这个问题,有什么建议吗?Could you provide the logs from your NVSHMEM 3.2.5 run? It appears you're still using NVSHMEM 3.1.7.您能提供 NVSHMEM 3.2.5 运行的日志吗?看来您仍在使用 NVSHMEM 3.1.7。
sorry for that old logs. The newest logs are as below, with some my comment logs.抱歉,旧的日志如下,以下是最新的日志,并附有一些我的注释。
(base) root@ubuntu:/work/DeepEP-main# python tests/test_low_latency.py local_rank:5, ip:127.0.0.1 port:3004 world_size:8, rank:5 local_rank:4, ip:127.0.0.1 port:3004 world_size:8, rank:4 local_rank:6, ip:127.0.0.1 port:3004 world_size:8, rank:6 local_rank:7, ip:127.0.0.1 port:3004 world_size:8, rank:7 local_rank:0, ip:127.0.0.1 port:3004 world_size:8, rank:0 local_rank:1, ip:127.0.0.1 port:3004 world_size:8, rank:1 local_rank:2, ip:127.0.0.1 port:3004 world_size:8, rank:2 local_rank:3, ip:127.0.0.1 port:3004 world_size:8, rank:3 setting.... setting.... setting.... setting.... setting.... setting.... setting.... setting.... setted... rank:2, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer setted... rank:0, num_ranks:8 Allocating buffer size: 2116.2912 MB ... before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer setted... rank:1, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer setted... rank:5, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer setted... rank:4, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer after cpp.Buffer setted... rank:7, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer setted... rank:3, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer after cpp.Buffer setted... 
rank:6, num_ranks:8 before init buffer buffer.__init__ before cpp.Buffer after cpp.Buffer finish all_gather object finish all_gather object finish all_gather object finish all_gather object finish all_gather object finish all_gather object finish all_gather object finish all_gather object before runtime.sync before runtime.sync before runtime.sync before runtime.sync before runtime.sync before runtime.sync before runtime.sync before runtime.sync /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ep_connect failed transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed ibv_modify_qp failed ibv_modify_qp failed ibv_modify_qp failed ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed ep_connect failed ep_connect failed ep_connect failed ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed transport create connect failed transport create connect failed transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero 
status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed connect EPS failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed nvshmem setup connections failed nvshmem setup connections failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed ep_connect failed ep_connect failed ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ep_connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 transport create connect failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ibv_modify_qp failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 ibv_modify_qp failed ibv_modify_qp failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 transport create connect failed connect EPS failed /work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed 
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed
/work/nvshmem_src_3.2.5-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem initialization failed, exiting
(the same error sequence is repeated by each rank)
W0305 11:55:57.242000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19119 via signal SIGTERM
W0305 11:55:57.244000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19120 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19121 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19122 via signal SIGTERM
W0305 11:55:57.245000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19124 via signal SIGTERM
W0305 11:55:57.246000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19125 via signal SIGTERM
W0305 11:55:57.246000 19054 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 19126 via signal SIGTERM
Traceback (most recent call last):
  File "/work/DeepEP-main/tests/test_low_latency.py", line 164, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 4 terminated with exit code 255
By following the logs, we found the call trace was
test_loop() --> buffer = deep_ep.Buffer() --> __init__ --> self.runtime.sync(device_ids, ipc_handles, root_unique_id) in DeepEP, and then nvshmemt_ibrc_connect_endpoints --> nvshmemt_ibrc_ep_connect --> ep_connect --> status = ftable.modify_qp(ep->qp, &attr, flags); in nvshmem_src_3.2.5-1. The status 110 is usually ETIMEDOUT.
Could you add the environment variable NVSHMEM_DEBUG=TRACE, run it again, and provide the log?
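For reference, a minimal sketch of enabling the requested trace output. NVSHMEM reads these variables from the process environment at initialization time, so they must be set before the first deep_ep.Buffer is constructed (or simply exported in the shell before launching the test script). The NVSHMEM_DEBUG_FILE variable is optional; verify its name against your NVSHMEM version's documentation:

```python
import os

# NVSHMEM picks these up during nvshmem_init, so set them before the
# first deep_ep.Buffer is created (or export them in the launching shell).
os.environ["NVSHMEM_DEBUG"] = "TRACE"
# Optional: redirect each process's debug output to its own file
# (%h = hostname, %p = pid) -- check your NVSHMEM docs for this variable.
os.environ["NVSHMEM_DEBUG_FILE"] = "nvshmem_%h_%p.log"
```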
I have created a new branch that updates NVSHMEM to version 3.2.5. However, I don't have a RoCE environment for verification. Can you test this out? @Baibaifan Branch: https://github.com/deepseek-ai/DeepEP/tree/roce-support
OK, I will try it. Does the cmake configuration need to be modified? @sphish
CUDA_HOME=/path/to/cuda \
GDRCOPY_HOME=/path/to/gdrcopy \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install
Nope.
Hi @sphish, the process works, but the performance does not seem to meet expectations.
env:
- H100 80GB HBM3 * 8
- 4 * CX7 NICs, 400 Gb/s
- RoCE
rank0:
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
rank1:
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)
The gap between these speeds and the official data is larger than the difference between the network cards would explain. How can I test the performance with perftest? Is there any reference?
We haven't tested on a machine with 4 NICs, but I don't think the performance should be this low. Could you try running alltoall_perf from nccl-tests?
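For reference, a typical nccl-tests alltoall_perf invocation looks like the following. Hostnames, GPU counts, and the environment variables are placeholders for your cluster, not values from this thread:

```shell
# Hypothetical 2-node x 8-GPU run of nccl-tests' alltoall_perf via MPI;
# replace node0/node1 and the NCCL settings with your own.
mpirun -np 16 -H node0:8,node1:8 \
    -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5 \
    ./build/alltoall_perf -b 1M -e 1G -f 2 -g 1
```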
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
After modifying the wrong OOB configuration, the current speed with the 4 NICs is:
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 24, RDMA chunk 32: 41.14 GB/s (RDMA), 134.29 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 8, RDMA chunk 28: 45.36 GB/s (RDMA), 148.06 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 12: 45.20 GB/s (RDMA), 147.55 GB/s (NVL)
Is this reasonable, given that I have 4 NICs? @sphish
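As a rough sanity check on those numbers, here is the pure unit-conversion arithmetic (not a measured figure; how DeepEP's reported per-rank bandwidth maps onto NIC line rate depends on the GPU-to-NIC mapping and on how the test aggregates traffic):

```python
# Line-rate arithmetic for a node with 4 x 400 Gb/s NICs shared by 8 GPUs.
def gbps_to_GBps(gbps: float) -> float:
    """Convert network Gb/s to GB/s (8 bits per byte, ignoring protocol overhead)."""
    return gbps / 8.0

per_nic = gbps_to_GBps(400)       # 50.0 GB/s per CX7 NIC
node_total = 4 * per_nic          # 200.0 GB/s aggregate across 4 NICs
per_gpu_share = node_total / 8    # 25.0 GB/s per GPU if traffic were evenly split

print(per_nic, node_total, per_gpu_share)
```

Treat these as upper bounds under idealized assumptions; real RDMA payload bandwidth is lower due to headers and protocol overhead.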
By following the logs, we found the call trace was test_loop() --> buffer = deep_ep.Buffer() --> __init__ --> self.runtime.sync(device_ids, ipc_handles, root_unique_id) in DeepEP, and then nvshmemt_ibrc_connect_endpoints --> nvshmemt_ibrc_ep_connect --> ep_connect --> status = ftable.modify_qp(ep->qp, &attr, flags); in nvshmem_src_3.2.5-1. The status 110 is usually ETIMEDOUT.
Could you add the environment variable NVSHMEM_DEBUG=TRACE, run it again, and provide the log?
Hi, @sphish,
In my setting, I successfully run test_internode.py on two nodes and test_intranode.py on one node, but test_low_latency.py still cannot pass; it fails with ibv_modify_qp failed. I then ran some perftest commands and made the following observations:
- TCP-based QP connection always fails with RTR status between differently indexed CX cards (e.g., mlx5_0 to mlx5_1), while it succeeds between identically indexed cards (e.g., mlx5_0 to mlx5_0).
- CM-based QP connection always succeeds and the perftest command passes.
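The TCP- vs CM-based distinction above corresponds to perftest's connection-setup mode: by default ib_write_bw exchanges QP information over a plain TCP socket, while the -R flag connects the QPs through rdma_cm. A sketch of the comparison (device names and the server IP are placeholders):

```shell
# Default mode: QP info exchanged over a TCP socket -- the mode that fails
# across differently indexed cards in the observation above.
ib_write_bw -d mlx5_0 --report_gbits                 # server side
ib_write_bw -d mlx5_1 --report_gbits <server_ip>     # client side

# rdma_cm-based connection setup (-R) -- the mode that succeeds.
ib_write_bw -R -d mlx5_0 --report_gbits              # server side
ib_write_bw -R -d mlx5_1 --report_gbits <server_ip>  # client side
```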
I also set the environment variable NVSHMEM_DEBUG=TRACE and ran it again. Please see the relevant part of the logs below. They show that the NVSHMEM driver tries to build the EP connection multiple times and finally fails.
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3867 Begin - Ordered list of devices for assignment (after processing user provdied env vars (if any)) -
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=0 (of 9), device id=0, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=1 (of 9), device id=1, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=2 (of 9), device id=2, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=3 (of 9), device id=3, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=4 (of 9), device id=5, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=5 (of 9), device id=6, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=6 (of 9), device id=7, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=7 (of 9), device id=8, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3871 Ordered list of devices for assignment - idx=8 (of 9), device id=9, port_num=1
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3880 End - Ordered list of devices for assignment (after processing user provdied env vars (if any))
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3905 NIC buffer will be on GPU memory.
/work/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp 3916 NIC handler will be GPU.
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 3 nranks 8 size 32
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 2 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 1 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 7 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 6 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 0 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 5 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 4 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 2 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 1 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 3 nranks 8 size 32 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 6 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 5 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 0 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 7 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 4 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:480: rank 3 nranks 8 size 16
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 2 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 1 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 6 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 0 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 5 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 7 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 4 nranks 8 size 16 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_allgather:503: rank 3 nranks 8 size 16 - DONE
glusterfs-06:1692:1692 [0] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 0
glusterfs-06:1697:1697 [5] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 6
glusterfs-06:1693:1693 [1] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 1
glusterfs-06:1696:1696 [4] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 5
glusterfs-06:1698:1698 [6] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 7
glusterfs-06:1694:1694 [2] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 2
glusterfs-06:1699:1699 [7] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 8
glusterfs-06:1695:1695 [3] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 4
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 6 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 0 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 5 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 1 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 4 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 2 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 7 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 3 nranks 8 size 48
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 5 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 6 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 4 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 3 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 2 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 1 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 0 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 7 nranks 8 size 48 - DONE
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0
/work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ep_connect failed
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 transport create connect failed
/work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed
/work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem initialization failed, exiting
(the same error sequence is repeated by each of the 8 ranks)
W0307 15:04:39.317000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1692 via signal SIGTERM
W0307 15:04:39.318000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1693 via signal SIGTERM
W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1694 via signal SIGTERM
W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1695 via signal SIGTERM
W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1696 via signal SIGTERM
W0307 15:04:39.320000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1697 via signal SIGTERM
W0307 15:04:39.320000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1699 via signal SIGTERM
Traceback (most recent call last):
File "/work/DeepEP-roce-support/tests/test_low_latency.py", line 160, in <module>
torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with exit code 255
@Baibaifan What is the OOB configuration? Regarding the performance issue, I agree: it appears that the bandwidth is limited by the NICs.
Could you please send me the command to run perftest for reference? @kunfupanda-hw
The OOB configuration is NCCL_SOCKET_IFNAME; there was something wrong with the previous test setup. I was running the test_low_latency.py test and encountered the same problem as @kunfupanda-hw. @sphish
Hi @Baibaifan, are the MASTER_ADDRs of rank 0 and rank 1 the same? Are they the IPs of the CX7 cards or the host IP?
The MASTER_ADDRs are the same; both are the host IP. They are only used for the bootstrap.
Hi @Baibaifan, what configuration have you changed? I have 4 IB NICs per node, and there is also a large difference between my self-tested data (4.47 GB/s RDMA) and the official data.
Could you please send me the command to run perftest for reference? @kunfupanda-hw
server: ib_write_bw -d your_card_name -a --report_gbits -n 1000 -q 16 --CPU-freq -p 1
client: ib_write_bw -d your_card_name -a --report_gbits -n 1000 -q 16 your_ip --CPU-freq -p 1
I mean the tests in perftest/perftest_install, in the nvshmem_src directory. @kunfupanda-hw
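For anyone looking for those binaries, a sketch of running one of NVSHMEM's bundled point-to-point perftests. The directory layout and binary names vary across NVSHMEM versions, so treat the paths below as illustrative, not authoritative:

```shell
# Illustrative paths: check your own perftest_install tree for the exact layout.
cd /path/to/nvshmem_src/perftest_install
# Launch a 2-PE put-bandwidth test with NVSHMEM's hydra launcher (nvshmrun).
nvshmrun -n 2 ./device/pt-to-pt/shmem_put_bw
```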
Please check the NCCL network configuration, for example NCCL_IB_HCA and NCCL_SOCKET_IFNAME. @MinhuiWan
@Baibaifan If you are looking for some performance references for nvshmem perftest, there are some official P2P benchmark results in this link. As for collective communication, we haven't tested it either.
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 0 nranks 8 size 48 /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 5 nranks 8 size 48 /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 1 nranks 8 size 48 /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 4 nranks 8 size 48 /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 2 nranks 8 size 48 /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 7 nranks 8 size 48 /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:516: rank 3 nranks 8 size 48 /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 5 nranks 8 size 48 - DONE /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 6 nranks 8 size 48 - DONE /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 4 nranks 8 size 48 - DONE 
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 3 nranks 8 size 48 - DONE /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 2 nranks 8 size 48 - DONE /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 1 nranks 8 size 48 - DONE /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address 
/work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 0 nranks 8 size 48 - DONE /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address /work/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_alltoall:546: rank 7 nranks 8 size 48 - DONE /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 103 NVSHMEM_IB_ADDR_FAMILY set by environment to AF_INET /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 126 NVSHMEM_IB_ADDR_RANGE set by environment to ::/0 /work/nvshmem_src/src/modules/transport/common/transport_ib_common.h 138 NET/IB: Ip address '::' is invalid for family AF_INET, ignoring address /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 
/work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 ibv_modify_qp failed ibv_modify_qp failed ibv_modify_qp failed ep_connect failed ibv_modify_qp failed /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:418: non-zero status: 110 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ep_connect failed /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 ibv_modify_qp failed ibv_modify_qp failed ep_connect failed ibv_modify_qp failed ep_connect failed transport create connect failed ep_connect failed /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1433: non-zero status: 7 transport create connect failed /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 transport create connect failed /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 ep_connect failed ep_connect failed ep_connect failed transport create connect 
failed connect EPS failed transport create connect failed /work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 /work/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1500: non-zero status: 7 connect EPS failed /work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed transport create connect failed /work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 transport create connect failed transport create connect failed nvshmem setup connections failed connect EPS failed connect EPS failed /work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: /work/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 nvshmem setup connections failed /work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 connect EPS failed connect EPS failed nvshmem initialization failed, exiting connect EPS failed nvshmem setup connections failed /work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem setup connections failed /work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem initialization failed, exiting nvshmem initialization failed, exiting /work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src/src/host/init/init.cu:1007: 
non-zero status: 7 /work/nvshmem_src/src/host/init/init.cu:1007: non-zero status: 7 /work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: /work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem setup connections failed nvshmem setup connections failed nvshmem setup connections failed nvshmem initialization failed, exiting nvshmem initialization failed, exiting /work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: /work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: /work/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1080: nvshmem initialization failed, exiting nvshmem initialization failed, exiting nvshmem initialization failed, exiting W0307 15:04:39.317000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1692 via signal SIGTERM W0307 15:04:39.318000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1693 via signal SIGTERM W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1694 via signal SIGTERM W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1695 via signal SIGTERM W0307 15:04:39.319000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1696 via signal SIGTERM W0307 15:04:39.320000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1697 via signal SIGTERM W0307 15:04:39.320000 1627 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1699 via signal SIGTERM Traceback (most recent call last): File "/work/DeepEP-roce-support/tests/test_low_latency.py", line 160, in <module> torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes) File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method="spawn") 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes while not context.join(): ^^^^^^^^^^^^^^ File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join raise ProcessExitedException( torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with exit code 255
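Two details in the trace above can be checked quickly. The `non-zero status: 110` from `ibv_modify_qp` is a raw errno, and the `NVSHMEM_IB_ADDR_RANGE` warning comes from combining an IPv6 range with `AF_INET`. A small Python sketch (assuming a Linux errno table) decodes both:

```python
import errno
import ipaddress
import os

# status 110 reported by ibrc.cpp for ibv_modify_qp is a plain errno:
print(os.strerror(110))        # 'Connection timed out' (ETIMEDOUT) on Linux,
                               # the classic symptom of a RoCE GID/routing mismatch
assert errno.ETIMEDOUT == 110  # holds on Linux; errno values are OS-specific

# The trace also warns that NVSHMEM_IB_ADDR_RANGE='::/0' is ignored because
# NVSHMEM_IB_ADDR_FAMILY=AF_INET: '::/0' is an IPv6 range, not an IPv4 one.
rng = ipaddress.ip_network("::/0")
print(rng.version)             # 6 -> mismatches AF_INET, so NVSHMEM drops it
```

So the address-range setting is silently discarded here, and the QP then times out while being moved to RTR, which is consistent with an IP/GID configuration problem rather than a DeepEP bug.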
@kunfupanda-hw I'm sorry, I don't have any ideas. Are you using a RoCE network? You can try running the shmem_put_bw in nvshmem. If that doesn't work, it might be an issue with your IP configuration.
Hi @sphish, the process works, but the performance does not seem to meet expectations.

env:
- H100 80GB HBM3 ×8 / HPC
- 4 × CX7 NICs, 400 Gb/s / HPC
- RoCE

rank0:
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
rank1:
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 8: 4.09 GB/s (RDMA), 13.36 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 32: 4.57 GB/s (RDMA), 14.91 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 32: 2.97 GB/s (RDMA), 9.69 GB/s (NVL)

The gap between the speeds above and the official numbers is larger than the difference in network cards alone would explain. How can I test the performance with perftest? Is there any reference?
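For context on these numbers, a rough bandwidth budget for the reported hardware (a sketch; it assumes 4 × 400 Gb/s CX7 NICs shared evenly by 8 GPUs and ignores protocol overhead):

```python
# Back-of-envelope RDMA budget per GPU (illustrative assumptions only).
nics = 4                  # CX7 NICs per node
gbit_per_nic = 400        # link speed in Gb/s
gpus = 8                  # H100 GPUs per node

node_gbytes = nics * gbit_per_nic / 8   # 200.0 GB/s aggregate per node
per_gpu = node_gbytes / gpus            # RDMA budget per GPU
print(per_gpu)                          # 25.0 GB/s -- the ~4.6 GB/s reported
                                        # above is far below this ceiling
```

That gap is why the thread below points at NIC selection and network configuration rather than raw link speed.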
Hi @Baibaifan, are the MASTER_ADDRs of rank0/1 the same? Are they the IPs of the CX7 cards or the host IP?
MASTER_ADDRs are the same, both are the host IP. It is used to build the bootstrap.
Hi @Baibaifan, what configuration have you changed? I have 4 IB NICs per node, and there is also a large difference between my self-tested data (4.47 GB/s RDMA) and the official data.
Please check the NCCL network configuration, for example NCCL_IB_HCA and NCCL_SOCKET_IFNAME. @MinhuiWan
@Baibaifan If you are looking for performance references for the nvshmem perftest, there are some official P2P benchmark results in this link. As for collective communication, we haven't tested it either.
I am using RoCE with 4 NICs and have a problem running test_low_latency.py. I want to use perftest/perftest_install in the nvshmem_src directory, for example shmem_put_bw, and would like a reference for how to use perftest. @sphish
Thanks for your reply @Baibaifan. I will set NCCL_IB_HCA and NCCL_SOCKET_IFNAME and test. But I have a question: DeepEP does not rely on NCCL, so why can NCCL environment variables control it?
The unit tests use NCCL to create the communication group, and I found that there was a problem with my previous settings. @MinhuiWan
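To make that concrete: the NCCL variables steer only the communication-group bootstrap that the tests build, while NVSHMEM selects NICs for the actual RDMA data plane through its own variables. A hedged sketch, where the interface and mlx5 device names are placeholders for this cluster:

```shell
# NCCL side: used by the unit tests' communication-group setup.
export NCCL_SOCKET_IFNAME=eth0                   # bootstrap TCP interface (placeholder)
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3   # RDMA NICs for NCCL (placeholder)

# NVSHMEM side: used by DeepEP's RDMA data plane.
export NVSHMEM_HCA_LIST=mlx5_0,mlx5_1,mlx5_2,mlx5_3
```

Keeping both lists pointed at the same set of RoCE NICs avoids the bootstrap and the data plane silently using different interfaces.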