[Bug] multinode fails with RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
As the title says: multinode launch fails during torch distributed initialization with a Gloo connectFullMesh error.
[2025-01-30 18:11:28 TP8] Init torch distributed begin.
[2025-01-30 18:11:28 TP14] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP14] Init torch distributed begin.
[rank10]:[E130 18:11:34.707542570 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[2025-01-30 18:11:34 TP10] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 239, in __init__
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
self.model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 177, in __init__
min_per_gpu_memory = self.init_torch_distributed()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 245, in init_torch_distributed
init_distributed_environment(
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 997, in init_distributed_environment
_WORLD = init_world_group(ranks, local_rank, backend)
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 868, in init_world_group
return GroupCoordinator(
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 202, in __init__
cpu_group = torch.distributed.new_group(ranks, backend="gloo")
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
func_return = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4565, in new_group
return _new_group_with_tag(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4648, in _new_group_with_tag
pg, pg_store = _new_process_group_helper(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1744, in _new_process_group_helper
backend_class = ProcessGroupGloo(
RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[2025-01-30 18:11:34] Received sigquit from a child proces. It usually means the child failed.
[2025-01-30 18:11:34] Received sigquit from a child proces. It usually means the child failed.
Reproduction
2 nodes x 8 H200 each, connected over 400 Gbps Ethernet (not InfiniBand), which should still work. $IPRANK0 is set to the IP address of rank 0.
##############################################
# SGLANG 2 nodes
# node 1
docker run -d --gpus all \
--shm-size 32g \
--network=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--name sglang_multinode1 \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr $IPRANK0:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 5000
# node 2
docker run -d --gpus all \
--shm-size 32g \
--network=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--name sglang_multinode2 \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr $IPRANK0:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 5000
Both nodes fail with this setup. I even tried starting them at the same time, but it still fails.
Logs from main node:
==========
== CUDA ==
==========
CUDA Version 12.4.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2025-01-30 18:11:14] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-V3', tokenizer_path='deepseek-ai/DeepSeek-V3', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantiza>
[2025-01-30 18:11:27 TP1] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:27 TP1] Init torch distributed begin.
[2025-01-30 18:11:28 TP2] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP2] Init torch distributed begin.
[2025-01-30 18:11:28 TP3] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP3] Init torch distributed begin.
[2025-01-30 18:11:28 TP7] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP7] Init torch distributed begin.
[2025-01-30 18:11:29 TP0] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:29 TP0] Init torch distributed begin.
[2025-01-30 18:11:29 TP6] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:29 TP6] Init torch distributed begin.
[2025-01-30 18:11:29 TP4] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:29 TP4] Init torch distributed begin.
[2025-01-30 18:11:29 TP5] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:29 TP5] Init torch distributed begin.
[rank3]:[E130 18:11:34.760420724 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank1]:[E130 18:11:34.760436798 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank7]:[E130 18:11:34.770302141 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank4]:[E130 18:11:34.770875861 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:19590
[rank5]:[E130 18:11:34.771009690 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:12426
[rank6]:[E130 18:11:34.771044792 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:21205
[rank0]:[E130 18:11:34.771086854 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:704
[2025-01-30 18:11:34 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 239, in __init__
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
self.model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 177, in __init__
min_per_gpu_memory = self.init_torch_distributed()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 245, in init_torch_distributed
init_distributed_environment(
...
Logs from rank 1:
==========
== CUDA ==
==========
CUDA Version 12.4.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2025-01-30 18:11:14] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-V3', tokenizer_path='deepseek-ai/DeepSeek-V3', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantiza>
[2025-01-30 18:11:27 TP13] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:27 TP13] Init torch distributed begin.
[2025-01-30 18:11:28 TP11] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP11] Init torch distributed begin.
[2025-01-30 18:11:28 TP15] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP15] Init torch distributed begin.
[2025-01-30 18:11:28 TP10] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP10] Init torch distributed begin.
[2025-01-30 18:11:28 TP12] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP12] Init torch distributed begin.
[2025-01-30 18:11:28 TP9] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP9] Init torch distributed begin.
[2025-01-30 18:11:28 TP8] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP8] Init torch distributed begin.
[2025-01-30 18:11:28 TP14] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP14] Init torch distributed begin.
[rank10]:[E130 18:11:34.707542570 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[2025-01-30 18:11:34 TP10] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 239, in __init__
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
I've truncated the long line that contains an API key, etc.
Environment
(sglang) shadeform@shadeform-system14:~$ docker run --gpus all -ti --entrypoint=bash lmsysorg/sglang:latest
root@c43d1bb6e7c7:/sgl-workspace# python3 -m sglang.check_env
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
warnings.warn(message, UserWarning)
Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.144.03
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.1
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.7
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.60.1
anthropic: 0.45.0
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS NODE PIX SYS SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX PHB PHB SYS NODE NODE SYS SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS NODE NODE SYS SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS PIX NODE SYS SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS NODE SYS SYS NODE NODE NODE PIX NODE 96-191,288-383 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE SYS SYS NODE NODE NODE NODE PIX 96-191,288-383 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE SYS SYS PIX PHB PHB NODE NODE 96-191,288-383 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS PIX SYS SYS NODE NODE NODE NODE NODE 96-191,288-383 1 N/A
NIC0 NODE NODE PIX NODE SYS SYS SYS SYS X NODE NODE NODE SYS NODE NODE SYS SYS SYS SYS SYS
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X PHB PHB SYS NODE NODE SYS SYS SYS SYS SYS
NIC2 NODE PHB NODE NODE SYS SYS SYS SYS NODE PHB X PIX SYS NODE NODE SYS SYS SYS SYS SYS
NIC3 NODE PHB NODE NODE SYS SYS SYS SYS NODE PHB PIX X SYS NODE NODE SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS X SYS SYS NODE NODE NODE NODE NODE
NIC5 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE NODE SYS X NODE SYS SYS SYS SYS SYS
NIC6 PIX NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS NODE X SYS SYS SYS SYS SYS
NIC7 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE SYS SYS X PHB PHB NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE PHB NODE SYS SYS SYS SYS NODE SYS SYS PHB X PIX NODE NODE
NIC9 SYS SYS SYS SYS NODE NODE PHB NODE SYS SYS SYS SYS NODE SYS SYS PHB PIX X NODE NODE
NIC10 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS NODE SYS SYS NODE NODE NODE X NODE
NIC11 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE SYS SYS NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
ulimit soft: 1048576
root@c43d1bb6e7c7:/sgl-workspace#
Hi, how do you solve this?
@pseudotensor how do you solve this? Is this a hardware problem?
I didn't solve it. I don't know why I closed the issue. I just moved on as I couldn't use multinode.
Setting the env var GLOO_SOCKET_IFNAME works for me.
> Setting the env var GLOO_SOCKET_IFNAME works for me.

Did you use an IB network?
I was using a 400 Gbps Ethernet network, not IB, and couldn't get it to work no matter how I passed the device names or IPs.
> Setting the env var GLOO_SOCKET_IFNAME works for me.
> Did you use an IB network?

Yes.
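For anyone hitting this on a plain Ethernet setup, the variable can be passed into both containers with docker -e. A minimal sketch, assuming eth0 is the NIC that carries the inter-node traffic (check ip a on the host and substitute the real interface name):
docker run -d --gpus all \
  --shm-size 32g \
  --network=host \
  --ipc=host \
  -e GLOO_SOCKET_IFNAME=eth0 \
  -e NCCL_SOCKET_IFNAME=eth0 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --name sglang_multinode1 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr $IPRANK0:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 5000
The node-rank-1 command takes the same two -e flags.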
Same error here
root@zentek:/sgl-workspace# python3 -m sglang.check_env
INFO 02-16 12:57:32 __init__.py:190] Automatically detected platform cuda.
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.144.03
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post1+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.0
tiktoken: 0.8.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE 56-111,168-223 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE NODE NODE NODE 56-111,168-223 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE NODE PIX 56-111,168-223 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE NODE NODE 56-111,168-223 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE SYS SYS SYS SYS
NIC3 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE
NIC5 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE
NIC6 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE
NIC7 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
ulimit soft: 1048576
This is because the Gloo connection cannot be established.
import torch.distributed as dist
dist.init_process_group(backend="gloo")
Can you run this script using torchrun on two nodes?
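For reference, a minimal way to run that two-line check across two nodes with torchrun, assuming it is saved as check_gloo.py and that eth0, $IPRANK0, and port 29500 stand in for the real interface, rank-0 address, and a free port:
# node 0
GLOO_SOCKET_IFNAME=eth0 torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
  --master_addr=$IPRANK0 --master_port=29500 check_gloo.py
# node 1: same command with --node_rank=1
If this hangs or fails with the same connectFullMesh error, the Gloo TCP connectivity between the nodes is the likely culprit rather than anything sglang-specific.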
I'm also experiencing this issue. I'm using two H100 nodes with 400 Gbps RoCE v2 between them. I have tested NCCL and Gloo between the two nodes using the following program.
import os
import time

import torch
import torch.distributed as dist


def setup(rank):
    print("Before INIT")
    dist.init_process_group(backend="gloo")  # use "nccl" for the GPU test
    print("AFTER INIT")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())


def cleanup():
    dist.destroy_process_group()


def ping_pong(rank, num_iters, tensor_size):
    # Simple send/recv latency test between rank 0 and rank 1.
    device = torch.device("cpu")  # cpu for gloo, cuda:0 for nccl
    tensor = torch.ones(tensor_size, dtype=torch.float32, device=device)
    latencies = []
    for i in range(num_iters):
        torch.cuda.synchronize()
        start_time = time.time()
        if rank == 0:
            dist.send(tensor, dst=1)
            # dist.recv(tensor, src=1)
        else:
            dist.recv(tensor, src=0)
            # dist.send(tensor, dst=0)
        torch.cuda.synchronize()
        end_time = time.time()
        rtt = (end_time - start_time) * 1000
        latencies.append(rtt)
        if rank == 0:
            print(f"Iteration {i+1}: RTT = {rtt:.3f} ms")
    if rank == 0:
        avg_latency = sum(latencies) / num_iters
        print(f"Tensor size: {tensor_size*4 / 1024 / 1024} MB")
        print(f"Average RTT over {num_iters} iterations: {avg_latency:.3f} ms")
        print(f"Throughput: {tensor_size*4 / avg_latency * 1000 / 1024 / 1024 / 1024}")
        print("\n")


def run_ping_pong(rank, world_size):
    setup(rank)
    for ts in [32*1024, 128*1024, 1024*1024, 4096*1024, 4096*4096, 4096*4096*4, 4096*4096*16]:
        ping_pong(rank, 5, ts)
    cleanup()


def main():
    # WORLD_SIZE and RANK are set by the launcher (e.g. torchrun).
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    run_ping_pong(rank, world_size)


if __name__ == "__main__":
    main()
The env settings are as follows:
export NCCL_IB_HCA=mlx5_0
# ens255np0 and enp92s0np0 are two different RoCE NICs, as shown by ip a
export NCCL_SOCKET_IFNAME=ens255np0
export GLOO_SOCKET_IFNAME=enp92s0np0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_NET_GDR_LEVEL=2
export NCCL_DEBUG=INFO
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_TIMEOUT=22
But I found that when using
python3 -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 --dist-init-addr ip:port --nnodes 2 --node-rank 0 --trust-remote-code
sometimes (about 1/8 of the time) it loads the checkpoint fine but then I hit other errors :-), and the rest of the time (about 7/8) I get the same error as above. This is very confusing.
Here is part of my conda env; sglang is built from source. I will also test sglang installed from pip.
torch 2.5.1
triton 3.1.0
vllm 0.7.2
sgl-kernel 0.0.3.post6
sglang 0.4.3.post2
nvidia-nccl-cu12 2.24.3 (manually modified)
I also randomly encounter the same problem on two H100 nodes with an IB network, and the only way to fix it is to re-run the command, which is very confusing.
Same issue.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.