[Bug] multinode fails with RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
As the title says: multinode launch fails during torch distributed initialization with a Gloo connectFullMesh error.
[2025-01-30 18:11:28 TP8] Init torch distributed begin.
[2025-01-30 18:11:28 TP14] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP14] Init torch distributed begin.
[rank10]:[E130 18:11:34.707542570 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[2025-01-30 18:11:34 TP10] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 239, in __init__
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
self.model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 177, in __init__
min_per_gpu_memory = self.init_torch_distributed()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 245, in init_torch_distributed
init_distributed_environment(
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 997, in init_distributed_environment
_WORLD = init_world_group(ranks, local_rank, backend)
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 868, in init_world_group
return GroupCoordinator(
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 202, in __init__
cpu_group = torch.distributed.new_group(ranks, backend="gloo")
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
func_return = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4565, in new_group
return _new_group_with_tag(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4648, in _new_group_with_tag
pg, pg_store = _new_process_group_helper(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1744, in _new_process_group_helper
backend_class = ProcessGroupGloo(
RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[2025-01-30 18:11:34] Received sigquit from a child proces. It usually means the child failed.
[2025-01-30 18:11:34] Received sigquit from a child proces. It usually means the child failed.
Reproduction
2 nodes x 8 H200 each, connected over 400 Gbps Ethernet (not InfiniBand), which should still work. $IPRANK0 is set to the IP address of rank 0.
##############################################
# SGLANG 2 nodes
# node 1
docker run -d --gpus all \
--shm-size 32g \
--network=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--name sglang_multinode1 \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr $IPRANK0:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 5000
# node 2
docker run -d --gpus all \
--shm-size 32g \
--network=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--name sglang_multinode2 \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr $IPRANK0:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 5000
Both nodes fail with this setup. I even tried starting them at the same time, but it still fails.
Logs from main node:
==========
== CUDA ==
==========
CUDA Version 12.4.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2025-01-30 18:11:14] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-V3', tokenizer_path='deepseek-ai/DeepSeek-V3', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantiza>
[2025-01-30 18:11:27 TP1] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:27 TP1] Init torch distributed begin.
[2025-01-30 18:11:28 TP2] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP2] Init torch distributed begin.
[2025-01-30 18:11:28 TP3] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP3] Init torch distributed begin.
[2025-01-30 18:11:28 TP7] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP7] Init torch distributed begin.
[2025-01-30 18:11:29 TP0] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:29 TP0] Init torch distributed begin.
[2025-01-30 18:11:29 TP6] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:29 TP6] Init torch distributed begin.
[2025-01-30 18:11:29 TP4] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:29 TP4] Init torch distributed begin.
[2025-01-30 18:11:29 TP5] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:29 TP5] Init torch distributed begin.
[rank3]:[E130 18:11:34.760420724 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank1]:[E130 18:11:34.760436798 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank7]:[E130 18:11:34.770302141 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank4]:[E130 18:11:34.770875861 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:19590
[rank5]:[E130 18:11:34.771009690 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:12426
[rank6]:[E130 18:11:34.771044792 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:21205
[rank0]:[E130 18:11:34.771086854 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:704
[2025-01-30 18:11:34 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 239, in __init__
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
self.model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 177, in __init__
min_per_gpu_memory = self.init_torch_distributed()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 245, in init_torch_distributed
init_distributed_environment(
...
Logs from rank 1:
==========
== CUDA ==
==========
CUDA Version 12.4.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2025-01-30 18:11:14] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-V3', tokenizer_path='deepseek-ai/DeepSeek-V3', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantiza>
[2025-01-30 18:11:27 TP13] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:27 TP13] Init torch distributed begin.
[2025-01-30 18:11:28 TP11] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP11] Init torch distributed begin.
[2025-01-30 18:11:28 TP15] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP15] Init torch distributed begin.
[2025-01-30 18:11:28 TP10] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP10] Init torch distributed begin.
[2025-01-30 18:11:28 TP12] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP12] Init torch distributed begin.
[2025-01-30 18:11:28 TP9] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP9] Init torch distributed begin.
[2025-01-30 18:11:28 TP8] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP8] Init torch distributed begin.
[2025-01-30 18:11:28 TP14] MLA optimization is turned on. Use triton backend.
[2025-01-30 18:11:28 TP14] Init torch distributed begin.
[rank10]:[E130 18:11:34.707542570 ProcessGroupGloo.cpp:143] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[2025-01-30 18:11:34 TP10] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 239, in __init__
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
I've truncated the long line that contains an API key, etc.
Environment
(sglang) shadeform@shadeform-system14:~$ docker run --gpus all -ti --entrypoint=bash lmsysorg/sglang:latest
root@c43d1bb6e7c7:/sgl-workspace# python3 -m sglang.check_env
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
warnings.warn(message, UserWarning)
Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.144.03
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.1
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.7
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.60.1
anthropic: 0.45.0
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS NODE PIX SYS SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX PHB PHB SYS NODE NODE SYS SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS NODE NODE SYS SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS PIX NODE SYS SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS NODE SYS SYS NODE NODE NODE PIX NODE 96-191,288-383 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE SYS SYS NODE NODE NODE NODE PIX 96-191,288-383 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE SYS SYS PIX PHB PHB NODE NODE 96-191,288-383 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS PIX SYS SYS NODE NODE NODE NODE NODE 96-191,288-383 1 N/A
NIC0 NODE NODE PIX NODE SYS SYS SYS SYS X NODE NODE NODE SYS NODE NODE SYS SYS SYS SYS SYS
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X PHB PHB SYS NODE NODE SYS SYS SYS SYS SYS
NIC2 NODE PHB NODE NODE SYS SYS SYS SYS NODE PHB X PIX SYS NODE NODE SYS SYS SYS SYS SYS
NIC3 NODE PHB NODE NODE SYS SYS SYS SYS NODE PHB PIX X SYS NODE NODE SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS X SYS SYS NODE NODE NODE NODE NODE
NIC5 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE NODE SYS X NODE SYS SYS SYS SYS SYS
NIC6 PIX NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS NODE X SYS SYS SYS SYS SYS
NIC7 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE SYS SYS X PHB PHB NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE PHB NODE SYS SYS SYS SYS NODE SYS SYS PHB X PIX NODE NODE
NIC9 SYS SYS SYS SYS NODE NODE PHB NODE SYS SYS SYS SYS NODE SYS SYS PHB PIX X NODE NODE
NIC10 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS NODE SYS SYS NODE NODE NODE X NODE
NIC11 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE SYS SYS NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
ulimit soft: 1048576
root@c43d1bb6e7c7:/sgl-workspace#
Hi, how do you solve this?
@pseudotensor how do you solve this? Is this a hardware problem?
I didn't solve it. I don't know why I closed the issue. I just moved on as I couldn't use multinode.
Setting the env var GLOO_SOCKET_IFNAME works for me.
> Setting the env var GLOO_SOCKET_IFNAME works for me.

Did you use an IB network?
I was using a 400 Gbps Ethernet network, not IB, and couldn't get it to work no matter how I passed the device names or IPs.
> Setting the env var GLOO_SOCKET_IFNAME works for me.
> Did you use an IB network?

Yes.
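For anyone hitting this on a plain Ethernet setup, the variable can be passed into both containers with docker -e. A minimal sketch, assuming eth0 is the NIC that carries the inter-node traffic (check ip a on the host and substitute the real interface name):
docker run -d --gpus all \
  --shm-size 32g \
  --network=host \
  --ipc=host \
  -e GLOO_SOCKET_IFNAME=eth0 \
  -e NCCL_SOCKET_IFNAME=eth0 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --name sglang_multinode1 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr $IPRANK0:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 5000
The node-rank-1 command takes the same two -e flags.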
Same error here
root@zentek:/sgl-workspace# python3 -m sglang.check_env
INFO 02-16 12:57:32 __init__.py:190] Automatically detected platform cuda.
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.144.03
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post1+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.0
tiktoken: 0.8.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE 56-111,168-223 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE NODE NODE NODE 56-111,168-223 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE NODE PIX 56-111,168-223 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE NODE NODE 56-111,168-223 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE SYS SYS SYS SYS
NIC3 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE
NIC5 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE
NIC6 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE
NIC7 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
ulimit soft: 1048576
This is because the Gloo connection cannot be established.
import torch.distributed as dist
dist.init_process_group(backend="gloo")
Can you run this script using torchrun on two nodes?
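For reference, a minimal way to run that two-line check across two nodes with torchrun, assuming it is saved as check_gloo.py and that eth0, $IPRANK0, and port 29500 stand in for the real interface, rank-0 address, and a free port:
# node 0
GLOO_SOCKET_IFNAME=eth0 torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
  --master_addr=$IPRANK0 --master_port=29500 check_gloo.py
# node 1: same command with --node_rank=1
If this hangs or fails with the same connectFullMesh error, the Gloo TCP connectivity between the nodes is the likely culprit rather than anything sglang-specific.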
I'm also experiencing this issue. I'm using two H100 nodes with 400 Gbps RoCE v2 between them. I have tested NCCL and Gloo between the two nodes using the following program.
import os
import time

import torch
import torch.distributed as dist


def setup(rank):
    print("Before INIT")
    dist.init_process_group(backend="gloo")  # use "nccl" for the GPU test
    print("AFTER INIT")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())


def cleanup():
    dist.destroy_process_group()


def ping_pong(rank, num_iters, tensor_size):
    # Simple send/recv latency test between rank 0 and rank 1.
    device = torch.device("cpu")  # cpu for gloo, cuda:0 for nccl
    tensor = torch.ones(tensor_size, dtype=torch.float32, device=device)
    latencies = []
    for i in range(num_iters):
        torch.cuda.synchronize()
        start_time = time.time()
        if rank == 0:
            dist.send(tensor, dst=1)
            # dist.recv(tensor, src=1)
        else:
            dist.recv(tensor, src=0)
            # dist.send(tensor, dst=0)
        torch.cuda.synchronize()
        end_time = time.time()
        rtt = (end_time - start_time) * 1000
        latencies.append(rtt)
        if rank == 0:
            print(f"Iteration {i+1}: RTT = {rtt:.3f} ms")
    if rank == 0:
        avg_latency = sum(latencies) / num_iters
        print(f"Tensor size: {tensor_size*4 / 1024 / 1024} MB")
        print(f"Average RTT over {num_iters} iterations: {avg_latency:.3f} ms")
        print(f"Throughput: {tensor_size*4 / avg_latency * 1000 / 1024 / 1024 / 1024}")
        print("\n")


def run_ping_pong(rank, world_size):
    setup(rank)
    for ts in [32*1024, 128*1024, 1024*1024, 4096*1024, 4096*4096, 4096*4096*4, 4096*4096*16]:
        ping_pong(rank, 5, ts)
    cleanup()


def main():
    # WORLD_SIZE and RANK are set by the launcher (e.g. torchrun).
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    run_ping_pong(rank, world_size)


if __name__ == "__main__":
    main()
The env settings are as follows:
export NCCL_IB_HCA=mlx5_0
# ens255np0 and enp92s0np0 are two different RoCE NICs, as shown by ip a
export NCCL_SOCKET_IFNAME=ens255np0
export GLOO_SOCKET_IFNAME=enp92s0np0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_NET_GDR_LEVEL=2
export NCCL_DEBUG=INFO
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_TIMEOUT=22
But I found that when using
python3 -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 --dist-init-addr ip:port --nnodes 2 --node-rank 0 --trust-remote-code
sometimes (about 1/8 of the time) it loads the checkpoint fine but then I hit other errors :-), and the rest of the time (about 7/8) I get the same error as above. This is very confusing.
Here is part of my conda env; sglang is built from source. I will also test sglang installed from pip.
torch 2.5.1
triton 3.1.0
vllm 0.7.2
sgl-kernel 0.0.3.post6
sglang 0.4.3.post2
nvidia-nccl-cu12 2.24.3 (manually modified)
I also randomly encounter the same problem on two H100 nodes with an IB network, and the only way to fix it is to re-run the command, which is very confusing.
Same issue.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.