
[Bug] DeepSeek-R1 NCCL WatchDog Timeout Error

Open sitabulaixizawaluduo opened this issue 2 weeks ago • 3 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

When deploying DeepSeek-R1 on H800s, I ran into a very strange bug. I used 4 machines to deploy 2 instances, each instance spanning two nodes. On one of the instances, with prefix caching enabled, once a request's matched prefix reached two tokens, the process blocked until the NCCL watchdog timed out. With prefix caching disabled, certain specific requests still reliably trigger the same hang, while other requests are handled normally. Is there something wrong with the current multi-node communication?
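A client-side timeout makes the hang visible right away rather than waiting for the NCCL watchdog to fire. The sketch below is only an illustration of how such a probe could look; it assumes the OpenAI-compatible endpoint started in the Reproduction section below, a placeholder `<server>` hostname, an arbitrary 60 s timeout, and a stand-in request body (the actual hang-triggering requests are the two scripts further down):

```python
# Illustrative probe, not part of my original scripts: send a request with a
# hard client-side timeout so a blocked scheduler is detected immediately,
# instead of only when the NCCL watchdog eventually fires on the server side.
import requests

payload = {
    "model": "/models/deepseek-r1",
    "prompt": "",        # hypothetical minimal body; the real trigger is in the Reproduction section
    "temperature": 0.8,
    "max_tokens": 10,
}

try:
    r = requests.post("http://<server>:8000/v1/completions", json=payload, timeout=60)
    print(r.json())
except requests.exceptions.Timeout:
    print("No response within 60 s: the server appears blocked in a collective.")
```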

Reproduction

Server start with prefix caching

node rank 0:

```bash
python3 -m sglang.launch_server --mem-fraction-static 0.85 --disable-custom-all-reduce --trust-remote-code --chunked-prefill-size -1 --host 0.0.0.0 --port 8000 --model-path /models/DeepSeek-R1 --tensor-parallel-size 16 --log-requests --dist-init-addr maas-deepseek-r1-rank0:29500 --nnodes 2 --node-rank 0
```

node rank 1:

```bash
python3 -m sglang.launch_server --mem-fraction-static 0.85 --disable-custom-all-reduce --trust-remote-code --chunked-prefill-size -1 --host 0.0.0.0 --port 8000 --model-path /models/DeepSeek-R1 --tensor-parallel-size 16 --log-requests --dist-init-addr maas-deepseek-r1-rank0:29500 --nnodes 2 --node-rank 1
```

Server start without prefix caching

node rank 0:

```bash
python3 -m sglang.launch_server --mem-fraction-static 0.85 --disable-custom-all-reduce --trust-remote-code --chunked-prefill-size -1 --host 0.0.0.0 --port 8000 --model-path /models/DeepSeek-R1 --tensor-parallel-size 16 --log-requests --dist-init-addr maas-deepseek-r1-rank0:29500 --nnodes 2 --node-rank 0 --disable-radix-cache
```

node rank 1:

```bash
python3 -m sglang.launch_server --mem-fraction-static 0.85 --disable-custom-all-reduce --trust-remote-code --chunked-prefill-size -1 --host 0.0.0.0 --port 8000 --model-path /models/DeepSeek-R1 --tensor-parallel-size 16 --log-requests --dist-init-addr maas-deepseek-r1-rank0:29500 --nnodes 2 --node-rank 1 --disable-radix-cache
```
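To help debug the multi-node path, the sketch below shows how the rank-0 launch above could be wrapped with verbose NCCL logging. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables, and the command simply mirrors the rank-0 line above; treat this as an illustration, not part of my original deployment:

```python
# Illustration only: relaunch the node-rank-0 server with verbose NCCL logging.
# NCCL_DEBUG / NCCL_DEBUG_SUBSYS are standard NCCL environment variables;
# the command mirrors the rank-0 launch line in the Reproduction section.
import os
import subprocess

env = dict(os.environ)
env["NCCL_DEBUG"] = "INFO"             # print NCCL setup and collective activity
env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus on initialization and transport selection

cmd = [
    "python3", "-m", "sglang.launch_server",
    "--mem-fraction-static", "0.85",
    "--disable-custom-all-reduce",
    "--trust-remote-code",
    "--chunked-prefill-size", "-1",
    "--host", "0.0.0.0",
    "--port", "8000",
    "--model-path", "/models/DeepSeek-R1",
    "--tensor-parallel-size", "16",
    "--log-requests",
    "--dist-init-addr", "maas-deepseek-r1-rank0:29500",
    "--nnodes", "2",
    "--node-rank", "0",
]
subprocess.run(cmd, env=env, check=True)
```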

With prefix caching enabled, I first run:

```python
import requests
import json
import time
from openai import OpenAI

model = "/models/deepseek-r1"
client = OpenAI(
    base_url="url/v1",
    api_key="NOKEY",
)

completion = client.chat.completions.create(
    model=model,
    # "I put 500 into a wealth-management product with a 7-day annualized yield of 1.7070%; how much can I earn?"
    messages=[{"role": "user", "content": "理财产品投入500,7日年化收益1.7070%,能赚多少"}],
    temperature=0.8,
    max_tokens=10,
)
print(completion)
```

Then I run:

```python
import requests
import json
import time
from openai import OpenAI

model = "/models/deepseek-r1"
client = OpenAI(
    base_url="url/v1",
    api_key="NOKEY",
)

# note: no prompt argument is passed; the messages line is left commented out
completion = client.completions.create(
    model=model,
    # "One of the world's oldest known religions is Buddhism; which country did it originate in, and where are its followers distributed today?"
    # messages=[{"role": "user", "content": "世界上最古老的已知宗教之一是佛教,它起源于哪个国家?现在信徒分布在哪里?"}],
    temperature=0.8,
    max_tokens=10,
)
print(completion)
```

This sequence reliably triggers the blocking behavior.
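To rule out stale radix-cache state between the two runs, the cache could be reset first; the sketch below assumes the native `/flush_cache` endpoint exposed by the sglang HTTP server and a placeholder `<server>` hostname, and is an illustration rather than part of my reproduction:

```python
# Illustration only: clear the radix cache before re-running the two scripts,
# assuming the sglang server exposes the native /flush_cache endpoint.
import requests

resp = requests.get("http://<server>:8000/flush_cache", timeout=10)
print(resp.status_code, resp.text)
```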

Environment

```
Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H800
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.07
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post7
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.8
anthropic: 0.43.1
decord: 0.6.0

NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NV8   NV8   NV8   NV8   NV8   NV8   NV8   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   0-55,112-167    N/A
GPU1  NV8   X     NV8   NV8   NV8   NV8   NV8   NV8   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   SYS   0-55,112-167    N/A
GPU2  NV8   NV8   X     NV8   NV8   NV8   NV8   NV8   NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   SYS   0-55,112-167    N/A
GPU3  NV8   NV8   NV8   X     NV8   NV8   NV8   NV8   NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   SYS   0-55,112-167    N/A
GPU4  NV8   NV8   NV8   NV8   X     NV8   NV8   NV8   SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  NODE  56-111,168-223  N/A
GPU5  NV8   NV8   NV8   NV8   NV8   X     NV8   NV8   SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  NODE  56-111,168-223  N/A
GPU6  NV8   NV8   NV8   NV8   NV8   NV8   X     NV8   SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  NODE  56-111,168-223  N/A
GPU7  NV8   NV8   NV8   NV8   NV8   NV8   NV8   X     SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   NODE  56-111,168-223  N/A
NIC0  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS
NIC1  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  SYS   SYS   SYS   SYS   SYS
NIC2  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  SYS   SYS   SYS   SYS   SYS
NIC3  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     SYS   SYS   SYS   SYS   SYS
NIC4  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  NODE
NIC5  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  NODE
NIC6  SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  NODE
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     NODE
NIC8  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_7
  NIC6: mlx5_8
  NIC7: mlx5_9
  NIC8: mlx5_bond_0

ulimit soft: 2108278520
```

sitabulaixizawaluduo · Feb 07 '25 03:02