
[Bug] --enable-torch-compile encounters Watchdog timeout error

Open wangdaw2023 opened this issue 10 months ago • 1 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

I use sglang to deploy DeepSeek R1 BF16 on 4 nodes, each with 8×80GB A800 GPUs. Without --enable-torch-compile, it runs successfully. I tried enabling torch.compile, expecting it to improve overall TPS and time to first token.

sglang version: sglang_0.4.2.post4

Reproduction

I start the server with the commands below.

node 0:
python3 -m sglang.launch_server --model-path /cephfs/public_model/DeepSeek-R1-BF16 --tp 32 --dist-init-addr 10.13.10.10:5000 --nnodes 4 --node-rank 0 --trust-remote-code --enable-torch-compile --host 0.0.0.0 --port 30000

node 1:
python3 -m sglang.launch_server --model-path /cephfs/public_model/DeepSeek-R1-BF16 --tp 32 --dist-init-addr 10.13.10.10:5000 --nnodes 4 --node-rank 1 --trust-remote-code --enable-torch-compile

node 2:
python3 -m sglang.launch_server --model-path /cephfs/public_model/DeepSeek-R1-BF16 --tp 32 --dist-init-addr 10.13.10.10:5000 --nnodes 4 --node-rank 2 --trust-remote-code --enable-torch-compile

node 3:
python3 -m sglang.launch_server --model-path /cephfs/public_model/DeepSeek-R1-BF16 --tp 32 --dist-init-addr 10.13.10.10:5000 --nnodes 4 --node-rank 3 --trust-remote-code --enable-torch-compile
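The four launch commands differ only in --node-rank, and node 0 additionally passes --host and --port. As a rough sketch, the same script could be used on every node by building the command from the rank. The function name build_launch_cmd is hypothetical and not part of sglang; every flag is copied from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of sglang): prints the launch command
# for a given node rank. All flags are copied from the commands above.
build_launch_cmd() {
  local rank="$1"
  local cmd="python3 -m sglang.launch_server"
  cmd="${cmd} --model-path /cephfs/public_model/DeepSeek-R1-BF16"
  cmd="${cmd} --tp 32 --dist-init-addr 10.13.10.10:5000"
  cmd="${cmd} --nnodes 4 --node-rank ${rank}"
  cmd="${cmd} --trust-remote-code --enable-torch-compile"
  # Only node 0 passes --host/--port in the original commands.
  if [ "${rank}" -eq 0 ]; then
    cmd="${cmd} --host 0.0.0.0 --port 30000"
  fi
  printf '%s\n' "${cmd}"
}

# Print the command for each of the four nodes.
for rank in 0 1 2 3; do
  build_launch_cmd "${rank}"
done
```

On a given node, the printed line for that node's rank is the command to run.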

I start the benchmark with the command below.

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 500 --random-input 1024 --random-output 1024 --host 10.13.10.10 --port 30000 --model /cephfs/public_model/DeepSeek-R1-BF16

After a while (the server hangs and prints no output), all nodes hit the error below.

[2025-02-21 16:52:22 TP4] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP5] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP7] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP1] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP2] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP0] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP3] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP6] Watchdog timeout (self.watchdog_timeout=300)
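To confirm that every TP rank on a node reported the timeout, the log can be reduced to a sorted list of ranks. This is only a sketch that assumes the log format shown above; the function name timed_out_ranks is hypothetical and not part of sglang.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of sglang): reads log text on stdin and
# prints the unique TP ranks that logged a watchdog timeout, sorted numerically.
timed_out_ranks() {
  grep -oE 'TP[0-9]+] Watchdog timeout' \
    | sed -E 's/^TP([0-9]+).*/\1/' \
    | sort -n -u
}

# Example with two lines copied from the log above.
printf '%s\n' \
  '[2025-02-21 16:52:22 TP4] Watchdog timeout (self.watchdog_timeout=300)' \
  '[2025-02-21 16:52:22 TP0] Watchdog timeout (self.watchdog_timeout=300)' \
  | timed_out_ranks
# prints 0 and 4, one per line
```

Because grep -o emits each match separately, this also works when several log entries end up on a single line.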

Environment

sglang version: sglang_0.4.2.post4

wangdaw2023, Feb 21 '25 09:02