[Bug] --enable-torch-compile encounters Watchdog timeout error
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
I use sglang to deploy DeepSeek R1 BF16 on 4 nodes, each with 8×80GB A800 GPUs. Without --enable-torch-compile, it runs successfully. I tried enabling torch.compile, expecting it to improve overall TPS and Time to First Token.
sglang version: sglang_0.4.2.post4
Reproduction
I start the server with the commands below.

node 0:
python3 -m sglang.launch_server --model-path /cephfs/public_model/DeepSeek-R1-BF16 --tp 32 --dist-init-addr 10.13.10.10:5000 --nnodes 4 --node-rank 0 --trust-remote-code --enable-torch-compile --host 0.0.0.0 --port 30000

node 1:
python3 -m sglang.launch_server --model-path /cephfs/public_model/DeepSeek-R1-BF16 --tp 32 --dist-init-addr 10.13.10.10:5000 --nnodes 4 --node-rank 1 --trust-remote-code --enable-torch-compile

node 2:
python3 -m sglang.launch_server --model-path /cephfs/public_model/DeepSeek-R1-BF16 --tp 32 --dist-init-addr 10.13.10.10:5000 --nnodes 4 --node-rank 2 --trust-remote-code --enable-torch-compile

node 3:
python3 -m sglang.launch_server --model-path /cephfs/public_model/DeepSeek-R1-BF16 --tp 32 --dist-init-addr 10.13.10.10:5000 --nnodes 4 --node-rank 3 --trust-remote-code --enable-torch-compile
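The four launch commands above differ only in --node-rank, plus --host/--port on node 0. For anyone reproducing this, here is a minimal Python sketch that generates them. The helper name build_launch_cmd is hypothetical (not part of sglang); every flag and value is copied from the commands above.

```python
import shlex

# Values copied from the launch commands in this report.
MODEL_PATH = "/cephfs/public_model/DeepSeek-R1-BF16"
DIST_INIT_ADDR = "10.13.10.10:5000"
NNODES = 4
TP_SIZE = 32


def build_launch_cmd(node_rank: int) -> str:
    """Hypothetical helper: build the sglang launch command for one node."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--tp", str(TP_SIZE),
        "--dist-init-addr", DIST_INIT_ADDR,
        "--nnodes", str(NNODES),
        "--node-rank", str(node_rank),
        "--trust-remote-code",
        "--enable-torch-compile",
    ]
    # Only node 0 exposes the HTTP endpoint in this setup.
    if node_rank == 0:
        args += ["--host", "0.0.0.0", "--port", "30000"]
    return shlex.join(args)


if __name__ == "__main__":
    for rank in range(NNODES):
        print(f"node {rank}: {build_launch_cmd(rank)}")
```

Dropping "--enable-torch-compile" from the list gives the configuration that runs successfully, which makes it easy to compare the two cases.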
I start the benchmark with the command below.
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 500 --random-input 1024 --random-output 1024 --host 10.13.10.10 --port 30000 --model /cephfs/public_model/DeepSeek-R1-BF16
After a while (the server hangs and prints no output), all nodes hit the error below.
[2025-02-21 16:52:22 TP4] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP5] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP7] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP1] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP2] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP0] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP3] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-21 16:52:22 TP6] Watchdog timeout (self.watchdog_timeout=300)
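To check which TP ranks reported the timeout when comparing logs across nodes, a small Python sketch like the one below may help. It assumes the log lines have exactly the format shown above; the function name timed_out_ranks is hypothetical and not part of sglang.

```python
import re

# Matches lines such as:
# [2025-02-21 16:52:22 TP4] Watchdog timeout (self.watchdog_timeout=300)
WATCHDOG_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) TP(?P<rank>\d+)\] "
    r"Watchdog timeout \(self\.watchdog_timeout=(?P<timeout>\d+)\)"
)


def timed_out_ranks(log_text: str) -> list[int]:
    """Hypothetical helper: return the sorted, de-duplicated TP ranks
    that logged a watchdog timeout in the given log text."""
    return sorted({int(m.group("rank")) for m in WATCHDOG_RE.finditer(log_text)})


if __name__ == "__main__":
    sample = (
        "[2025-02-21 16:52:22 TP4] Watchdog timeout (self.watchdog_timeout=300)\n"
        "[2025-02-21 16:52:22 TP5] Watchdog timeout (self.watchdog_timeout=300)\n"
        "[2025-02-21 16:52:22 TP0] Watchdog timeout (self.watchdog_timeout=300)\n"
    )
    print(timed_out_ranks(sample))
```

Because the pattern is applied with finditer, it also works when several log entries end up on one line.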
Environment
sglang version: sglang_0.4.2.post4