
[Bug] DeepSeek R1 serve crash occasionally on 2*H100

Open zui-jiang opened this issue 10 months ago • 17 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

When serving DeepSeek-R1 (dpsk-r1) across two H100 nodes over InfiniBand, the server crashes unexpectedly and intermittently; the crash reproduces with some probability in tests at various request concurrencies.

Reproduction

env installation

Following the installation guide at https://docs.sglang.ai/start/install.html:

pip install --upgrade pip
conda install -c nvidia/label/cuda-12.4.0 cuda=12.4.0
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.2.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/

package information

aiohappyeyeballs==2.4.4
aiohttp==3.11.12
aiosignal==1.3.2
annotated-types==0.7.0
anthropic==0.45.2
anyio==4.8.0
asttokens==3.0.0
async-timeout==5.0.1
attrs==25.1.0
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
compressed-tensors==0.8.0
cuda-bindings==12.8.0
cuda-python==12.8.0
datasets==3.2.0
decorator==5.1.1
decord==0.6.0
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
einops==0.8.0
exceptiongroup==1.2.2
executing==2.2.0
fastapi==0.115.8
filelock==3.17.0
flashinfer-python==0.2.0.post2+cu124torch2.5
frozenlist==1.5.0
fsspec==2024.9.0
gguf==0.10.0
h11==0.14.0
hf_transfer==0.1.9
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.28.1
idna==3.10
importlib_metadata==8.6.1
iniconfig==2.0.0
interegular==0.3.3
ipython==8.32.0
jedi==0.19.2
Jinja2==3.1.5
jiter==0.8.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lark==1.2.2
litellm==1.60.6
llvmlite==0.44.0
lm-format-enforcer==0.10.9
MarkupSafe==3.0.2
matplotlib-inline==0.1.7
mistral_common==1.5.2
modelscope==1.22.3
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.4.2
numba==0.61.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
openai==1.61.1
opencv-python-headless==4.11.0.86
orjson==3.10.15
outlines==0.0.46
packaging==24.2
pandas==2.2.3
parso==0.8.4
partial-json-parser==0.2.1.1.post5
pexpect==4.9.0
pillow==10.4.0
pluggy==1.5.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
prompt_toolkit==3.0.50
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
ptyprocess==0.7.0
pure_eval==0.2.3
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==19.0.0
pybind11==2.13.6
pycountry==24.6.1
pydantic==2.10.6
pydantic_core==2.27.2
Pygments==2.19.1
pytest==8.3.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.1
PyYAML==6.0.2
pyzmq==26.2.1
ray==2.42.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rpds-py==0.22.3
safetensors==0.5.2
sentencepiece==0.2.0
setproctitle==1.3.4
sgl-kernel==0.0.3.post1
sglang==0.4.2.post2
six==1.17.0
sniffio==1.3.1
stack-data==0.6.3
starlette==0.45.3
sympy==1.13.1
tiktoken==0.7.0
tokenizers==0.21.0
tomli==2.2.1
torch==2.5.1
torchao==0.8.0
torchvision==0.20.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.48.2
triton==3.1.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
vllm==0.6.4.post1
watchfiles==1.0.4
wcwidth==0.2.13
websockets==14.2
xformers==0.0.28.post3
xgrammar==0.1.11
xxhash==3.5.0
yarl==1.18.3
zipp==3.21.0

serve setup

# node 1
python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 --dist-init-addr ${IB_IP}:5000 --nnodes 2 --node-rank 0 --trust-remote-code --served-model-name dpsk-r1 --host 0.0.0.0

# node 2
python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 --dist-init-addr ${IB_IP}:5000 --nnodes 2 --node-rank 1 --trust-remote-code --served-model-name dpsk-r1 --host 0.0.0.0
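
Once both ranks are up, the deployment can be smoke-tested with a single request (a hypothetical check; port 30000 is assumed because --port is left at the sglang default):

curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dpsk-r1", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'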

benchmark

(screenshot of benchmark results omitted)
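
The load can be approximated with a simple concurrent request loop (a hypothetical driver standing in for the benchmark shown above; it sustains a parallel degree of 32, the lower end of the range used in testing):

# keep 32 requests in flight against the server until interrupted
for i in $(seq 1 32); do
  while true; do
    curl -s http://127.0.0.1:30000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "dpsk-r1", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 256}' \
      > /dev/null
  done &
done
wait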

Error Info

on node 1

[2025-02-09 10:14:05 TP0] Decode batch. #running-req: 2, #token: 3698, token usage: 0.01, gen throughput (token/s): 50.46, #queue-req: 0
[2025-02-09 10:14:06 TP0] Decode batch. #running-req: 2, #token: 3778, token usage: 0.01, gen throughput (token/s): 50.47, #queue-req: 0
[2025-02-09 10:14:08 TP0] Decode batch. #running-req: 2, #token: 3858, token usage: 0.01, gen throughput (token/s): 50.41, #queue-req: 0
[2025-02-09 10:14:09 TP0] Decode batch. #running-req: 2, #token: 3938, token usage: 0.01, gen throughput (token/s): 50.34, #queue-req: 0
[2025-02-09 10:14:11] INFO:     10.39.13.144:48590 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-02-09 10:14:11 TP0] Decode batch. #running-req: 1, #token: 2162, token usage: 0.01, gen throughput (token/s): 46.21, #queue-req: 0
[2025-02-09 10:14:12] INFO:     10.39.13.144:48606 - "POST /v1/chat/completions HTTP/1.1" 200 OK

[2025-02-09 11:06:40 TP0] Prefill batch. #new-seq: 2, #new-token: 632, #cached-token: 126, cache hit rate: 80.75%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-09 11:06:41 TP0] Prefill batch. #new-seq: 3, #new-token: 170, #cached-token: 837, cache hit rate: 80.76%, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-02-09 11:06:41 TP0] Prefill batch. #new-seq: 2, #new-token: 359, #cached-token: 558, cache hit rate: 80.68%, token usage: 0.00, #running-req: 5, #queue-req: 0
[2025-02-09 11:06:42 TP0] Decode batch. #running-req: 7, #token: 1029, token usage: 0.00, gen throughput (token/s): 0.02, #queue-req: 0
[2025-02-09 11:06:42 TP0] Prefill batch. #new-seq: 1, #new-token: 76, #cached-token: 279, cache hit rate: 80.67%, token usage: 0.00, #running-req: 7, #queue-req: 0
[2025-02-09 11:06:42 TP0] Prefill batch. #new-seq: 1, #new-token: 15, #cached-token: 279, cache hit rate: 80.69%, token usage: 0.00, #running-req: 8, #queue-req: 0
[2025-02-09 11:06:42 TP0] Prefill batch. #new-seq: 4, #new-token: 128, #cached-token: 1116, cache hit rate: 80.74%, token usage: 0.00, #running-req: 9, #queue-req: 0
[2025-02-09 11:06:43 TP0] Prefill batch. #new-seq: 18, #new-token: 1652, #cached-token: 5022, cache hit rate: 80.58%, token usage: 0.00, #running-req: 13, #queue-req: 0
[2025-02-09 11:06:43 TP0] Prefill batch. #new-seq: 52, #new-token: 3359, #cached-token: 14512, cache hit rate: 80.63%, token usage: 0.01, #running-req: 31, #queue-req: 2
[2025-02-09 11:13:23 TP4] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP5] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP2] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP6] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP1] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP3] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP7] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:25 TP0] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:28] Received sigquit from a child proces. It usually means the child failed.
[1]    25183 killed     python -m sglang.launch_server --model-path  --tp 16 --dist-init-addr   2  0 

on node 2

[2025-02-09 07:00:58 TP12] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /cpfs/user/liuyanjiang/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_L20Z,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-02-09 07:00:58 TP8] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /cpfs/user/liuyanjiang/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_L20Z,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-02-09 07:00:58 TP9] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /cpfs/user/liuyanjiang/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_L20Z,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-02-09 11:13:23 TP9] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:23 TP11] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:23 TP15] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP8] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP12] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP13] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP14] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:25 TP10] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:28] Received sigquit from a child proces. It usually means the child failed.
[1]    21769 killed     python -m sglang.launch_server --model-path  --tp 16 --dist-init-addr   2  1 

system monitor

Both nodes' GPU memory usage is around 75%; it is not exhausted even at the moment of the crash.

node 1 gpu utilization

(screenshot of node 1 GPU utilization omitted)

node 2 gpu utilization

(screenshot of node 2 GPU utilization omitted)

occurrence frequency

The crash occurs every 0.5-4 hours, at any request parallel degree from 32 to 128; a debugging launch sketch follows.
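
To gather more signal before the next hang, one option is to lengthen the watchdog and turn on NCCL logging (a sketch, not yet validated; --watchdog-timeout is inferred from the self.watchdog_timeout=300 printed in the logs, and NCCL_DEBUG is a standard NCCL environment variable):

# node 1 shown; repeat on node 2 with --node-rank 1
export NCCL_DEBUG=INFO   # verbose NCCL init/transport logs on stderr
python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 \
  --dist-init-addr ${IB_IP}:5000 --nnodes 2 --node-rank 0 \
  --trust-remote-code --served-model-name dpsk-r1 --host 0.0.0.0 \
  --watchdog-timeout 600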

Environment

Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA L20Z
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.5, V12.5.82
CUDA Driver Version: 550.127.08
PyTorch: 2.5.1+cu124
sglang: 0.4.2.post2
flashinfer: 0.2.0.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.61.1
anthropic: 0.45.2
decord: 0.6.0

NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-127         0-1            N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-127         0-1            N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-127         0-1            N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-127         0-1            N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-127         0-1            N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-127         0-1            N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-127         0-1            N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-127         0-1            N/A
NIC0  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   X     PHB   PHB   PHB   PHB   PHB   PHB   PHB
NIC1  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   X     PHB   PHB   PHB   PHB   PHB   PHB
NIC2  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   X     PHB   PHB   PHB   PHB   PHB
NIC3  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   X     PHB   PHB   PHB   PHB
NIC4  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   X     PHB   PHB   PHB
NIC5  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   X     PHB   PHB
NIC6  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   X     PHB
NIC7  PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   PHB   X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7

Hypervisor vendor: KVM
ulimit soft: 102400

zui-jiang · Feb 09 '25 11:02

Additional error log

[2025-02-10 01:56:09 TP0] Prefill batch. #new-seq: 1, #new-token: 184, #cached-token: 1048, cache hit rate: 70.90%, token usage: 0.94, #running-req: 56, #queue-req: 67                                          
[2025-02-10 01:56:10 TP0] Decode batch. #running-req: 57, #token: 290889, token usage: 0.95, gen throughput (token/s): 656.96, #queue-req: 68                                                                    
[2025-02-10 01:56:13 TP0] Decode batch. #running-req: 57, #token: 293169, token usage: 0.95, gen throughput (token/s): 715.76, #queue-req: 68                                                                    
[2025-02-10 01:56:17 TP0] Decode batch. #running-req: 57, #token: 295449, token usage: 0.96, gen throughput (token/s): 719.65, #queue-req: 68                                                                    
[2025-02-10 01:56:20 TP0] Decode batch. #running-req: 57, #token: 297729, token usage: 0.97, gen throughput (token/s): 717.03, #queue-req: 68                                                                    
[2025-02-10 01:56:20] INFO:     10.39.127.44:34146 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                 
[2025-02-10 01:56:23 TP0] Decode batch. #running-req: 56, #token: 298298, token usage: 0.97, gen throughput (token/s): 705.06, #queue-req: 70                                                                    
[2025-02-10 01:56:23] INFO:     10.39.127.44:33672 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                 
[2025-02-10 01:56:26 TP0] Decode batch. #running-req: 55, #token: 293276, token usage: 0.95, gen throughput (token/s): 694.74, #queue-req: 70                                                                    
[2025-02-10 01:56:29 TP0] Decode batch. #running-req: 55, #token: 295476, token usage: 0.96, gen throughput (token/s): 695.83, #queue-req: 70                                                                    
[2025-02-10 01:56:32 TP0] Prefill batch. #new-seq: 37, #new-token: 7244, #cached-token: 13389, cache hit rate: 70.89%, token usage: 0.45, #running-req: 33, #queue-req: 55                                       
[rank6]:[E210 02:06:33.301528494 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=
600000) ran for 600002 milliseconds before timing out.                                                                                                                                                           
[rank1]:[E210 02:06:33.302165623 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=
600000) ran for 600003 milliseconds before timing out.                                                                                                                                                           
[rank6]:[E210 02:06:33.302379661 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last compl
eted NCCL work: 1522788.                                                                                                                                                                                         
[rank6]:[E210 02:06:33.302397811 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 6] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.                       
[rank6]:[E210 02:06:33.302405183 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupt
ed/incomplete data.                                                                                                                                                                                              
[rank6]:[E210 02:06:33.302412604 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.                                                                          
[rank1]:[E210 02:06:33.302687980 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last compl
eted NCCL work: 1522788.                                                                                                                                                                                         
[rank1]:[E210 02:06:33.302703257 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 1] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.                       
[rank1]:[E210 02:06:33.302735851 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupt
ed/incomplete data.                                                                                                                                                                                              
[rank1]:[E210 02:06:33.302739874 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.                                                                          
[rank4]:[E210 02:06:33.321131395 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=
600000) ran for 600022 milliseconds before timing out.                                                                                                                                                           
[rank4]:[E210 02:06:33.321659390 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last compl
eted NCCL work: 1522788.                                                                                                                                                                                         
[rank4]:[E210 02:06:33.321676720 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 4] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.                       
[rank4]:[E210 02:06:33.321682548 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupt
ed/incomplete data.                                                                                                                                                                                              
[rank4]:[E210 02:06:33.321688076 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.                                                                          
[rank7]:[E210 02:06:33.322299140 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=
600000) ran for 600023 milliseconds before timing out.                  
[rank7]:[E210 02:06:33.322760131 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last compl
eted NCCL work: 1522788.
[rank7]:[E210 02:06:33.322773854 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 7] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank7]:[E210 02:06:33.322779871 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupt
ed/incomplete data.
[rank7]:[E210 02:06:33.322786241 ProcessGroupNCCL.cpp:636] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E210 02:06:33.326486590 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=
600000) ran for 600027 milliseconds before timing out.
[rank5]:[E210 02:06:33.326991748 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last compl
eted NCCL work: 1522788.
[rank5]:[E210 02:06:33.327006446 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 5] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank5]:[E210 02:06:33.327014717 ProcessGroupNCCL.cpp:630] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupt
ed/incomplete data.
[rank5]:[E210 02:06:33.327019900 ProcessGroupNCCL.cpp:636] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E210 02:06:33.328206009 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=
600000) ran for 600029 milliseconds before timing out.
[rank3]:[E210 02:06:33.328687681 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last compl
eted NCCL work: 1522788.
[rank3]:[E210 02:06:33.328703419 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 3] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank3]:[E210 02:06:33.328709319 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupt
ed/incomplete data.
[rank3]:[E210 02:06:33.328713742 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E210 02:06:33.347825685 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(S
eqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7efff4b6c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7effaa1cc772 in /home/miniconda3/envs/sglang/lib/python
3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7effaa1d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                            
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7effaa1d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                           
frame #4: <unknown function> + 0x145c0 (0x7efff6e015c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)                                                      
frame #5: <unknown function> + 0x94ac3 (0x7f0099294ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f0099326850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E210 02:06:33.347835395 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(S
eqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f55772b9446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f552cdcc772 in /home/miniconda3/envs/sglang/lib/python
3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f552cdd3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f552cdd561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f55798545c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f561bc94ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f561bd26850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E210 02:06:33.347824908 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(S
eqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f29a66b9446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f295c1cc772 in /home/miniconda3/envs/sglang/lib/python
3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f295c1d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f295c1d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f29a8ce05c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f2a4b094ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f2a4b126850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank7]:[E210 02:06:33.347826419 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(S
eqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600023 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f477676c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f472c1cc772 in /home/miniconda3/envs/sglang/lib/python
3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f472c1d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f472c1d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f4778dad5c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f481b094ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f481b126850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank4]:[E210 02:06:33.347834904 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(S
eqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4fb1f6c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f4f675cc772 in /home/miniconda3/envs/sglang/lib/python
3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4f675d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4f675d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f4fb42115c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f5056694ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f5056726850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[E210 02:06:33.347836420 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(S
eqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd75d96c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fd7133cc772 in /home/miniconda3/envs/sglang/lib/python
3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fd7133d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd7133d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fd75ffa65c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7fd802294ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7fd802326850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted
(each of the eight TP ranks aborts and dumps faulthandler stacks; the per-process output is interleaved on stderr, so only the distinct recoverable per-thread stacks are shown, deinterleaved, below)

Thread (watchdog_thread, most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Thread (overlap TP worker forward thread, most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 320 in wait
  File "/home/miniconda3/envs/sglang/lib/python3.10/queue.py", line 171 in get
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 121 in forward_thread_func_
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Thread (tqdm monitor, most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 324 in wait
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 607 in wait
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Thread (torch inductor compile-worker reader, most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Main thread (blocked on the timed-out collective, most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/cuda/streams.py", line 225 in synchronize
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 166 in resolve_batch_result
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1147 in process_batch_result_prefill
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1119 in process_batch_result
  ...
  File "<string>", line 1 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, ..., cuda_utils, __triton_launcher (total: 119)
[rank0]:[E210 02:06:33.391752834 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600093 milliseconds before timing out.
[rank0]:[E210 02:06:33.392284994 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank0]:[E210 02:06:33.392326245 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 0] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank0]:[E210 02:06:33.392339061 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E210 02:06:33.392344903 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E210 02:06:33.392948176 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
[rank2]:[E210 02:06:33.393426230 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank2]:[E210 02:06:33.393443515 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 2] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank2]:[E210 02:06:33.393449435 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E210 02:06:33.393455189 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E210 02:06:33.393862144 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff38bd6c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ff3417cc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff3417d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff3417d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7ff38e36d5c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7ff430694ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7ff430726850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted
Fatal Python error: Aborted

Thread 0x00007fde85ff7640 (most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fde867f8640 (most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 320 in wait
  File "/home/miniconda3/envs/sglang/lib/python3.10/queue.py", line 171 in get
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 121 in forward_thread_func_
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007feea21d1640 (most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 324 in wait
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 607 in wait
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007feea19d0640 (most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 324 in wait
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 607 in wait
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007ff090aff640 (most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007ff4309ba740 (most recent call first):
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/cuda/streams.py", line 225 in synchronize
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 166 in resolve_batch_result
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1147 in process_batch_result_prefill
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1119 in process_batch_result
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 519 in event_loop_overlap
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1796 in run_scheduler_process
  File "/home/miniconda3/envs/sglang/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/home/miniconda3/envs/sglang/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/home/miniconda3/envs/sglang/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>
[rank2]:[E210 02:06:33.394775308 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff1b616c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ff16b7cc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff16b7d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff16b7d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7ff1b843f5c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7ff25a894ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7ff25a926850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

zui-jiang avatar Feb 10 '25 02:02 zui-jiang

I encountered the same issue on Deepseek-V3 2*8 H20. Is it fixed in sglang==0.4.2.post4? @zhyncs

XiaotianWang0918 avatar Feb 11 '25 02:02 XiaotianWang0918

> I encountered the same issue on Deepseek-V3 2*8 H20. Is it fixed in sglang==0.4.2.post4? @zhyncs

I have tried sglang==0.4.2.post4, but it didn't help.

LJL36 avatar Feb 11 '25 12:02 LJL36

Could someone provide the benchmarking command (client-side) used to trigger the crash?

nvcastet avatar Feb 13 '25 19:02 nvcastet
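
No client-side command was posted in the thread. As a reference point, a load test along these lines would send many concurrent requests to the server (a sketch assuming sglang's bundled bench_serving script; the prompt count is illustrative and the host/port are placeholders matching the launch commands later in the thread):

# sketch: client-side load test against a running server;
# --num-prompts is illustrative, not taken from the original reports
python3 -m sglang.bench_serving --backend sglang --host 0.0.0.0 --port 40000 --num-prompts 512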

@zui-jiang Re-running with TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_SHOW_CPP_STACKTRACES=1 NCCL_DEBUG=INFO set on the server side would give us a bit more info too.

nvcastet avatar Feb 13 '25 20:02 nvcastet
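
Concretely, those variables can be prefixed to the launch command, e.g. (a sketch; the model path and flags are placeholders, not taken from the original report):

# sketch: relaunch the server with distributed/NCCL debugging enabled,
# capturing all output to a log file for later inspection
TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_SHOW_CPP_STACKTRACES=1 NCCL_DEBUG=INFO \
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-R1 \
  --tp 16 --trust-remote-code --host 0.0.0.0 --port 40000 \
  2>&1 | tee -a RUN_debug.log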

+1, I hit the same issue with sglang==0.4.3 serving DeepSeek-R1 on 2*8 H200.

node 1: python3 -m sglang.launch_server \
    --model-path /data/deepseek-ai/DeepSeek-R1 \
    --tp 16 \
    --dist-init-addr <head_ip>:20000 \
    --nnodes 2 \
    --node-rank 0 \
    --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile \
    --torch-compile-max-bs 8 \
    --host 0.0.0.0 \
    --port 40000 2>&1 | tee -a RUN_1.log

node 2: python3 -m sglang.launch_server \
    --model-path /data/deepseek-ai/DeepSeek-R1 \
    --tp 16 \
    --dist-init-addr <head_ip>:20000 \
    --nnodes 2 \
    --node-rank 1 \
    --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile \
    --torch-compile-max-bs 8 \
    --host 0.0.0.0 \
    --port 40000 2>&1 | tee -a RUN_1.log

isky-cd avatar Feb 14 '25 07:02 isky-cd

+1

githisw avatar Feb 15 '25 02:02 githisw

Watchdog timeout (self.watchdog_timeout=300) +1

verigle avatar Feb 15 '25 08:02 verigle

+1, 8x H20, DeepSeek-R1 FP8.

(screenshot attached)

Yimi81 avatar Feb 16 '25 14:02 Yimi81

@zui-jiang I hit the same issue. From your logs, the longest input context before the 300 s watchdog timeout was around 13k tokens. After I increased the watchdog timeout to 600 s, it still times out once the input context goes over 20k.

The context lengths used in benchmarking are usually fairly short (around 500 tokens), so those runs don't time out.

Apart from increasing the timeout, are there any other solutions?

(screenshot attached)

CyrusCY avatar Feb 18 '25 13:02 CyrusCY
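
For reference, the watchdog interval mentioned here is a server launch flag, and it is distinct from PyTorch's NCCL collective timeout (the Timeout(ms)=600000 in the logs above). A sketch of raising only the sglang watchdog (other flags are placeholders):

# sketch: raise sglang's scheduler watchdog (seconds); this does NOT
# change the separate 600 s ProcessGroupNCCL collective timeout
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-R1 \
  --tp 16 --trust-remote-code --watchdog-timeout 600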

#3280 would be the same issue

CyrusCY avatar Feb 18 '25 13:02 CyrusCY

Watchdog timeout (self.watchdog_timeout=300) +1. Any progress on this?

Chandler-Bing avatar Feb 19 '25 02:02 Chandler-Bing

I met the same issue. "--disable-cuda-graph" works for me. However, adding this option greatly slows down inference in low-QPS settings.

tanconghui avatar Feb 19 '25 03:02 tanconghui
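
For anyone trying this workaround, the flag is passed at server launch, e.g. (a sketch; other flags are placeholders):

# sketch: launch with CUDA graphs disabled, per the workaround above;
# expect lower single-request decode speed
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-R1 \
  --tp 16 --trust-remote-code --disable-cuda-graph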

--disable-cuda-graph works, but it's super slow: roughly half the speed, from 40+ tokens/s down to 10~20 tokens/s.

Also, even if I set --watchdog-timeout 36000, the NCCL timeout still fires at 600 s.

CyrusCY avatar Feb 19 '25 03:02 CyrusCY
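
This is consistent with the logs above: --watchdog-timeout only controls sglang's own scheduler watchdog (the self.watchdog_timeout=300 message), while the 600 s limit comes from PyTorch's ProcessGroupNCCL collective timeout (Timeout(ms)=600000), which that flag does not touch. One quick way to tell which of the two fired (assuming the tee'd log file from the launch commands above):

# sketch: distinguish sglang's watchdog messages from the NCCL
# collective timeout in the captured server log
grep -E "Watchdog caught collective operation timeout|watchdog_timeout" RUN_1.log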

The timeout error appears because inference is already stuck or has crashed, so setting a longer timeout doesn't help here.

I did some debugging and found the exact line of code where inference gets stuck, but I don't understand why. I tried different combinations of launch arguments; --disable-cuda-graph is the only one that works for me.

It does slow down the speed of a single request. However, if you have a large batch of requests to process in parallel, adding this option doesn't affect overall throughput.

tanconghui avatar Feb 19 '25 03:02 tanconghui
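
That observation can be sanity-checked by running the same batched benchmark against the server launched with and without the flag and comparing the reported throughput (a sketch assuming sglang's bundled bench_serving script; the prompt count is illustrative):

# sketch: A/B throughput check; relaunch the server with and without
# --disable-cuda-graph between runs and compare the reported numbers
python3 -m sglang.bench_serving --backend sglang --port 40000 --num-prompts 256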

+1

piamo avatar Feb 20 '25 04:02 piamo

#3709 should fix this bug. Let me know if you still encounter the issue after applying it.

nvcastet avatar Feb 20 '25 14:02 nvcastet