[Bug] DeepSeek R1 server crashes occasionally on 2*H100
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
When deploying dpsk-r1 with InfiniBand (IB) communication across two H100 nodes, the server crashes unexpectedly. The crash occurs with some probability in tests with different numbers of concurrent requests.
Reproduction
env installation
Following the installation guide at https://docs.sglang.ai/start/install.html:
pip install --upgrade pip
conda install -c nvidia/label/cuda-12.4.0 cuda=12.4.0
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.2.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
package information
aiohappyeyeballs==2.4.4
aiohttp==3.11.12
aiosignal==1.3.2
annotated-types==0.7.0
anthropic==0.45.2
anyio==4.8.0
asttokens==3.0.0
async-timeout==5.0.1
attrs==25.1.0
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
compressed-tensors==0.8.0
cuda-bindings==12.8.0
cuda-python==12.8.0
datasets==3.2.0
decorator==5.1.1
decord==0.6.0
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
einops==0.8.0
exceptiongroup==1.2.2
executing==2.2.0
fastapi==0.115.8
filelock==3.17.0
flashinfer-python==0.2.0.post2+cu124torch2.5
frozenlist==1.5.0
fsspec==2024.9.0
gguf==0.10.0
h11==0.14.0
hf_transfer==0.1.9
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.28.1
idna==3.10
importlib_metadata==8.6.1
iniconfig==2.0.0
interegular==0.3.3
ipython==8.32.0
jedi==0.19.2
Jinja2==3.1.5
jiter==0.8.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lark==1.2.2
litellm==1.60.6
llvmlite==0.44.0
lm-format-enforcer==0.10.9
MarkupSafe==3.0.2
matplotlib-inline==0.1.7
mistral_common==1.5.2
modelscope==1.22.3
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.4.2
numba==0.61.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
openai==1.61.1
opencv-python-headless==4.11.0.86
orjson==3.10.15
outlines==0.0.46
packaging==24.2
pandas==2.2.3
parso==0.8.4
partial-json-parser==0.2.1.1.post5
pexpect==4.9.0
pillow==10.4.0
pluggy==1.5.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
prompt_toolkit==3.0.50
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
ptyprocess==0.7.0
pure_eval==0.2.3
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==19.0.0
pybind11==2.13.6
pycountry==24.6.1
pydantic==2.10.6
pydantic_core==2.27.2
Pygments==2.19.1
pytest==8.3.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.1
PyYAML==6.0.2
pyzmq==26.2.1
ray==2.42.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rpds-py==0.22.3
safetensors==0.5.2
sentencepiece==0.2.0
setproctitle==1.3.4
sgl-kernel==0.0.3.post1
sglang==0.4.2.post2
six==1.17.0
sniffio==1.3.1
stack-data==0.6.3
starlette==0.45.3
sympy==1.13.1
tiktoken==0.7.0
tokenizers==0.21.0
tomli==2.2.1
torch==2.5.1
torchao==0.8.0
torchvision==0.20.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.48.2
triton==3.1.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
vllm==0.6.4.post1
watchfiles==1.0.4
wcwidth==0.2.13
websockets==14.2
xformers==0.0.28.post3
xgrammar==0.1.11
xxhash==3.5.0
yarl==1.18.3
zipp==3.21.0
serve setup
# node 1
python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 --dist-init-addr ${IB_IP}:5000 --nnodes 2 --node-rank 0 --trust-remote-code --served-model-name dpsk-r1 --host 0.0.0.0
# node 2
python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 --dist-init-addr ${IB_IP}:5000 --nnodes 2 --node-rank 1 --trust-remote-code --served-model-name dpsk-r1 --host 0.0.0.0
benchmark
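The exact benchmark driver was not captured here; a typical way to generate comparable load against the server above is sglang's bundled serving benchmark. All parameter values below (host, port, input/output lengths, prompt count, concurrency) are illustrative assumptions, with the concurrency chosen from the 32-128 parallelism range under which the crash was observed:

```shell
# Hypothetical load generator: sglang's bundled serving benchmark.
# Adjust --host/--port to match the deployment above; --max-concurrency
# values of 32-128 correspond to the parallelism range where crashes occur.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 512 \
  --num-prompts 1000 \
  --max-concurrency 64
```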
Error Info
on node 1
[2025-02-09 10:14:05 TP0] Decode batch. #running-req: 2, #token: 3698, token usage: 0.01, gen throughput (token/s): 50.46, #queue-req: 0
[2025-02-09 10:14:06 TP0] Decode batch. #running-req: 2, #token: 3778, token usage: 0.01, gen throughput (token/s): 50.47, #queue-req: 0
[2025-02-09 10:14:08 TP0] Decode batch. #running-req: 2, #token: 3858, token usage: 0.01, gen throughput (token/s): 50.41, #queue-req: 0
[2025-02-09 10:14:09 TP0] Decode batch. #running-req: 2, #token: 3938, token usage: 0.01, gen throughput (token/s): 50.34, #queue-req: 0
[2025-02-09 10:14:11] INFO: 10.39.13.144:48590 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-02-09 10:14:11 TP0] Decode batch. #running-req: 1, #token: 2162, token usage: 0.01, gen throughput (token/s): 46.21, #queue-req: 0
[2025-02-09 10:14:12] INFO: 10.39.13.144:48606 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-02-09 11:06:40 TP0] Prefill batch. #new-seq: 2, #new-token: 632, #cached-token: 126, cache hit rate: 80.75%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-09 11:06:41 TP0] Prefill batch. #new-seq: 3, #new-token: 170, #cached-token: 837, cache hit rate: 80.76%, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-02-09 11:06:41 TP0] Prefill batch. #new-seq: 2, #new-token: 359, #cached-token: 558, cache hit rate: 80.68%, token usage: 0.00, #running-req: 5, #queue-req: 0
[2025-02-09 11:06:42 TP0] Decode batch. #running-req: 7, #token: 1029, token usage: 0.00, gen throughput (token/s): 0.02, #queue-req: 0
[2025-02-09 11:06:42 TP0] Prefill batch. #new-seq: 1, #new-token: 76, #cached-token: 279, cache hit rate: 80.67%, token usage: 0.00, #running-req: 7, #queue-req: 0
[2025-02-09 11:06:42 TP0] Prefill batch. #new-seq: 1, #new-token: 15, #cached-token: 279, cache hit rate: 80.69%, token usage: 0.00, #running-req: 8, #queue-req: 0
[2025-02-09 11:06:42 TP0] Prefill batch. #new-seq: 4, #new-token: 128, #cached-token: 1116, cache hit rate: 80.74%, token usage: 0.00, #running-req: 9, #queue-req: 0
[2025-02-09 11:06:43 TP0] Prefill batch. #new-seq: 18, #new-token: 1652, #cached-token: 5022, cache hit rate: 80.58%, token usage: 0.00, #running-req: 13, #queue-req: 0
[2025-02-09 11:06:43 TP0] Prefill batch. #new-seq: 52, #new-token: 3359, #cached-token: 14512, cache hit rate: 80.63%, token usage: 0.01, #running-req: 31, #queue-req: 2
[2025-02-09 11:13:23 TP4] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP5] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP2] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP6] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP1] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP3] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP7] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:25 TP0] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:28] Received sigquit from a child proces. It usually means the child failed.
[1] 25183 killed python -m sglang.launch_server --model-path --tp 16 --dist-init-addr 2 0
on node 2
[2025-02-09 07:00:58 TP12] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /cpfs/user/liuyanjiang/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_L20Z,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-02-09 07:00:58 TP8] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /cpfs/user/liuyanjiang/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_L20Z,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-02-09 07:00:58 TP9] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /cpfs/user/liuyanjiang/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_L20Z,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-02-09 11:13:23 TP9] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:23 TP11] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:23 TP15] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP8] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP12] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP13] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:24 TP14] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:25 TP10] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-09 11:13:28] Received sigquit from a child proces. It usually means the child failed.
[1] 21769 killed python -m sglang.launch_server --model-path --tp 16 --dist-init-addr 2 1
system monitor
Both nodes' GPU memory usage stays around 75%, even at the moment of the crash.
node 1 GPU utilization (screenshot)
node 2 GPU utilization (screenshot)
occurrence frequency
A crash occurs every 0.5-4 hours, at any request parallelism from 32 to 128.
Environment
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA L20Z
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.5, V12.5.82
CUDA Driver Version: 550.127.08
PyTorch: 2.5.1+cu124
sglang: 0.4.2.post2
flashinfer: 0.2.0.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.61.1
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
      GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0  X    NV18 NV18 NV18 NV18 NV18 NV18 NV18 PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  0-127 0-1 N/A
GPU1  NV18 X    NV18 NV18 NV18 NV18 NV18 NV18 PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  0-127 0-1 N/A
GPU2  NV18 NV18 X    NV18 NV18 NV18 NV18 NV18 PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  0-127 0-1 N/A
GPU3  NV18 NV18 NV18 X    NV18 NV18 NV18 NV18 PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  0-127 0-1 N/A
GPU4  NV18 NV18 NV18 NV18 X    NV18 NV18 NV18 PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  0-127 0-1 N/A
GPU5  NV18 NV18 NV18 NV18 NV18 X    NV18 NV18 PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  0-127 0-1 N/A
GPU6  NV18 NV18 NV18 NV18 NV18 NV18 X    NV18 PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  0-127 0-1 N/A
GPU7  NV18 NV18 NV18 NV18 NV18 NV18 NV18 X    PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  0-127 0-1 N/A
NIC0  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  X    PHB  PHB  PHB  PHB  PHB  PHB  PHB
NIC1  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  X    PHB  PHB  PHB  PHB  PHB  PHB
NIC2  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  X    PHB  PHB  PHB  PHB  PHB
NIC3  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  X    PHB  PHB  PHB  PHB
NIC4  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  X    PHB  PHB  PHB
NIC5  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  X    PHB  PHB
NIC6  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  X    PHB
NIC7  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  PHB  X
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
NIC Legend:
  NIC0: mlx5_0 NIC1: mlx5_1 NIC2: mlx5_2 NIC3: mlx5_3 NIC4: mlx5_4 NIC5: mlx5_5 NIC6: mlx5_6 NIC7: mlx5_7
Hypervisor vendor: KVM
ulimit soft: 102400
additional error log
[2025-02-10 01:56:09 TP0] Prefill batch. #new-seq: 1, #new-token: 184, #cached-token: 1048, cache hit rate: 70.90%, token usage: 0.94, #running-req: 56, #queue-req: 67
[2025-02-10 01:56:10 TP0] Decode batch. #running-req: 57, #token: 290889, token usage: 0.95, gen throughput (token/s): 656.96, #queue-req: 68
[2025-02-10 01:56:13 TP0] Decode batch. #running-req: 57, #token: 293169, token usage: 0.95, gen throughput (token/s): 715.76, #queue-req: 68
[2025-02-10 01:56:17 TP0] Decode batch. #running-req: 57, #token: 295449, token usage: 0.96, gen throughput (token/s): 719.65, #queue-req: 68
[2025-02-10 01:56:20 TP0] Decode batch. #running-req: 57, #token: 297729, token usage: 0.97, gen throughput (token/s): 717.03, #queue-req: 68
[2025-02-10 01:56:20] INFO: 10.39.127.44:34146 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-02-10 01:56:23 TP0] Decode batch. #running-req: 56, #token: 298298, token usage: 0.97, gen throughput (token/s): 705.06, #queue-req: 70
[2025-02-10 01:56:23] INFO: 10.39.127.44:33672 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-02-10 01:56:26 TP0] Decode batch. #running-req: 55, #token: 293276, token usage: 0.95, gen throughput (token/s): 694.74, #queue-req: 70
[2025-02-10 01:56:29 TP0] Decode batch. #running-req: 55, #token: 295476, token usage: 0.96, gen throughput (token/s): 695.83, #queue-req: 70
[2025-02-10 01:56:32 TP0] Prefill batch. #new-seq: 37, #new-token: 7244, #cached-token: 13389, cache hit rate: 70.89%, token usage: 0.45, #running-req: 33, #queue-req: 55
[rank6]:[E210 02:06:33.301528494 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
[rank1]:[E210 02:06:33.302165623 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank6]:[E210 02:06:33.302379661 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank6]:[E210 02:06:33.302397811 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 6] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank6]:[E210 02:06:33.302405183 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E210 02:06:33.302412604 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E210 02:06:33.302687980 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank1]:[E210 02:06:33.302703257 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 1] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank1]:[E210 02:06:33.302735851 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E210 02:06:33.302739874 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E210 02:06:33.321131395 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
[rank4]:[E210 02:06:33.321659390 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank4]:[E210 02:06:33.321676720 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 4] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank4]:[E210 02:06:33.321682548 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E210 02:06:33.321688076 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E210 02:06:33.322299140 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600023 milliseconds before timing out.
[rank7]:[E210 02:06:33.322760131 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank7]:[E210 02:06:33.322773854 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 7] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank7]:[E210 02:06:33.322779871 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E210 02:06:33.322786241 ProcessGroupNCCL.cpp:636] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E210 02:06:33.326486590 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
[rank5]:[E210 02:06:33.326991748 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank5]:[E210 02:06:33.327006446 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 5] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank5]:[E210 02:06:33.327014717 ProcessGroupNCCL.cpp:630] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E210 02:06:33.327019900 ProcessGroupNCCL.cpp:636] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E210 02:06:33.328206009 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[rank3]:[E210 02:06:33.328687681 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank3]:[E210 02:06:33.328703419 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 3] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank3]:[E210 02:06:33.328709319 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E210 02:06:33.328713742 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E210 02:06:33.347825685 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7efff4b6c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7effaa1cc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7effaa1d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7effaa1d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7efff6e015c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f0099294ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f0099326850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E210 02:06:33.347835395 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f55772b9446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f552cdcc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f552cdd3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f552cdd561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f55798545c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f561bc94ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f561bd26850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[E210 02:06:33.347824908 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f29a66b9446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f295c1cc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f295c1d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f295c1d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f29a8ce05c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f2a4b094ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f2a4b126850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank7]:[E210 02:06:33.347826419 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600023 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f477676c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f472c1cc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f472c1d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f472c1d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f4778dad5c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f481b094ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f481b126850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank4]:[E210 02:06:33.347834904 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4fb1f6c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f4f675cc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4f675d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4f675d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f4fb42115c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f5056694ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f5056726850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank6]:[E210 02:06:33.347836420 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd75d96c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fd7133cc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fd7133d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd7133d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fd75ffa65c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7fd802294ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7fd802326850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Fatal Python error: Aborted

(The fatal-error output from the aborting TP worker processes was interleaved in the terminal; the recoverable tracebacks, most recent call first, are:)

Thread (scheduler watchdog):
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap

Thread (overlap worker):
  File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 320 in wait
  File "/home/miniconda3/envs/sglang/lib/python3.10/queue.py", line 171 in get
  File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 121 in forward_thread_func_

(log truncated)
", line , line forward_thread_func_"/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py121121 File
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py" in File in "", line forward_thread_func_"/home/miniconda3/envs/sg
lang/lib/python3.10/site-packages/torch/utils/_contextlib.pyforward_thread_func_, line 121
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py"
121 in " File , line " in File forward_thread_func_, line 116/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.pyforward_thread_func_"
116 in "
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py in , line decorate_context File " File decorate_context116
", line "
in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py File 116/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contex
tlib.py File decorate_context"", line /home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py in ""
116"decorate_context, line /home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py in
, line File 116" File 109"decorate_context in , line "109 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py
decorate_context/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py in forward_thread_func"
", line File forward_thread_func
109, line "
File in File 109/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py File "forward_thread_func" in ""/home/miniconda
3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py
/home/miniconda3/envs/sglang/lib/python3.10/threading.pyforward_thread_func, line /home/miniconda3/envs/sglang/lib/python3.10/threading.py" File "
109", line ", line File in , line 109/home/miniconda3/envs/sglang/lib/python3.10/threading.py953"forward_thread_func953 in " in /home/miniconda3/envs/sglang/lib/python3.10/
threading.py
in forward_thread_func, line run"run File
953
, line File
" File in 953" File /home/miniconda3/envs/sglang/lib/python3.10/threading.py"run in /home/miniconda3/envs/sglang/lib/python3.10/threading.py""/home/minicon
da3/envs/sglang/lib/python3.10/threading.py
run"/home/miniconda3/envs/sglang/lib/python3.10/threading.py, line " File
, line "953, line " File 1016, line in 953/home/miniconda3/envs/sglang/lib/python3.10/threading.py" in 1016run in "/home/miniconda3/envs/sglang/lib/python3.10/threading.py_b
ootstrap_inner in run, line "
_bootstrap_inner File
1016, line
File " File in 1016" File /home/miniconda3/envs/sglang/lib/python3.10/threading.py"_bootstrap_inner in /home/miniconda3/envs/sglang/lib/python3.10/threading.py""/cpfs/use
r/liuyanjiang/miniconda3/envs/sglang/lib/python3.10/threading.py
_bootstrap_inner"/home/miniconda3/envs/sglang/lib/python3.10/threading.py, line " File
, line File "1016, line "973", line in 1016/home/miniconda3/envs/sglang/lib/python3.10/threading.py in /home/miniconda3/envs/sglang/lib/python3.10/threading.py973_bootstrap
_inner in "_bootstrap" in
_bootstrap_inner, line
, line _bootstrap
File 973
973
File " in Thread 0x in
"/home/miniconda3/envs/sglang/lib/python3.10/threading.py_bootstrap00007efb19b99640_bootstrapThread 0x/home/miniconda3/envs/sglang/lib/python3.10/threading.py"
(most recent call first):
00007f24cbb99640", line
File
(most recent call first):
, line 973Thread 0x"Thread 0x File 00007f40f4a0a640973 in 00007f429fb99640/home/miniconda3/envs/sglang/lib/python3.10/threading.py" (most recent call first):
in _bootstrap (most recent call first):
"/home/miniconda3/envs/sglang/lib/python3.10/threading.py File _bootstrap
File , line ""
"324, line /home/miniconda3/envs/sglang/lib/python3.10/threading.py
Thread 0x/home/miniconda3/envs/sglang/lib/python3.10/threading.py in 324"Thread 0x00007f4ad2b5f640"wait in , line 00007fd27eb5f640 (most recent call first):
, line
wait324 (most recent call first):
File 324 File
in File " in " File wait"/home/miniconda3/envs/sglang/lib/python3.10/threading.pywait/home/miniconda3/envs/sglang/lib/python3.10/threading.py"
/home/miniconda3/envs/sglang/lib/python3.10/threading.py"
" File , line /home/miniconda3/envs/sglang/lib/python3.10/threading.py File ", line "607"", line 324/home/miniconda3/envs/sglang/lib/python3.10/threading.py in , line /cpfs/
user/liuyanjiang/miniconda3/envs/sglang/lib/python3.10/threading.py324 in "wait607" in wait, line
in , line wait
607 File wait607"
File File in
in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.pywait""wait File "
/home/miniconda3/envs/sglang/lib/python3.10/threading.py/home/miniconda3/envs/sglang/lib/python3.10/threading.py
", line File "" File /home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py60", line , line "", line in /home/miniconda3/envs/sglang/lib/python3.10/sit
e-packages/tqdm/_monitor.py607607/home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py60"run" in in in , line
, line waitwaitrun6060 File in in " File File File runrun/home/miniconda3/envs/sglang/lib/python3.10/threading.py"""
"/home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py/home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py/home/min
iconda3/envs/sglang/lib/python3.10/threading.py File , line " File """, line /home/miniconda3/envs/sglang/lib/python3.10/threading.py1016, line ", line 1016" in 60/home/mini
conda3/envs/sglang/lib/python3.10/threading.py60 in , line _bootstrap_inner in " in _bootstrap_inner1016
run, line run
in File
1016
_bootstrap_inner File " in
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py_bootstrap_inner File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py"
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973"/home/mini
conda3/envs/sglang/lib/python3.10/threading.py", line 973 in /home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap", line 973 in _bootstrap
, line 1016 in _bootstrap_inner
973 in _bootstrap
Thread 0x in _bootstrap_inner
File Thread 0x00007efb177c6640_bootstrap
File " (most recent call first):
00007f24c97c6640
Thread 0x"/home/miniconda3/envs/sglang/lib/python3.10/threading.py File (most recent call first):
/home/miniconda3/envs/sglang/lib/python3.10/threading.py00007f429d7c6640"" File Thread 0x" (most recent call first):
, line /home/miniconda3/envs/sglang/lib/python3.10/threading.py"00007f5099fc3640, line File 973"/home/miniconda3/envs/sglang/lib/python3.10/threading.py (most recent call fi
rst):
973" in , line " File in /home/miniconda3/envs/sglang/lib/python3.10/threading.py_bootstrap324, line "_bootstrap"
in 324/home/miniconda3/envs/sglang/lib/python3.10/threading.py
, line
wait in
" File
324Thread 0xwait00007fd2807c4640, line "Thread 0x in
(most recent call first):
324/home/miniconda3/envs/sglang/lib/python3.10/threading.pywait00007f4ad47c4640 File File in "
(most recent call first):
""wait/home/miniconda3/envs/sglang/lib/python3.10/threading.py, line File File /home/miniconda3/envs/sglang/lib/python3.10/threading.py
"607" in "" File , line /home/miniconda3/envs/sglang/lib/python3.10/threading.pywait/home/miniconda3/envs/sglang/lib/python3.10/threading.py, line "324"
"607/home/miniconda3/envs/sglang/lib/python3.10/threading.py in , line File , line in "wait607"324wait, line
in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py in
607 File wait"wait File in "wait/home/miniconda3/envs/sglang/lib/python3.10/threading.py
, line
"
" File 60" File /home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py File , line in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_m
onitor.py"""607run"/home/miniconda3/envs/sglang/lib/python3.10/threading.py, line /home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py in
, line "60"wait60, line in File , line
in 607run"60 File run in
Thread 0x, line File 973
File
00007f28bc983640" in Thread 0x"Thread 0x (most recent call first):
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py_bootstrap/home/miniconda3/envs/sglang/lib/python3.10/threading.py00007
f468c98364000007f5280f9c640"
File " (most recent call first):
(most recent call first):
, line
, line "47Thread 0x File File 973/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py in 00007fd674ba4640"" in "_recv_msg (most recent ca
ll first):
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_induct
or/compile_worker/subproc_pool.py_bootstrap, line
" File "
"47, line File , line
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py in 47"47Thread 0x"_recv_msg in /home/miniconda3/envs/sglang/lib/python
3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py in 00007f4ec8ea4640_recv_msg
, line _recv_msg" (most recent call first):
47 File
File , line in " File 153" File _recv_msg/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py" in /home/miniconda3/env
s/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py"
"/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py_read_thread"/home/miniconda3/envs/sglang/lib/python3.10/site-packages
/torch/_inductor/compile_worker/subproc_pool.py, line , line File "
"15347", line File in , line in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py153"_read_thread153_recv_msg" in /cpfs/user/liuyanji
ang/miniconda3/envs/sglang/lib/python3.10/threading.py
in
, line _read_thread153
in "_read_thread File File _read_thread, line
""
953/home/miniconda3/envs/sglang/lib/python3.10/threading.py File File /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.
py File in """""run, line /home/miniconda3/envs/sglang/lib/python3.10/threading.py/home/miniconda3/envs/sglang/lib/python3.10/threading.py, line /home/minic
onda3/envs/sglang/lib/python3.10/threading.py
953""153" in File , line , line in , line run"953953_read_thread953
/home/miniconda3/envs/sglang/lib/python3.10/threading.py in in
in " File runrun File run, line "
"
1016/home/miniconda3/envs/sglang/lib/python3.10/threading.py File File /home/miniconda3/envs/sglang/lib/python3.10/threading.py File in """""_bootstrap_inner, line
/home/miniconda3/envs/sglang/lib/python3.10/threading.py/home/miniconda3/envs/sglang/lib/python3.10/threading.py, line /home/miniconda3/envs/sglang/lib/python
3.10/threading.py1016 File ""953" in ", line , line in , line _bootstrap_inner/home/miniconda3/envs/sglang/lib/python3.10/threading.py10161016run1016
" in File in
in , line "_bootstrap_inner_bootstrap_inner File _bootstrap_inner973/home/miniconda3/envs/sglang/lib/python3.10/threading.py
"
in "_bootstrap File File /home/miniconda3/envs/sglang/lib/python3.10/threading.py File , line
""""973
/home/miniconda3/envs/sglang/lib/python3.10/threading.py/home/miniconda3/envs/sglang/lib/python3.10/threading.py, line /home/miniconda3/envs/sglang/lib/python
3.10/threading.py in Thread 0x""1016"_bootstrap, line 00007f0099450740, line in , line
973 (most recent call first):
973_bootstrap_inner973
in in
File in Thread 0x_bootstrap00007f2a4b341740_bootstrap File _bootstrap"
(most recent call first):
"
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/cuda/streams.py
File /home/miniconda3/envs/sglang/lib/python3.10/threading.py
"Thread 0xThread 0x""Thread 0x, line 00007f481b3f674000007fd8025ef740, line 00007f561be9e740/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/cuda/streams.py225 (most recent call
first):
(most recent call first):
973 (most recent call first):
" in in File , line File _bootstrap File synchronize225""
"
in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/cuda/streams.py/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/cuda/streams.py
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/cuda/streams.py File synchronize""Thread 0x", line "
, line 00007f5056861740225, line /home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py File 225 (most recent call first):
in 225synchronize"" in File in
, line /home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.pysynchronizesynchronize"166/home/miniconda3/envs/sglang/lib/pyth
on3.10/site-packages/torch/cuda/streams.py File "
in File "resolve_batch_result", line File ", line
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py166 File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/s
glang/srt/managers/tp_worker_overlap_thread.py225" in "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py" in , line /cpfs/user/liuyanjia
ng/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.pyresolve_batch_result", line synchronize166"
166, line
in , line File in 166 File resolve_batch_result1147" in resolve_batch_result
" in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py
resolve_batch_result/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.pyprocess_batch_result_prefill" File File
"
File ", line ", line "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py File 1147/home/miniconda3/envs/sglang/lib/python3.10/site-p
ackages/sglang/srt/managers/scheduler.py166/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py"" in , line " in "/home/miniconda3/envs/s
glang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.pyprocess_batch_result_prefill1147, line resolve_batch_result, line "
in 1147
1147, line process_batch_result_prefill1119 File in File in
in "process_batch_result_prefill"process_batch_result_prefillprocess_batch_result File /home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py
/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py
File <module>"
<string>", line 1 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, …, cuda_utils, __triton_launcher (total: 119)
[rank0]:[E210 02:06:33.391752834 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600093 milliseconds before timing out.
[rank0]:[E210 02:06:33.392284994 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank0]:[E210 02:06:33.392326245 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 0] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank0]:[E210 02:06:33.392339061 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E210 02:06:33.392344903 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E210 02:06:33.392948176 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
[rank2]:[E210 02:06:33.393426230 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank2]:[E210 02:06:33.393443515 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 2] Timeout at NCCL work: 1522789, last enqueued NCCL work: 1522789, last completed NCCL work: 1522788.
[rank2]:[E210 02:06:33.393449435 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E210 02:06:33.393455189 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E210 02:06:33.393862144 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff38bd6c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ff3417cc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff3417d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff3417d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7ff38e36d5c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7ff430694ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7ff430726850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Fatal Python error: Aborted
Fatal Python error: Aborted
Thread 0x00007fde85ff7640 (most recent call first):
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007fde867f8640 (most recent call first):
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 320 in wait
File "/home/miniconda3/envs/sglang/lib/python3.10/queue.py", line 171 in get
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 121 in forward_thread_func_
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007feea21d1640 (most recent call first):
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 324 in wait
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 607 in wait
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007feea19d0640 (most recent call first):
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 324 in wait
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 607 in wait
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007ff090aff640 (most recent call first):
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 953 in run
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/miniconda3/envs/sglang/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007ff4309ba740 (most recent call first):
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/cuda/streams.py", line 225 in synchronize
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 166 in resolve_batch_result
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1147 in process_batch_result_prefill
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1119 in process_batch_result
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 519 in event_loop_overlap
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/home/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1796 in run_scheduler_process
File "/home/miniconda3/envs/sglang/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/home/miniconda3/envs/sglang/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/home/miniconda3/envs/sglang/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/home/miniconda3/envs/sglang/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>
[rank2]:[E210 02:06:33.394775308 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1522789, OpType=_ALLGATHER_BASE, NumelIn=298960, NumelOut=4783360, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff1b616c446 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ff16b7cc772 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff16b7d3bb3 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff16b7d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff16b7d561d in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7ff1b843f5c0 in /home/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7ff25a894ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7ff25a926850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Fatal Python error: Aborted
I encountered the same issue on Deepseek-V3 2*8 H20. Is it fixed in sglang==0.4.2.post4? @zhyncs
I have tried sglang==0.4.2.post4, but it does not help.
Could someone provide the benchmarking command (client-side) used to trigger the crash?
@zui-jiang Re-running with TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_SHOW_CPP_STACKTRACES=1 NCCL_DEBUG=INFO set on the server side would give us a bit more info too.
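For reference, one way to set those variables is to export them before launching (the actual `sglang.launch_server` invocation is elided here; use whichever launch command you normally run):

```shell
# Debug env vars suggested above; export them before starting the server.
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
export NCCL_DEBUG=INFO
# ...then run your usual `python3 -m sglang.launch_server ...` command.
echo "$TORCH_DISTRIBUTED_DEBUG $TORCH_SHOW_CPP_STACKTRACES $NCCL_DEBUG"
```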
+1, sglang==0.4.3 encountered the same issue on Deepseek-R1 2*8 H200.
node 1:
python3 -m sglang.launch_server \
  --model-path /data/deepseek-ai/DeepSeek-R1 \
  --tp 16 \
  --dist-init-addr <head_ip>:20000 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code \
  --enable-dp-attention \
  --enable-torch-compile \
  --torch-compile-max-bs 8 \
  --host 0.0.0.0 \
  --port 40000 2>&1 | tee -a RUN_1.log
node 2:
python3 -m sglang.launch_server \
  --model-path /data/deepseek-ai/DeepSeek-R1 \
  --tp 16 \
  --dist-init-addr <head_ip>:20000 \
  --nnodes 2 \
  --node-rank 1 \
  --trust-remote-code \
  --enable-dp-attention \
  --enable-torch-compile \
  --torch-compile-max-bs 8 \
  --host 0.0.0.0 \
  --port 40000 2>&1 | tee -a RUN_1.log
+1
Watchdog timeout (self.watchdog_timeout=300) +1
+1, 8x H20, DeepSeek-R1 FP8.
@zui-jiang I met the same issue. From your logs, it seems that before the 300 s watchdog timeout fired, the longest input context was around 13k tokens. I increased the watchdog timeout to 600 s, and it still times out once the input context exceeds 20k tokens.
The context length used in benchmarking is fairly short (500+ tokens), in which case it does not time out.
Apart from increasing the timeout, are there any other solutions?
#3280 would be the same issue
Watchdog timeout (self.watchdog_timeout=300) +1. Any progress on this?
I met the same issue. "--disable-cuda-graph" works for me. However, adding this option greatly slows down inference in low-QPS settings.
--disable-cuda-graph works, but it is super slow: it roughly halves the speed, from 40+ tokens/s to 10~20 tokens/s.
Also, even if I set --watchdog-timeout 36000, NCCL still seems to time out at 600 s.
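That is expected: `--watchdog-timeout` only controls sglang's own scheduler watchdog, while the 600 s in the NCCL log is PyTorch's process-group timeout, set separately when the process group is initialized. A hedged sketch (the `init_process_group` call shown in the comment is illustrative, not sglang's actual initialization code):

```python
from datetime import timedelta

# PyTorch's NCCL process-group timeout is configured where the group is
# created, roughly like this (sketch only, requires a distributed setup):
#
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl",
#                           timeout=timedelta(seconds=1800))
#
# The Timeout(ms)=600000 reported in the log corresponds to 10 minutes:
nccl_pg_timeout = timedelta(minutes=10)
print(int(nccl_pg_timeout.total_seconds() * 1000))
```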
The timeout error occurs because inference is already stuck or has crashed; setting a longer timeout does not help here.
I did some debugging and found the exact line of code where inference gets stuck, but I do not understand why. I tried different combinations of launch arguments, and --disable-cuda-graph is the only one that works for me.
It does slow down a single request. However, if you process a large batch of requests in parallel, adding this option makes little difference to overall throughput.
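For context, the watchdog seen in the trace follows a standard heartbeat pattern, roughly like this minimal sketch (simplified, not sglang's actual code): it only fires after the forward pass has already stopped making progress, which is why raising the timeout merely delays the crash.

```python
import threading
import time

class Watchdog:
    """Fire an 'expired' flag if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self._last_beat = time.monotonic()
        self.expired = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def heartbeat(self):
        # Called by the worker after each completed forward step.
        self._last_beat = time.monotonic()

    def _run(self):
        # Poll until stopped; if progress stalls past the timeout, flag it.
        while not self._stop.wait(self.timeout / 10):
            if time.monotonic() - self._last_beat > self.timeout:
                self.expired.set()
                return

    def stop(self):
        self._stop.set()
        self._thread.join()

wd = Watchdog(timeout=0.5)
wd.heartbeat()
time.sleep(0.1)
stalled_early = wd.expired.is_set()   # recent heartbeat -> not expired
time.sleep(1.0)                       # simulate a stuck forward pass
stalled_late = wd.expired.is_set()    # no heartbeats -> watchdog fired
wd.stop()
print(stalled_early, stalled_late)
```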
+1
#3709 should fix this bug. Let me know if you encounter an issue after applying it.