[Bug] sglang crashes when using --enable-dp-attention to run DeepSeek V3 on 2x8xH100
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
Following the dp-attention performance & usage docs, I enabled it with --enable-dp-attention when launching DeepSeek V3 on 2x8xH100. My command is as below:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
When I run my test scripts, the server crashes:
[2025-02-18 06:32:23 DP7 TP7] Prefill batch. #new-seq: 8, #new-token: 4096, #cached-token: 8, cache hit rate: 0.23%, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-02-18 06:32:23 DP6 TP6] Prefill batch. #new-seq: 8, #new-token: 2265, #cached-token: 8, cache hit rate: 0.38%, token usage: 0.00, #running-req: 2, #queue-req: 0
[rank2]:[E218 06:32:24.828983440 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3545f6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f3545f166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f354633ea18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f34fbe25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f34fbe2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f34fbe31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f34fbe3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f3547b375c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f35489c0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f3548a52850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[2025-02-18 06:32:24 DP2 TP2] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
    self.forward_thread_func_()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
    return self.forward_extend(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 868, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 829, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 781, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
    self.experts(hidden_states=hidden_states, router_logits=router_logits)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
    final_hidden_states = self.quant_method.apply(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
    return fused_experts(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 851, in fused_experts
    torch.ops.sglang.inplace_fused_experts(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 731, in inplace_fused_experts
    fused_experts_impl(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1057, in fused_experts_impl
    torch.sum(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 108, in forward_thread_func
    with torch.get_device_module(self.device).stream(self.forward_stream):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 595, in __exit__
    torch.cuda.set_stream(self.src_prev_stream)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 962, in _patched_set_stream
    prev_set_stream(stream)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 636, in set_stream
    _set_stream_by_id(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
    torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::DistBackendError'
terminate called recursively
Fatal Python error: Aborted

Thread 0x00007f2fe8afc640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f1f035fe640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 113 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner

what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I tried changing the value of --mem-fraction-static, but it didn't help.
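Since the log itself suggests it, a reasonable next step for localizing the fault is to relaunch with synchronous CUDA launches. This is only a sketch: it is identical to the node-0 command in the Reproduction section except for the added CUDA_LAUNCH_BLOCKING=1.

```bash
# Node-0 launch exactly as in the Reproduction section, with CUDA_LAUNCH_BLOCKING=1 added
# so the illegal memory access is reported at the faulting kernel (debug-only: this is slow).
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged \
  -e CUDA_LAUNCH_BLOCKING=1 \
  -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 \
  -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 \
  -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE \
  --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 \
  -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ \
  --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 \
  --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 \
  --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78
```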
Reproduction
On node 1:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
On node 2:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
Environment
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.49.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  NIC9  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0             N/A
GPU1  NV18   X    NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0             N/A
GPU2  NV18  NV18   X    NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   0-47,96-143    0             N/A
GPU3  NV18  NV18  NV18   X    NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   0-47,96-143    0             N/A
GPU4  NV18  NV18  NV18  NV18   X    NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  48-95,144-191  1             N/A
GPU5  NV18  NV18  NV18  NV18  NV18   X    NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  48-95,144-191  1             N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18   X    NV18  SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  48-95,144-191  1             N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18   X    SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   48-95,144-191  1             N/A
NIC0  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS    X    NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE   X    PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC2  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  PIX    X    NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC3  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE   X    NODE  NODE  SYS   SYS   SYS   SYS
NIC4  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE   X    NODE  SYS   SYS   SYS   SYS
NIC5  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE   X    SYS   SYS   SYS   SYS
NIC6  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS    X    NODE  NODE  NODE
NIC7  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   NODE   X    NODE  NODE
NIC8  SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE   X    NODE
NIC9  SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE   X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
ulimit soft: 1048576
Same issue on a single 8*H200 server.
@hariag could you share the commands for 8*H200?
server: ulimit -n 4096000 python3 -m sglang.launch_server --model DeepSeek-R1 --tp 8 --trust-remote-code --port 8000 --watchdog-timeout 3600 --enable-dp-attention
test: ulimit -n 4096000 evalscope perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 2048 --model 'DeepSeek-R1' --api-key EMPTY --number 20480 --api openai --stream --temperature 0.6 --log-every-n-query 1024 --max-tokens 100 --max-prompt-length 100 --read-timeout 600 --connect-timeout 600 --prompt "hello"
By the way, if I remove the --enable-dp-attention option, it works perfectly but is much slower.
Could you try to add --disable-overlap-schedule and test it again?
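For concreteness, that suggestion applied to the single-node 8*H200 launch above would look like this (a sketch; only the last flag is new):

```bash
# The earlier single-node 8*H200 launch, with only the suggested flag appended.
ulimit -n 4096000
python3 -m sglang.launch_server --model DeepSeek-R1 --tp 8 --trust-remote-code \
    --port 8000 --watchdog-timeout 3600 --enable-dp-attention --disable-overlap-schedule
```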
I also have the same issue on a single 8*H200 server. Adding --disable-overlap-schedule does not help.
python3 -m sglang.launch_server --model-path /mnt/model/ --tensor-parallel-size 8 --trust-remote-code --enable-torch-compile --disable-cuda-graph --enable-dp-attention
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name random \
--random-range-ratio 1 \
--num-prompt 300 \
--request-rate 8 \
--random-input 1024 \
--random-output 1024 |tee -a SGLang_${model_name}_${input_len}_${output_len}_rps${i}_${DATETIME}_servering.log
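The tee target above uses shell variables from my benchmark wrapper script; hypothetical definitions like these make the command runnable as pasted:

```bash
# Hypothetical values for the shell variables used in the tee log filename above;
# they come from my benchmark wrapper script and are not sglang options.
model_name=DeepSeek-V3
input_len=1024
output_len=1024
i=8                               # request rate for this run
DATETIME=$(date +%Y%m%d_%H%M%S)
```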
I attached the server side log, please check it. debug.log
I checked the log; it seems to be an issue with sgl_kernels.fp8_blockwise_scaled_mm.
cc: @zhyncs @yizhang2077
I have the same problem.
Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?
I meet "output tensor size must be equal to world_size times input tensor size" error when add --enable-dp-attention option in two 8*H800, and without it everything ok. Is anyone try it on deepseek r1,can give me help
Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?
In my environment, inference succeeds after removing this option. So what's wrong with --enable-dp-attention? It is recommended in the docs: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended
I posted a hot fix for this case in #3727, could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!
I posted a hot fix for this case in #3727, could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!
OK, let me make a hotfix image and test it~
Mark! The same problem with tp 16 on 2x8xH800, source version 3c7bfd7eabed5e29cf907dba3e2ed875d7a92fd4.
I have the same problem without dp attention.
@Lzhang-hub Did you try the latest main branch?
I met another problem when launching a DeepSeek-R1 model server with the arguments --enable-dp-attention --dp-size 16 --tp 16 on 2x8xH100. The rank 1 node threw a Segmentation fault.
[2025-02-21 02:29:22 DP11 TP11] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP14 TP14] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP10 TP10] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP13 TP13] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP15 TP15] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP9 TP9] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP12 TP12] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP8 TP8] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
[worker0:268 :0:36540] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x16)
==== backtrace (tid: 36540) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x00000000000494f4 uploadProxyOps() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1131
2 0x0000000000051a7f hostStreamPlanTask() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1163
3 0x0000000000051bd9 hostStreamPlanCallback() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1175
4 0x0000000000253ffd cuEGLApiInit() ???:0
5 0x0000000000263373 cuEGLApiInit() ???:0
6 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
7 0x0000000000126850 __xmknodat() ???:0
=================================
Fatal Python error: Segmentation fault
Thread 0x00007f1755ffc640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f17567fd640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 88 in replay
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 449 in replay
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 791 in forward
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f27f27d0640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f27f17ce640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f2d322ca4c0 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/streams.py", line 225 in synchronize
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 170 in resolve_batch_result
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1123 in process_batch_result
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 519 in event_loop_overlap
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1825 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>
@Lzhang-hub Did you try the latest main branch?
@ispobock I used commit 32b44d2fcac; I will try the latest main branch.
I posted a hot fix for this case in #3727, could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!
OK, let me make a hotfix image and test it~
@yizhang2077 I tested sglang:v0.4.3 with PR #3727: sglang no longer crashes, but I found that TTFT increases hugely with --enable-dp-attention.
The serving benchmark result without --enable-dp-attention.
The serving benchmark result with --enable-dp-attention.
I posted a hot fix for this case in #3727, could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!
@yizhang2077 Thanks, sglang won't crash now, but throughput decreases significantly with --enable-dp-attention, even when the QPS is only about 2 req/s.
For high QPS scenarios, add the --enable-dp-attention argument to boost throughput
@Lzhang-hub Did you try the latest main branch?
@ispobock I used commit 32b44d2fcac; I will try the latest main branch.
Update: I tried the latest main branch; the error is the same as in #3424.
@lshmouse @ToughK DP attention is aimed at improving throughput for large batch sizes (>128). Its latency is higher than TP.
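To actually reach that regime, the benchmark needs to keep a large running batch; something along these lines (illustrative numbers, reusing the bench_serving invocation from earlier in this thread) rather than ~2 req/s:

```bash
# Illustrative only: dp-attention is expected to pay off when the running batch is
# large (roughly >128 requests), so benchmark with high concurrency.
python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --random-range-ratio 1 \
    --num-prompt 2000 \
    --request-rate 128 \
    --random-input 1024 \
    --random-output 1024
```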
@ispobock I used the main branch with commit df84ab2a and still got the same error, on 2 nodes of 8*H20.
Update: I use FlashInfer MLA (--enable-flashinfer-mla).
[2025-03-11 02:44:43 DP7 TP7] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/workspace/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/workspace/python/sglang/srt/managers/tp_worker.py", line 172, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/workspace/python/sglang/srt/model_executor/model_runner.py", line 921, in forward
return self.forward_extend(
File "/workspace/python/sglang/srt/model_executor/model_runner.py", line 882, in forward_extend
return self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 1086, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 1040, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 990, in forward
hidden_states = self.mlp(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 197, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 620, in forward
final_hidden_states = self.quant_method.apply(
File "/workspace/python/sglang/srt/layers/quantization/fp8.py", line 949, in apply
return fused_experts(
File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 921, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
return self._op(*args, **(kwargs or {}))
File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 790, in inplace_fused_experts
fused_experts_impl(
File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1150, in fused_experts_impl
torch.sum(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The rank 1 node threw a Segmentation fault.
Have you fixed it? I have the same problem.
I was also receiving this error on main (compiled this morning) using google/gemma-3-27b-it.
My end user was sending 128 concurrent requests to the system via Python; the system was set up to use 1 node with 4xA100. I just started running a new test limiting --max-running-requests to 64, which seems to help stabilize it so far.
Here is the python call to launch the backend:
python -m sglang.launch_server --model-path google/gemma-3-27b-it --tp 4 --port 18443 --host=0.0.0.0 --mem-fraction-static=0.8 --max-running-requests 64
And the stack trace from when it errored without max-running-requests specified:
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f395ea59446 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f395ea036e4 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f395eb45a18 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f395fd67726 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f395fd6c3f0 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f395fd73b5a in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f395fd7561d in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f39a87025c0 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x8609 (0x7f3a4d95c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f3a4d727353 in /lib/x86_64-linux-gnu/libc.so.6)
Still crashes when using a large --chunked-prefill-size like 4096. Does this need to be fixed? What is wrong with the kernels?
RuntimeError: CUDA error: an illegal memory access was encountered
Just a follow-up from my post a couple of days ago. The instance is still running strong with --max-running-requests 64 using Gemma 3; I haven't tried DeepSeek V3. I thought I'd provide this update in case it helps identify the issue, because if I omit --max-running-requests I get the illegal memory access errors even for this smaller model.
https://github.com/sgl-project/sglang/issues/4673 reports the same issue. Following https://github.com/sgl-project/sglang/issues/4673#issuecomment-2745578452 may fix it.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.