[Bug] sglang crashes when using --enable-dp-attention to run DeepSeek V3 on 2x8xH100
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
Following the dp-attention performance & usage docs, I enabled it with --enable-dp-attention when launching DeepSeek V3 on 2x8xH100. My command is as below:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
When I run my test scripts, the server crashes:
[2025-02-18 06:32:23 DP7 TP7] Prefill batch. #new-seq: 8, #new-token: 4096, #cached-token: 8, cache hit rate: 0.23%, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-02-18 06:32:23 DP6 TP6] Prefill batch. #new-seq: 8, #new-token: 2265, #cached-token: 8, cache hit rate: 0.38%, token usage: 0.00, #running-req: 2, #queue-req: 0
[rank2]:[E218 06:32:24.828983440 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3545f6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f3545f166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f354633ea18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f34fbe25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f34fbe2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f34fbe31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f34fbe3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f3547b375c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f35489c0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f3548a52850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[2025-02-18 06:32:24 DP2 TP2] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
    self.forward_thread_func_()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
    return self.forward_extend(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 868, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 829, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 781, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
    self.experts(hidden_states=hidden_states, router_logits=router_logits)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
    final_hidden_states = self.quant_method.apply(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
    return fused_experts(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 851, in fused_experts
    torch.ops.sglang.inplace_fused_experts(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 731, in inplace_fused_experts
    fused_experts_impl(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1057, in fused_experts_impl
    torch.sum(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 108, in forward_thread_func
    with torch.get_device_module(self.device).stream(self.forward_stream):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 595, in __exit__
    torch.cuda.set_stream(self.src_prev_stream)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 962, in _patched_set_stream
    prev_set_stream(stream)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 636, in set_stream
    _set_stream_by_id(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
    torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::DistBackendError'
terminate called recursively
Fatal Python error: Aborted

Thread 0x00007f2fe8afc640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f1f035fe640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 113 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner

what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I tried changing the value of --mem-fraction-static, but it didn't help.
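Since the log itself suggests it, a reasonable next step for localizing the fault is to relaunch with synchronous CUDA launches. This is only a sketch: it is identical to the node-0 command in the Reproduction section except for the added CUDA_LAUNCH_BLOCKING=1.

```bash
# Node-0 launch exactly as in the Reproduction section, with CUDA_LAUNCH_BLOCKING=1 added
# so the illegal memory access is reported at the faulting kernel (debug-only: this is slow).
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged \
  -e CUDA_LAUNCH_BLOCKING=1 \
  -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 \
  -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 \
  -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE \
  --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 \
  -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ \
  --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 \
  --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 \
  --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78
```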
Reproduction
On node 1:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
On node 2:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
Environment
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.49.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  NIC9  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0             N/A
GPU1  NV18   X    NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0             N/A
GPU2  NV18  NV18   X    NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   0-47,96-143    0             N/A
GPU3  NV18  NV18  NV18   X    NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   0-47,96-143    0             N/A
GPU4  NV18  NV18  NV18  NV18   X    NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  48-95,144-191  1             N/A
GPU5  NV18  NV18  NV18  NV18  NV18   X    NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  48-95,144-191  1             N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18   X    NV18  SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  48-95,144-191  1             N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18   X    SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   48-95,144-191  1             N/A
NIC0  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS    X    NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE   X    PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC2  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  PIX    X    NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC3  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE   X    NODE  NODE  SYS   SYS   SYS   SYS
NIC4  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE   X    NODE  SYS   SYS   SYS   SYS
NIC5  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE   X    SYS   SYS   SYS   SYS
NIC6  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS    X    NODE  NODE  NODE
NIC7  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   NODE   X    NODE  NODE
NIC8  SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE   X    NODE
NIC9  SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE   X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
ulimit soft: 1048576
Same issue on a single 8*H200 server.
@hariag could you share the commands for 8*H200?
server: ulimit -n 4096000 python3 -m sglang.launch_server --model DeepSeek-R1 --tp 8 --trust-remote-code --port 8000 --watchdog-timeout 3600 --enable-dp-attention
test: ulimit -n 4096000 evalscope perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 2048 --model 'DeepSeek-R1' --api-key EMPTY --number 20480 --api openai --stream --temperature 0.6 --log-every-n-query 1024 --max-tokens 100 --max-prompt-length 100 --read-timeout 600 --connect-timeout 600 --prompt "hello"
By the way, if I remove the --enable-dp-attention option, it works perfectly but is much slower.
Could you try to add --disable-overlap-schedule and test it again?
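For concreteness, that suggestion applied to the single-node 8*H200 launch above would look like this (a sketch; only the last flag is new):

```bash
# The earlier single-node 8*H200 launch, with only the suggested flag appended.
ulimit -n 4096000
python3 -m sglang.launch_server --model DeepSeek-R1 --tp 8 --trust-remote-code \
    --port 8000 --watchdog-timeout 3600 --enable-dp-attention --disable-overlap-schedule
```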
I also have the same issue on a single 8*H200 server. Adding --disable-overlap-schedule does not help.
python3 -m sglang.launch_server --model-path /mnt/model/ --tensor-parallel-size 8 --trust-remote-code --enable-torch-compile --disable-cuda-graph --enable-dp-attention
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name random \
--random-range-ratio 1 \
--num-prompt 300 \
--request-rate 8 \
--random-input 1024 \
--random-output 1024 |tee -a SGLang_${model_name}_${input_len}_${output_len}_rps${i}_${DATETIME}_servering.log
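The tee target above uses shell variables from my benchmark wrapper script; hypothetical definitions like these make the command runnable as pasted:

```bash
# Hypothetical values for the shell variables used in the tee log filename above;
# they come from my benchmark wrapper script and are not sglang options.
model_name=DeepSeek-V3
input_len=1024
output_len=1024
i=8                               # request rate for this run
DATETIME=$(date +%Y%m%d_%H%M%S)
```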
I attached the server side log, please check it. debug.log
I checked the log; it seems to be an issue with sgl_kernels.fp8_blockwise_scaled_mm.
cc: @zhyncs @yizhang2077
I have the same problem.
Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?
I meet "output tensor size must be equal to world_size times input tensor size" error when add --enable-dp-attention option in two 8*H800, and without it everything ok. Is anyone try it on deepseek r1,can give me help
Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?
In my environment, inference succeeds after removing this option. So what's wrong with --enable-dp-attention? It is recommended in the docs: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended
I posted a hot fix for this case in #3727, could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!
I posted a hot fix for this case in #3727, could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!
OK, let me make a hotfix image and test it~
Mark! The same problem with tp 16 on 2x8xH800, source version 3c7bfd7eabed5e29cf907dba3e2ed875d7a92fd4.
I have the same problem without dp attention.
@Lzhang-hub Did you try the latest main branch?
I met another problem when launching a DeepSeek-R1 model server with the arguments --enable-dp-attention --dp-size 16 --tp 16 on 2x8xH100. The rank 1 node threw a Segmentation fault.
[2025-02-21 02:29:22 DP11 TP11] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP14 TP14] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP10 TP10] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP13 TP13] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP15 TP15] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP9 TP9] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP12 TP12] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP8 TP8] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
[worker0:268 :0:36540] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x16)
==== backtrace (tid: 36540) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x00000000000494f4 uploadProxyOps() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1131
2 0x0000000000051a7f hostStreamPlanTask() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1163
3 0x0000000000051bd9 hostStreamPlanCallback() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1175
4 0x0000000000253ffd cuEGLApiInit() ???:0
5 0x0000000000263373 cuEGLApiInit() ???:0
6 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
7 0x0000000000126850 __xmknodat() ???:0
=================================
Fatal Python error: Segmentation fault
Thread 0x00007f1755ffc640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f17567fd640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 88 in replay
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 449 in replay
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 791 in forward
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f27f27d0640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f27f17ce640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f2d322ca4c0 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/streams.py", line 225 in synchronize
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 170 in resolve_batch_result
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1123 in process_batch_result
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 519 in event_loop_overlap
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1825 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>
@Lzhang-hub Did you try the latest main branch?
@ispobock I used commit 32b44d2fcac; I will try the latest main branch.
I posted a hot fix for this case in #3727, could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!
OK, let me make a hotfix image and test it~
@yizhang2077 I tested sglang:v0.4.3 with PR #3727: sglang no longer crashes, but I found that TTFT increases hugely with --enable-dp-attention.
The serving benchmark result without --enable-dp-attention.
The serving benchmark result with --enable-dp-attention.
I posted a hot fix for this case in #3727, could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!
@yizhang2077 Thanks, sglang won't crash now, but throughput decreases significantly with --enable-dp-attention, even when the QPS is only about 2 req/s.
For high QPS scenarios, add the --enable-dp-attention argument to boost throughput
@Lzhang-hub Did you try the latest main branch?
@ispobock I used commit 32b44d2fcac; I will try the latest main branch.
Update: I tried the latest main branch; the error is the same as in #3424.
@lshmouse @ToughK DP attention is aimed at improving throughput for large batch sizes (>128). Its latency is higher than TP.
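To actually reach that regime, the benchmark needs to keep a large running batch; something along these lines (illustrative numbers, reusing the bench_serving invocation from earlier in this thread) rather than ~2 req/s:

```bash
# Illustrative only: dp-attention is expected to pay off when the running batch is
# large (roughly >128 requests), so benchmark with high concurrency.
python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --random-range-ratio 1 \
    --num-prompt 2000 \
    --request-rate 128 \
    --random-input 1024 \
    --random-output 1024
```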
@ispobock I used the main branch with commit df84ab2a and still got the same error, on 2 nodes of 8*H20.
Update: I use FlashInfer MLA (--enable-flashinfer-mla).
[2025-03-11 02:44:43 DP7 TP7] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/workspace/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/workspace/python/sglang/srt/managers/tp_worker.py", line 172, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/workspace/python/sglang/srt/model_executor/model_runner.py", line 921, in forward
return self.forward_extend(
File "/workspace/python/sglang/srt/model_executor/model_runner.py", line 882, in forward_extend
return self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 1086, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 1040, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 990, in forward
hidden_states = self.mlp(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 197, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 620, in forward
final_hidden_states = self.quant_method.apply(
File "/workspace/python/sglang/srt/layers/quantization/fp8.py", line 949, in apply
return fused_experts(
File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 921, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
return self._op(*args, **(kwargs or {}))
File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 790, in inplace_fused_experts
fused_experts_impl(
File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1150, in fused_experts_impl
torch.sum(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The rank 1 node threw a Segmentation fault.
Have you fixed it? I have the same problem.
I was also receiving this error on main (compiled this morning) using google/gemma-3-27b-it.
My end user was sending 128 concurrent requests to the system via Python; the system was set up to use 1 node with 4xA100. I just started running a new test limiting --max-running-requests to 64, which seems to help stabilize it so far.
Here is the python call to launch the backend:
python -m sglang.launch_server --model-path google/gemma-3-27b-it --tp 4 --port 18443 --host=0.0.0.0 --mem-fraction-static=0.8 --max-running-requests 64
And the stack trace from when it errored without max-running-requests specified:
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f395ea59446 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f395ea036e4 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f395eb45a18 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f395fd67726 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f395fd6c3f0 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f395fd73b5a in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f395fd7561d in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f39a87025c0 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x8609 (0x7f3a4d95c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f3a4d727353 in /lib/x86_64-linux-gnu/libc.so.6)
Still crashes when using a large --chunked-prefill-size like 4096. Does this need to be fixed? What is wrong with the kernels?
RuntimeError: CUDA error: an illegal memory access was encountered
Just a follow-up from my post a couple of days ago. The instance is still running strong with --max-running-requests 64 using Gemma 3; I haven't tried DeepSeek V3. I thought I'd provide this update in case it helps identify the issue, because if I omit --max-running-requests I get the illegal memory access errors even for this smaller model.
https://github.com/sgl-project/sglang/issues/4673 reports the same issue. Following https://github.com/sgl-project/sglang/issues/4673#issuecomment-2745578452 may fix it.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.