
[Bug] sglang crashes when using enable_dp_attention to run DeepSeekV3 on 2x8xH100

Open · ToughK opened this issue 10 months ago · 22 comments

server.log

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

According to the dp-attention performance & usage notes, I enabled it with --enable-dp-attention when launching DeepSeek V3 on 2x8xH100. My command is as follows:

docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged \
  -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 \
  -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 \
  -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE \
  --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 \
  -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ \
  --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 \
  --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 \
  --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1

When I run my test scripts, the server crashes:

[2025-02-18 06:32:23 DP7 TP7] Prefill batch. #new-seq: 8, #new-token: 4096, #cached-token: 8, cache hit rate: 0.23%, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-02-18 06:32:23 DP6 TP6] Prefill batch. #new-seq: 8, #new-token: 2265, #cached-token: 8, cache hit rate: 0.38%, token usage: 0.00, #running-req: 2, #queue-req: 0
[rank2]:[E218 06:32:24.828983440 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3545f6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f3545f166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f354633ea18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f34fbe25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f34fbe2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f34fbe31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f34fbe3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f3547b375c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f35489c0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f3548a52850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[2025-02-18 06:32:24 DP2 TP2] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
    self.forward_thread_func_()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
    return self.forward_extend(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 868, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 829, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 781, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
    self.experts(hidden_states=hidden_states, router_logits=router_logits)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
    final_hidden_states = self.quant_method.apply(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
    return fused_experts(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 851, in fused_experts
    torch.ops.sglang.inplace_fused_experts(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 731, in inplace_fused_experts
    fused_experts_impl(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1057, in fused_experts_impl
    torch.sum(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 108, in forward_thread_func
    with torch.get_device_module(self.device).stream(self.forward_stream):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 595, in __exit__
    torch.cuda.set_stream(self.src_prev_stream)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 962, in _patched_set_stream
    prev_set_stream(stream)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 636, in set_stream
    _set_stream_by_id(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
    torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

terminate called after throwing an instance of 'c10::DistBackendError'
terminate called recursively
Fatal Python error: Aborted

Thread 0x00007f2fe8afc640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f1f035fe640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 113 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner

what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I tried changing the value of --mem-fraction-static, but it didn't help.
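Note: the CUDA error above suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so the illegal memory access is reported at the failing kernel instead of asynchronously. With the Docker launch used here, that is just one extra -e flag. A minimal, abridged sketch (most NCCL env vars and server flags are omitted here for brevity; keep the full set from the command above):

# Sketch only: same launch as above, plus the env var suggested by the error message.
# Expect a noticeable slowdown; this is only for capturing a more precise stack trace.
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged \
  -e CUDA_LAUNCH_BLOCKING=1 \
  -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 \
  --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 \
  -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ \
  --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 \
  --trust-remote-code --host 0.0.0.0 --port 8000 --enable-dp-attention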


Reproduction

On node 1:

docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged \
  -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 \
  -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 \
  -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE \
  --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 \
  -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ \
  --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 \
  --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 \
  --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1

On node 2:

docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged \
  -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 \
  -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 \
  -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE \
  --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 \
  -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ \
  --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 1 \
  --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 \
  --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1

Environment

Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.49.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0

NVIDIA Topology:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  NIC9  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0    X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU1    NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU2    NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU3    NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU4    NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  48-95,144-191  1              N/A
GPU5    NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  48-95,144-191  1              N/A
GPU6    NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  48-95,144-191  1              N/A
GPU7    NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   48-95,144-191  1              N/A
NIC0    PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1    NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC2    NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  PIX   X     NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC3    NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     NODE  NODE  SYS   SYS   SYS   SYS
NIC4    NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  X     NODE  SYS   SYS   SYS   SYS
NIC5    NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  NODE  X     SYS   SYS   SYS   SYS
NIC6    SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   X     NODE  NODE  NODE
NIC7    SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS   NODE  X     NODE  NODE
NIC8    SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  X     NODE
NIC9    SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9

ulimit soft: 1048576

ToughK · Feb 18 '25 06:02

Same issue on a single 8*H200 server.

hariag · Feb 18 '25 13:02

@hariag could you share the commands for 8*H200?

ispobock · Feb 18 '25 13:02

Server:

ulimit -n 4096000
python3 -m sglang.launch_server --model DeepSeek-R1 --tp 8 --trust-remote-code --port 8000 --watchdog-timeout 3600 --enable-dp-attention

Test:

ulimit -n 4096000
evalscope perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 2048 --model 'DeepSeek-R1' --api-key EMPTY --number 20480 --api openai --stream --temperature 0.6 --log-every-n-query 1024 --max-tokens 100 --max-prompt-length 100 --read-timeout 600 --connect-timeout 600 --prompt "hello"

By the way, if I remove the --enable-dp-attention option, it works perfectly but is much slower.

hariag · Feb 18 '25 14:02

Could you try to add --disable-overlap-schedule and test it again?
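For concreteness, a sketch of the single-node 8*H200 launch quoted above with that flag appended (flag order is illustrative):

# Sketch: same launch as above, plus --disable-overlap-schedule to rule out the overlap scheduler.
python3 -m sglang.launch_server --model DeepSeek-R1 --tp 8 --trust-remote-code \
    --port 8000 --watchdog-timeout 3600 --enable-dp-attention --disable-overlap-schedule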

ispobock · Feb 18 '25 16:02

Adding --disable-overlap-schedule did not help.

I attached the server-side log; please check it.

debug.log

hariag · Feb 19 '25 01:02

I also have the same issue on a single 8*H200 server. Adding --disable-overlap-schedule does not help.

python3 -m sglang.launch_server --model-path /mnt/model/  --tensor-parallel-size 8 --trust-remote-code --enable-torch-compile  --disable-cuda-graph --enable-dp-attention

python3 -m sglang.bench_serving \
        --backend sglang \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 300 \
        --request-rate 8 \
        --random-input 1024 \
        --random-output 1024 |tee -a SGLang_${model_name}_${input_len}_${output_len}_rps${i}_${DATETIME}_servering.log

yuqie · Feb 19 '25 06:02

I attached the server-side log; please check it. debug.log

I checked the log; it seems to be an issue with sgl_kernels.fp8_blockwise_scaled_mm. cc: @zhyncs @yizhang2077

ispobock · Feb 19 '25 12:02

I have the same problem.

changqingla · Feb 20 '25 03:02

Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?

yizhang2077 · Feb 20 '25 03:02

I get an "output tensor size must be equal to world_size times input tensor size" error when I add the --enable-dp-attention option on two 8*H800 nodes; without it everything is OK. Has anyone tried it with DeepSeek R1 who can help me?

hiyforever · Feb 20 '25 06:02

Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?

In my environment, inference succeeds after removing this option. What's wrong with --enable-dp-attention? It is recommended in the docs: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended

lshmouse · Feb 20 '25 07:02

I posted a hotfix for this case in #3727. Could you try again? @lshmouse @hariag @ToughK @hiyforever Thank you!

yizhang2077 · Feb 20 '25 08:02

I posted a hotfix for this case in #3727. Could you try again? @lshmouse @hariag @ToughK @hiyforever Thank you!

OK, let me make a hotfix image and test it~

lshmouse · Feb 20 '25 09:02

Mark! Same problem with tp 16 on 2x8xH800, source version 3c7bfd7eabed5e29cf907dba3e2ed875d7a92fd4.

YEXINGZHE54 · Feb 20 '25 10:02

I have the same problem without dp attention.

Lzhang-hub · Feb 21 '25 02:02

@Lzhang-hub Did you try the latest main branch?

ispobock · Feb 21 '25 02:02

I hit another problem when launching a DeepSeek-R1 model server with --enable-dp-attention --dp-size 16 --tp 16 on 2x8xH100. The rank 1 node threw a segmentation fault.

[2025-02-21 02:29:22 DP11 TP11] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP14 TP14] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP10 TP10] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP13 TP13] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP15 TP15] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP9 TP9] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP12 TP12] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP8 TP8] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
[worker0:268  :0:36540] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x16)
==== backtrace (tid:  36540) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000494f4 uploadProxyOps()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1131
 2 0x0000000000051a7f hostStreamPlanTask()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1163
 3 0x0000000000051bd9 hostStreamPlanCallback()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1175
 4 0x0000000000253ffd cuEGLApiInit()  ???:0
 5 0x0000000000263373 cuEGLApiInit()  ???:0
 6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
 7 0x0000000000126850 __xmknodat()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007f1755ffc640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f17567fd640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 88 in replay
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 449 in replay
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 791 in forward
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f27f27d0640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f27f17ce640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f2d322ca4c0 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/streams.py", line 225 in synchronize
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 170 in resolve_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1123 in process_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 519 in event_loop_overlap
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1825 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

dwq370 · Feb 21 '25 02:02

@Lzhang-hub Did you try the latest main branch?

@ispobock I am using commit 32b44d2fcac; I will try the latest main branch.

Lzhang-hub · Feb 21 '25 03:02

I posted a hotfix for this case in #3727. Could you try again? @lshmouse @hariag @ToughK @hiyforever Thank you!

OK, let me make a hotfix image and test it~

@yizhang2077 I tested sglang:v0.4.3 with PR #3727; sglang no longer crashes. But I found that TTFT increases hugely with --enable-dp-attention.

The serving benchmark result without --enable-dp-attention.

Image

The serving benchmark result with --enable-dp-attention.

Image

lshmouse · Feb 21 '25 03:02

I posted a hotfix for this case in #3727. Could you try again? @lshmouse @hariag @ToughK @hiyforever Thank you!

@yizhang2077 Thanks, sglang no longer crashes, but throughput decreases significantly with --enable-dp-attention, even when the QPS is only about 2 req/s.

For high QPS scenarios, add the --enable-dp-attention argument to boost throughput

ToughK · Feb 21 '25 06:02

@Lzhang-hub Did you try the latest main branch?

@ispobock I am using commit 32b44d2fcac; I will try the latest main branch.

Update: I tried the latest main branch and got the same error as #3424.

Lzhang-hub · Feb 21 '25 06:02

@lshmouse @ToughK DP attention is aimed at improving throughput for large batch sizes (>128); its latency is higher than plain TP.
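As a rough way to check whether you are in that regime, the bench_serving invocation from earlier in this thread can be rerun with enough in-flight prompts that the running batch stays well above 128. The numbers below are illustrative only, not a recommended configuration:

# Illustrative load test: enough prompts and a high request rate so the running
# batch size exceeds ~128, which is where --enable-dp-attention is expected to
# pay off in throughput (at the cost of per-request latency).
python3 -m sglang.bench_serving \
        --backend sglang \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 2000 \
        --request-rate 128 \
        --random-input 1024 \
        --random-output 1024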

ispobock · Feb 21 '25 06:02

@ispobock I used the main branch with commit df84ab2a and still get the same error, on 2 nodes of 8*H20.

Update: I am using FlashInfer MLA (--enable-flashinfer-mla).

[2025-03-11 02:44:43 DP7 TP7] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/workspace/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
    self.forward_thread_func_()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
  File "/workspace/python/sglang/srt/managers/tp_worker.py", line 172, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/workspace/python/sglang/srt/model_executor/model_runner.py", line 921, in forward
    return self.forward_extend(
  File "/workspace/python/sglang/srt/model_executor/model_runner.py", line 882, in forward_extend
    return self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 1086, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 1040, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 990, in forward
    hidden_states = self.mlp(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/python/sglang/srt/models/deepseek_v2.py", line 197, in forward
    self.experts(hidden_states=hidden_states, router_logits=router_logits)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 620, in forward
    final_hidden_states = self.quant_method.apply(
  File "/workspace/python/sglang/srt/layers/quantization/fp8.py", line 949, in apply
    return fused_experts(
  File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 921, in fused_experts
    torch.ops.sglang.inplace_fused_experts(
  File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 790, in inplace_fused_experts
    fused_experts_impl(
  File "/workspace/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1150, in fused_experts_impl
    torch.sum(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Lzhang-hub · Mar 11 '25 02:03

rank 1 node threw Segmentation fault.

Have you fixed it? I have the same problem.

Xu-backup · Mar 14 '25 06:03

I was also receiving this error on main (compiled this morning) using google/gemma-3-27b-it.

My end user was sending 128 concurrent requests to the system via Python; the system was set up to use 1 node with 4xA100. I just started running a new test limiting --max-running-requests to 64, which seems to help stabilize it so far.

Here is the Python call to launch the backend:

python -m sglang.launch_server --model-path google/gemma-3-27b-it --tp 4 --port 18443 --host=0.0.0.0 --mem-fraction-static=0.8 --max-running-requests 64

And the stack trace from the run where I didn't specify max-running-requests:

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f395ea59446 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f395ea036e4 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f395eb45a18 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f395fd67726 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f395fd6c3f0 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f395fd73b5a in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f395fd7561d in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f39a87025c0 in /mnt/isgnas/home/user/miniconda3/envs/sglang/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x8609 (0x7f3a4d95c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f3a4d727353 in /lib/x86_64-linux-gnu/libc.so.6)

tblattner · Mar 17 '25 19:03

It still crashes when using a large chunk-prefill-size like 4096. Does this need to be fixed? What is wrong with the kernels?

RuntimeError: CUDA error: an illegal memory access was encountered

chengmengli06 · Mar 19 '25 06:03

Just a follow-up from my post a couple of days ago. The instance is still running strong with --max-running-requests 64 using Gemma 3. I haven't tried DeepSeekV3. Thought I'd provide this update in case it helps identify issues: if I omit max-running-requests, I get the illegal memory access errors even for the smaller model.

tblattner · Mar 19 '25 13:03

https://github.com/sgl-project/sglang/issues/4673 reports the same issue. Following https://github.com/sgl-project/sglang/issues/4673#issuecomment-2745578452 may fix it.

ch-wan · Mar 22 '25 20:03

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] · May 22 '25 00:05