[Bug] DeepSeek-R1 illegal memory access on 2*8*H20
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
Hey guys, I tried following the optimization guide and turned on DP attention, but I sometimes run into a tricky CUDA illegal memory access during stress testing. With DP attention turned off, everything is fine. Any idea how to fix this? Also, performance with it enabled was actually worse — throughput dropped by roughly half. @zhyncs
Reproduction
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] TpModelWorkerClient hit an exception: Traceback (most recent call last):
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] self.forward_thread_func_()
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/utils/contextlib.py", line 116, in decorate_context
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return func(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] logits_output, next_token_ids = self.worker.forward_batch_generation(
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] logits_output = self.model_runner.forward(forward_batch)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/model_executor/model_runner.py", line 785, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self.forward_extend(forward_batch)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/model_executor/model_runner.py", line 750, in forward_extend
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self.model.forward(
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return func(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/models/deepseek_v2.py", line 858, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] hidden_states = self.model(input_ids, positions, forward_batch)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self._call_impl(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return forward_call(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/models/deepseek_v2.py", line 819, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] hidden_states, residual = layer(
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self._call_impl(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return forward_call(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/models/deepseek_v2.py", line 757, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] hidden_states = self.self_attn(
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self._call_impl(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return forward_call(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/models/deepseek_v2.py", line 512, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] and forward_batch.extend_prefix_lens.sum() == 0
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Environment
python -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 --dist-init-addr xxx --nnodes 2 --node-rank 0 --trust-remote-code --enable-dp-attention, with --node-rank changed to 1 on the other node.
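Spelled out, the two launch commands look like this (same flags on both nodes, only `--node-rank` differs; the `xxx` placeholder for `--dist-init-addr` is kept from the report):

```shell
# Node 0 (rank 0)
python -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 \
    --dist-init-addr xxx --nnodes 2 --node-rank 0 \
    --trust-remote-code --enable-dp-attention

# Node 1: identical command except for the rank
python -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 \
    --dist-init-addr xxx --nnodes 2 --node-rank 1 \
    --trust-remote-code --enable-dp-attention
```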
We haven't found the root cause of your problem yet. You can try running some low-concurrency tasks (say 8 concurrent) to warm up, then increase the concurrency (say 512 concurrent) and see if that avoids the crash. You can also try updating the pip package.
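The warm-up advice above can be sketched as a small ramped load test. This is only an illustration, not code from the thread: `send_request` is a stand-in for your actual HTTP call to the sglang server, and the stage sizes mirror the suggested 8-then-512 ramp.

```python
import concurrent.futures


def ramped_stress_test(send_request, stages=((8, 32), (512, 2048))):
    """Run the load in stages: warm up at low concurrency, then ramp up.

    stages: sequence of (concurrency, num_requests) pairs. The first,
    low-concurrency stage gives the server a chance to warm up before
    the high-concurrency stage that triggered the crash in this report.
    """
    results = []
    for concurrency, num_requests in stages:
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
            # Submit all requests for this stage, then wait for them in order.
            futures = [pool.submit(send_request, i) for i in range(num_requests)]
            results.extend(f.result() for f in futures)
    return results
```

Replace `send_request` with a function that POSTs a prompt to your running server; the ramp itself is just the two-stage schedule.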
Then why does performance drop by half after enabling it, instead of improving as the tutorial says? @minleminzui
Mark! Same problem with tp 16 on 2*8 H800, built from source at commit 3c7bfd7eabed5e29cf907dba3e2ed875d7a92fd4. Maybe dp-attention is the culprit.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.