[Bug] DeepSeek-R1 illegal memory access on 2*8*H20
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
Hey guys, I tried following the optimization guide and turned on DP attention, but I sometimes run into a tricky CUDA illegal memory access during stress testing. With DP attention turned off, everything is fine. Any idea how to fix this? Also, performance with it enabled was actually worse — throughput dropped by roughly half. @zhyncs
Reproduction
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] TpModelWorkerClient hit an exception: Traceback (most recent call last):
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] self.forward_thread_func_()
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/utils/contextlib.py", line 116, in decorate_context
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return func(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] logits_output, next_token_ids = self.worker.forward_batch_generation(
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] logits_output = self.model_runner.forward(forward_batch)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/model_executor/model_runner.py", line 785, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self.forward_extend(forward_batch)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/model_executor/model_runner.py", line 750, in forward_extend
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self.model.forward(
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return func(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/models/deepseek_v2.py", line 858, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] hidden_states = self.model(input_ids, positions, forward_batch)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self._call_impl(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return forward_call(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/models/deepseek_v2.py", line 819, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] hidden_states, residual = layer(
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self._call_impl(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return forward_call(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/models/deepseek_v2.py", line 757, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] hidden_states = self.self_attn(
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return self._call_impl(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] return forward_call(*args, **kwargs)
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] File "/data/private/sglang-dev/sglang/python/sglang/srt/models/deepseek_v2.py", line 512, in forward
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] and forward_batch.extend_prefix_lens.sum() == 0
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 02-09 16:43:26 tp_worker_overlap_thread.py:112] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Environment
python -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 --dist-init-addr xxx --nnodes 2 --node-rank 0 --trust-remote-code --enable-dp-attention, with --node-rank changed to 1 on the other node.
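Spelled out, the two launch commands look like this (same flags on both nodes, only `--node-rank` differs; the `xxx` placeholder for `--dist-init-addr` is kept from the report):

```shell
# Node 0 (rank 0)
python -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 \
    --dist-init-addr xxx --nnodes 2 --node-rank 0 \
    --trust-remote-code --enable-dp-attention

# Node 1: identical command except for the rank
python -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 \
    --dist-init-addr xxx --nnodes 2 --node-rank 1 \
    --trust-remote-code --enable-dp-attention
```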
We haven't found the root cause of your problem yet. You can try running some low-concurrency tasks (say 8 concurrent) to warm up, then increase the concurrency (say 512 concurrent) and see if that avoids the crash. You can also try updating the pip package.
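The warm-up advice above can be sketched as a small ramped load test. This is only an illustration, not code from the thread: `send_request` is a stand-in for your actual HTTP call to the sglang server, and the stage sizes mirror the suggested 8-then-512 ramp.

```python
import concurrent.futures


def ramped_stress_test(send_request, stages=((8, 32), (512, 2048))):
    """Run the load in stages: warm up at low concurrency, then ramp up.

    stages: sequence of (concurrency, num_requests) pairs. The first,
    low-concurrency stage gives the server a chance to warm up before
    the high-concurrency stage that triggered the crash in this report.
    """
    results = []
    for concurrency, num_requests in stages:
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
            # Submit all requests for this stage, then wait for them in order.
            futures = [pool.submit(send_request, i) for i in range(num_requests)]
            results.extend(f.result() for f in futures)
    return results
```

Replace `send_request` with a function that POSTs a prompt to your running server; the ramp itself is just the two-stage schedule.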
Then why does performance drop by half after enabling it, instead of improving as the tutorial says? @minleminzui
Mark! Same problem with tp 16 on 2*8 H800, built from source at commit 3c7bfd7eabed5e29cf907dba3e2ed875d7a92fd4. Maybe dp-attention is the culprit.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.