[Bug] [DeepSeek-R1] Error: Failed to initialize the TMA descriptor due to invalid argument on B200
Checklist
- [x] I searched related issues but found no solution.
- [x] The bug persists in the latest version.
- [x] Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- [x] If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- [x] Please use English. Otherwise, it will be closed.
Describe the bug
When running inference with the DeepSeek-R1 model on long-context inputs, the FlashInfer TRT-LLM MLA backend fails to initialize the TMA descriptor inside `trtllm_ragged_attention_deepseek` during chunked-prefix attention prefill (see the traceback below), and the scheduler process crashes.
Error: Failed to initialize the TMA descriptor due to invalid argument
tmaFormat: 9 dim: 4 gmem: 0x7b4cc0000000
Shape: 192 131072 128 1 3806834432
Stride: 49152 384 2305843007066210304 273
tileShapes: 64 128 1 1 49152
tileStrides: 1 1 1 1 32
swizzleType: 3
Error: Failed to initialize the TMA descriptor due to invalid argument
tmaFormat: 9 dim: 4 gmem: 0x74df40000000
Shape: 192 131072 128 1 48
Stride: 49152 384 2305843007066210304 2497
tileShapes: 64 128 1 1 49152
tileStrides: 1 1 1 1 32
swizzleType: 3
[2025-11-22 22:22:32 DP4 TP4] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2736, in run_scheduler_process
scheduler.event_loop_overlap()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1024, in event_loop_overlap
batch_result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2032, in run_batch
batch_result = self.model_worker.forward_batch_generation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 371, in forward_batch_generation
logits_output, can_run_cuda_graph = self.model_runner.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2225, in forward
output = self._forward_raw(
^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2284, in _forward_raw
ret = self.forward_extend(
^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2170, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3448, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3258, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2971, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1446, in forward
return self.forward_core(s)
^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1541, in forward_core
return self.forward_normal_chunked_kv_core(*inner_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2665, in forward_normal_chunked_kv_core
attn_output = self._chunked_prefix_attn_mha(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2621, in _chunked_prefix_attn_mha
output, lse = self.attn_mha(q, k, v, forward_batch, save_kv_cache=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 123, in forward
return forward_batch.attn_backend.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 101, in forward
return self.forward_extend(
^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/trtllm_mla_backend.py", line 1052, in forward_extend
return flashinfer.prefill.trtllm_ragged_attention_deepseek(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 3291, in trtllm_ragged_attention_deepseek
run_func(
File "python/tvm_ffi/cython/function.pxi", line 901, in core.Function.__call__
RuntimeError: Error in function 'buildNdTmaDescriptor' at /workspace/include/flashinfer/trtllm/fmha/kernelParams.h:528: Check failed: false
Reproduction
Launch:
python -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --dp 8 --enable-dp-attention
Evaluation:
# Use NemoSkills to evaluate the aalcr benchmark
ns prepare_data aalcr
ns eval --server_type=openai --model=deepseek-ai/DeepSeek-R1 --server_address=http://127.0.0.1:30000/v1 --benchmarks=aalcr:4 --output_dir=/sgl-workspace/files/dpskr1_aalcr --judge_model=deepseek-ai/DeepSeek-R1 --judge_server_type=openai --judge_server_address=http://127.0.0.1:30000/v1 ++max_concurrent_requests=100 ++server.api_key=dummy ++inference.temperature=0.0 ++inference.top_p=1.0 ++inference.tokens_to_generate=4096
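The crash can also be triggered without NemoSkills by sending a single long-context request through the OpenAI-compatible endpoint of the server launched above. The sketch below is a minimal, hypothetical reproduction: the prompt content and length (~100k tokens of filler) are assumptions, since the exact context length at which the TMA descriptor build fails is not known; model name, server address, API key, and sampling parameters match the eval command.

```python
# Hypothetical minimal long-context request against the server launched above.
# The filler length is a guess; adjust it until the chunked-prefix prefill path is hit.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="dummy")

long_context = "The quick brown fox jumps over the lazy dog. " * 20000  # rough filler text

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {"role": "user", "content": long_context + "\n\nSummarize the text above."}
    ],
    temperature=0.0,
    top_p=1.0,
    max_tokens=4096,
)
print(resp.choices[0].message.content)
```

The error shown above appears in the server (scheduler) log rather than in the client output.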
Environment
GPU: NVIDIA B200 × 8
Docker image: lmsysorg/sglang:dev