[Bug] [DeepSeek-R1] Error: Failed to initialize the TMA descriptor due to invalid argument on B200
Checklist
- [x] I searched related issues but found no solution.
- [x] The bug persists in the latest version.
- [x] Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- [x] If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- [x] Please use English. Otherwise, it will be closed.
Describe the bug
When running inference with the DeepSeek-R1 model on long-context inputs, the FlashInfer TRT-LLM MLA backend fails to initialize the TMA descriptor inside `trtllm_ragged_attention_deepseek` during chunked-prefix attention prefill (see the traceback below), and the scheduler process crashes.
Error: Failed to initialize the TMA descriptor due to invalid argument
tmaFormat: 9 dim: 4 gmem: 0x7b4cc0000000
Shape: 192 131072 128 1 3806834432
Stride: 49152 384 2305843007066210304 273
tileShapes: 64 128 1 1 49152
tileStrides: 1 1 1 1 32
swizzleType: 3
Error: Failed to initialize the TMA descriptor due to invalid argument
tmaFormat: 9 dim: 4 gmem: 0x74df40000000
Shape: 192 131072 128 1 48
Stride: 49152 384 2305843007066210304 2497
tileShapes: 64 128 1 1 49152
tileStrides: 1 1 1 1 32
swizzleType: 3
[2025-11-22 22:22:32 DP4 TP4] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2736, in run_scheduler_process
scheduler.event_loop_overlap()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1024, in event_loop_overlap
batch_result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2032, in run_batch
batch_result = self.model_worker.forward_batch_generation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 371, in forward_batch_generation
logits_output, can_run_cuda_graph = self.model_runner.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2225, in forward
output = self._forward_raw(
^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2284, in _forward_raw
ret = self.forward_extend(
^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2170, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3448, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3258, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2971, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1446, in forward
return self.forward_core(s)
^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1541, in forward_core
return self.forward_normal_chunked_kv_core(*inner_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2665, in forward_normal_chunked_kv_core
attn_output = self._chunked_prefix_attn_mha(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2621, in _chunked_prefix_attn_mha
output, lse = self.attn_mha(q, k, v, forward_batch, save_kv_cache=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 123, in forward
return forward_batch.attn_backend.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 101, in forward
return self.forward_extend(
^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/trtllm_mla_backend.py", line 1052, in forward_extend
return flashinfer.prefill.trtllm_ragged_attention_deepseek(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 3291, in trtllm_ragged_attention_deepseek
run_func(
File "python/tvm_ffi/cython/function.pxi", line 901, in core.Function.__call__
RuntimeError: Error in function 'buildNdTmaDescriptor' at /workspace/include/flashinfer/trtllm/fmha/kernelParams.h:528: Check failed: false
Reproduction
Launch:
python -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --dp 8 --enable-dp-attention
Evaluation:
# Use NemoSkills to evaluate the aalcr benchmark
ns prepare_data aalcr
ns eval --server_type=openai --model=deepseek-ai/DeepSeek-R1 --server_address=http://127.0.0.1:30000/v1 --benchmarks=aalcr:4 --output_dir=/sgl-workspace/files/dpskr1_aalcr --judge_model=deepseek-ai/DeepSeek-R1 --judge_server_type=openai --judge_server_address=http://127.0.0.1:30000/v1 ++max_concurrent_requests=100 ++server.api_key=dummy ++inference.temperature=0.0 ++inference.top_p=1.0 ++inference.tokens_to_generate=4096
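The crash can also be triggered without NemoSkills by sending a single long-context request through the OpenAI-compatible endpoint of the server launched above. The sketch below is a minimal, hypothetical reproduction: the prompt content and length (~100k tokens of filler) are assumptions, since the exact context length at which the TMA descriptor build fails is not known; model name, server address, API key, and sampling parameters match the eval command.

```python
# Hypothetical minimal long-context request against the server launched above.
# The filler length is a guess; adjust it until the chunked-prefix prefill path is hit.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="dummy")

long_context = "The quick brown fox jumps over the lazy dog. " * 20000  # rough filler text

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {"role": "user", "content": long_context + "\n\nSummarize the text above."}
    ],
    temperature=0.0,
    top_p=1.0,
    max_tokens=4096,
)
print(resp.choices[0].message.content)
```

The error shown above appears in the server (scheduler) log rather than in the client output.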
Environment
GPU: NVIDIA B200 × 8
Docker image: lmsysorg/sglang:dev