Unable to run DeepSeek R1 on Blackwell
System Info
8xB200 system
Built the Docker image from the TensorRT-LLM codebase as follows:

```bash
make -C docker build
```

Ran the container as:

```bash
docker run --rm -it \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
  --volume ${PWD}:/code/tensorrt_llm \
  --workdir /code/tensorrt_llm \
  tensorrt_llm/devel:latest
```
git sha dc0463b0e2f62a0aaaa3b018673440d3b39d594a
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```bash
trtllm-serve deepseek-ai/DeepSeek-R1 --extra_llm_api_options options-lat.yml --port 10001
```

where the contents of options-lat.yml are:

```yaml
enable_attention_dp: false
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: true
  cuda_graph_batch_sizes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
kv_cache_config:
  free_gpu_memory_fraction: 0.75
backend: pytorch
tensor_parallel_size: 8
moe_expert_parallel_size: 8
```
Expected behavior
The model server should start up
Actual behavior

```
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/attention.py", line 437, in forward
    compressed_q, compressed_kv, k_pe = self.fused_a(
                                        ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 409, in forward
    output = self.apply_linear(input, self.weight, self.bias)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 330, in apply_linear
    output = torch.ops.trtllm.fp8_block_scaling_gemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: N must be a multiple of 128, (N=2112)
```

Additional notes
The issue doesn't happen on Hopper, just Blackwell. I think this check is failing: https://github.com/NVIDIA/TensorRT-LLM/blob/60d4dacc47ba18b3aed425dd4c5af8cbc8068169/cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp#L135
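For what it's worth, a quick arithmetic sketch of why that assertion trips; the DeepSeek-R1 dimension breakdown below is my assumption, not something taken from the trace:

```python
# Why the fp8_block_scaling_gemm shape check fails: 2112 is not a multiple of 128.
N = 2112
print(divmod(N, 128))  # -> (16, 64): remainder 64, so the N % 128 == 0 check fails

# Assumption: 2112 is the output width of DeepSeek-R1's fused MLA "fused_a"
# projection, q_lora_rank + kv_lora_rank + qk_rope_head_dim = 1536 + 512 + 64.
# If so, N is a fixed weight dimension and does not depend on batch size.
assert 1536 + 512 + 64 == N
```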
@pankajroark
Hi, have you tried the latest main branch or followed this guide to see whether the issue still exists?
June
It looks like the issue is coming from fp8BlockScalingGemm, where N=2112 isn’t satisfying the multiple-of-128 requirement. Since this doesn’t happen on Hopper but does on Blackwell, it could be something hardware-specific affecting how TensorRT-LLM handles FP8 computations.
A few things that might be worth checking:

- Is there a difference in how `cuda_graph_batch_sizes` are handled on Blackwell vs. Hopper? It might be pushing N to an unsupported value.
- Does setting `use_cuda_graph: false` change anything? If the issue is related to CUDA graph optimizations, that could help isolate it (see the config sketch after this list).
- Have you tried forcing N to a multiple of 128 by adjusting the batch size or tensor shape, just to see whether that workaround gets things running?
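As a minimal sketch for the second point, this is the same options-lat.yml with only CUDA graphs disabled; it is purely an isolation step, not a fix for the GEMM shape check:

```yaml
# Sketch: identical to the original options-lat.yml except use_cuda_graph is
# disabled, to check whether the failure is tied to CUDA graph capture.
enable_attention_dp: false
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: false  # only change from the original config
  cuda_graph_batch_sizes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
kv_cache_config:
  free_gpu_memory_fraction: 0.75
backend: pytorch
tensor_parallel_size: 8
moe_expert_parallel_size: 8
```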
Might be worth looping in someone from the TensorRT-LLM team to confirm if this is expected behavior on Blackwell or if there’s a deeper issue at play.
Yes, in fact, these assertions are unnecessary. I will file a PR soon to fix it.
Thanks, Chang!
June