Unable to run DeepSeek R1 on Blackwell
System Info
8xB200 system
Built the Docker image from the TensorRT-LLM codebase as follows:

```bash
make -C docker build
```

Ran the container as:

```bash
docker run --rm -it \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
  --volume ${PWD}:/code/tensorrt_llm \
  --workdir /code/tensorrt_llm \
  tensorrt_llm/devel:latest
```
git sha dc0463b0e2f62a0aaaa3b018673440d3b39d594a
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```bash
trtllm-serve deepseek-ai/DeepSeek-R1 --extra_llm_api_options options-lat.yml --port 10001
```

where the contents of options-lat.yml are:

```yaml
enable_attention_dp: false
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: true
  cuda_graph_batch_sizes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
kv_cache_config:
  free_gpu_memory_fraction: 0.75
backend: pytorch
tensor_parallel_size: 8
moe_expert_parallel_size: 8
```
Expected behavior
The model server should start up
Actual behavior

```
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/attention.py", line 437, in forward
    compressed_q, compressed_kv, k_pe = self.fused_a(
                                        ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 409, in forward
    output = self.apply_linear(input, self.weight, self.bias)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 330, in apply_linear
    output = torch.ops.trtllm.fp8_block_scaling_gemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: N must be a multiple of 128, (N=2112)
```

Additional notes
The issue doesn't happen on Hopper, just Blackwell. I think this check is failing: https://github.com/NVIDIA/TensorRT-LLM/blob/60d4dacc47ba18b3aed425dd4c5af8cbc8068169/cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp#L135
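For what it's worth, a quick arithmetic sketch of why that assertion trips; the DeepSeek-R1 dimension breakdown below is my assumption, not something taken from the trace:

```python
# Why the fp8_block_scaling_gemm shape check fails: 2112 is not a multiple of 128.
N = 2112
print(divmod(N, 128))  # -> (16, 64): remainder 64, so the N % 128 == 0 check fails

# Assumption: 2112 is the output width of DeepSeek-R1's fused MLA "fused_a"
# projection, q_lora_rank + kv_lora_rank + qk_rope_head_dim = 1536 + 512 + 64.
# If so, N is a fixed weight dimension and does not depend on batch size.
assert 1536 + 512 + 64 == N
```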
@pankajroark
Hi, have you tried the latest main branch or followed this guide to see whether the issue still exists?
June
It looks like the issue is coming from fp8BlockScalingGemm, where N=2112 isn’t satisfying the multiple-of-128 requirement. Since this doesn’t happen on Hopper but does on Blackwell, it could be something hardware-specific affecting how TensorRT-LLM handles FP8 computations.
A few things that might be worth checking:

- Is there a difference in how `cuda_graph_batch_sizes` are handled on Blackwell vs. Hopper? It might be pushing N to an unsupported value.
- Does setting `use_cuda_graph: false` change anything? If the issue is related to CUDA graph optimizations, that could help isolate it (see the config sketch after this list).
- Have you tried forcing N to a multiple of 128 by adjusting the batch size or tensor shape, just to see whether that workaround gets things running?
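As a minimal sketch for the second point, this is the same options-lat.yml with only CUDA graphs disabled; it is purely an isolation step, not a fix for the GEMM shape check:

```yaml
# Sketch: identical to the original options-lat.yml except use_cuda_graph is
# disabled, to check whether the failure is tied to CUDA graph capture.
enable_attention_dp: false
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: false  # only change from the original config
  cuda_graph_batch_sizes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
kv_cache_config:
  free_gpu_memory_fraction: 0.75
backend: pytorch
tensor_parallel_size: 8
moe_expert_parallel_size: 8
```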
Might be worth looping in someone from the TensorRT-LLM team to confirm if this is expected behavior on Blackwell or if there’s a deeper issue at play.
Yes, in fact, these assertions are unnecessary. I will file a PR soon to fix it.
Thanks, Chang!
June