Inquiry about the 0.18 release plan and R1 TRT engine support across branches
I've noticed that the main and deepseek branches have diverged: the deepseek branch is 9 commits ahead of and 9 commits behind main. Because individual commits bundle many changes, it has been difficult to pin down the exact differences between the two branches; a quick way to quantify the divergence is sketched below.
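For anyone else trying this, a minimal sketch (assuming a local clone with both branches fetched; branch names are the ones mentioned above):

```python
# Sketch: count how far the two branches have diverged and summarize
# the per-file changes. Assumes `git` is on PATH and the repo is cloned.
import subprocess

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

# Prints "<behind>\t<ahead>" relative to origin/main.
print(git("rev-list", "--left-right", "--count",
          "origin/main...origin/deepseek"))

# Per-file diff summary between the merge base and the deepseek tip.
print(git("diff", "--stat", "origin/main...origin/deepseek"))
```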
Specifically, the README in the deepseek branch (https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3) describes how to build the R1 TRT engine, while the Hugging Face page (https://huggingface.co/nvidia/DeepSeek-R1-FP4) states that "you need 8xB200 GPU and TensorRT LLM built from source with the latest main branch." However, the main branch doesn't include the corresponding code for constructing the R1 engine.
I have a few questions:
- What is the release plan for version 0.18?
- Does the current main branch support building the TRT engine required to run R1-FP4, or do I need to build the Docker image from the deepseek branch?
- Does the current V3 engine implementation support Blackwell?
Thanks for your assistance.
Update: It looks like only quantization.h in the main branch includes "NVFP4" and "FP4", so FP4 engine support appears to be missing from the deepseek branch (a sketch of the search is below).
That raises the question: how did NVIDIA get the 20k TPS result? Is there an internal version that isn't public yet?
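For reference, a minimal sketch of the search I ran (run it from the repo root of each branch checkout; the extension list is an assumption about where such identifiers live):

```python
# Sketch: scan a TensorRT-LLM checkout for FP4-related identifiers
# to see which branch actually carries them.
from pathlib import Path

NEEDLES = ("NVFP4", "FP4")
EXTENSIONS = {".h", ".hpp", ".cpp", ".cu", ".py"}

for path in Path(".").rglob("*"):
    if path.suffix not in EXTENSIONS or not path.is_file():
        continue
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue
    for needle in NEEDLES:
        if needle in text:
            print(f"{path}: {needle}")
            break
```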
@StarDuster @junliu-mde did you find a way to build deepseek-ai/DeepSeek-R1 into a TRT engine?
I didn't succeed on B200 because the deepseek branch doesn't support SM100 and above, but I think building with the deepseek branch on H100 should work (I didn't try it, since it isn't useful to me).
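For anyone unsure which architecture they're on, a quick check (assumes PyTorch with CUDA installed; B200 reports compute capability 10.0, i.e. SM100, while H100 reports 9.0):

```python
# Sketch: query the GPU's compute capability to see whether the
# deepseek branch's kernels (built for < SM100) can target it.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"sm{major}{minor}")  # e.g. "sm100" on B200, "sm90" on H100
if major >= 10:
    print("Blackwell-class GPU: the deepseek branch won't build kernels for this arch.")
```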
@junliu-mde I also hit the same SM100 issue and am asking for help here: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/144
Guys, thanks for the feedback. If you are on Blackwell silicon, please try building and running the DeepSeek-V3 torch flow following the instructions here; the deepseek branch is deprecated and is no longer updated with the new optimized kernels and perf improvements.
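For orientation, a minimal sketch of what the torch-flow LLM API looks like (the import path, model id, and parallelism below are assumptions based on the 0.17/0.18-era examples; follow the linked instructions for the exact entry point on your version):

```python
# Sketch of the PyTorch-flow quickstart; entry points have moved between
# releases, so treat the import below as an assumption, not a guarantee.
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM  # assumed PyTorch backend entry point

# Model id and tensor parallelism are illustrative; DeepSeek-V3 needs multi-GPU.
llm = LLM(model="deepseek-ai/DeepSeek-V3", tensor_parallel_size=8)
params = SamplingParams(max_tokens=64)

for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
```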
Thank you for the update. Is there a roadmap/timeline for TRT engine support on Blackwell? I think the PyTorch backend's performance is not that good.
hi @dominicshanshan I'm curious about the TRT-LLM roadmap. Will the PyTorch workflow become the new standard and recommended approach for deploying TRT-LLM models in the future, or is it primarily a temporary solution to ease the challenge of keeping up with the rapid advancements in community models, architectures, and techniques?
Update: never mind, I got the answer from another issue.
@handoku Can you please share the solution if you have it, or the answer from the other issue you mentioned?
https://github.com/NVIDIA/TensorRT-LLM/issues/2870 TL;DR: only the PyTorch workflow will be under development going forward.
I’m closing this issue due to its prolonged inactivity. I hope the comments above have addressed the questions. If the issue still exists in the latest release, please open a new issue. Thank you!