Inquiry about the 0.18 release plan and R1 TRT engine support across branches
I've noticed that the main and deepseek branches have diverged: the deepseek branch is 9 commits ahead of and 9 commits behind main. Because individual commits bundle many changes, it has been difficult to pin down the exact differences between the two branches; a quick way to quantify the divergence is sketched below.
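For anyone else trying this, a minimal sketch (assuming a local clone with both branches fetched; branch names are the ones mentioned above):

```python
# Sketch: count how far the two branches have diverged and summarize
# the per-file changes. Assumes `git` is on PATH and the repo is cloned.
import subprocess

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

# Prints "<behind>\t<ahead>" relative to origin/main.
print(git("rev-list", "--left-right", "--count",
          "origin/main...origin/deepseek"))

# Per-file diff summary between the merge base and the deepseek tip.
print(git("diff", "--stat", "origin/main...origin/deepseek"))
```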
Specifically, the README in the deepseek branch (https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3) describes how to build the R1 TRT engine, while the Hugging Face page (https://huggingface.co/nvidia/DeepSeek-R1-FP4) states that "you need 8xB200 GPU and TensorRT LLM built from source with the latest main branch." However, the main branch doesn't include the corresponding code for constructing the R1 engine.
I have a few questions:
- What is the release plan for version 0.18?
- Does the current main branch support building the TRT engine required to run R1-FP4, or do I need to build the Docker image from the deepseek branch?
- Does the current V3 engine implementation support Blackwell?
Thanks for your assistance.
Update: It looks like only quantization.h in the main branch includes "NVFP4" and "FP4", so FP4 engine support appears to be missing from the deepseek branch (a sketch of the search is below).
That raises the question: how did NVIDIA get the 20k TPS result? Is there an internal version that isn't public yet?
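For reference, a minimal sketch of the search I ran (run it from the repo root of each branch checkout; the extension list is an assumption about where such identifiers live):

```python
# Sketch: scan a TensorRT-LLM checkout for FP4-related identifiers
# to see which branch actually carries them.
from pathlib import Path

NEEDLES = ("NVFP4", "FP4")
EXTENSIONS = {".h", ".hpp", ".cpp", ".cu", ".py"}

for path in Path(".").rglob("*"):
    if path.suffix not in EXTENSIONS or not path.is_file():
        continue
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue
    for needle in NEEDLES:
        if needle in text:
            print(f"{path}: {needle}")
            break
```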
@StarDuster @junliu-mde did you find a way to build deepseek-ai/DeepSeek-R1 into a TRT engine?
I didn't succeed on B200 because the deepseek branch doesn't support SM100 and above, but I think building with the deepseek branch on H100 should work (I didn't try it, since it isn't useful to me).
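For anyone unsure which architecture they're on, a quick check (assumes PyTorch with CUDA installed; B200 reports compute capability 10.0, i.e. SM100, while H100 reports 9.0):

```python
# Sketch: query the GPU's compute capability to see whether the
# deepseek branch's kernels (built for < SM100) can target it.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"sm{major}{minor}")  # e.g. "sm100" on B200, "sm90" on H100
if major >= 10:
    print("Blackwell-class GPU: the deepseek branch won't build kernels for this arch.")
```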
@junliu-mde I also hit the same SM100 issue and am asking for help here: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/144
Guys, thanks for the feedback. If you are on Blackwell silicon, please try building and running the DeepSeek-V3 torch flow following the instructions here; the deepseek branch is deprecated and is no longer updated with the new optimized kernels and perf improvements.
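For orientation, a minimal sketch of what the torch-flow LLM API looks like (the import path, model id, and parallelism below are assumptions based on the 0.17/0.18-era examples; follow the linked instructions for the exact entry point on your version):

```python
# Sketch of the PyTorch-flow quickstart; entry points have moved between
# releases, so treat the import below as an assumption, not a guarantee.
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM  # assumed PyTorch backend entry point

# Model id and tensor parallelism are illustrative; DeepSeek-V3 needs multi-GPU.
llm = LLM(model="deepseek-ai/DeepSeek-V3", tensor_parallel_size=8)
params = SamplingParams(max_tokens=64)

for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
```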
Thank you for the update. Is there a roadmap/timeline for TRT engine support on Blackwell? I think the PyTorch backend's performance is not that good.
hi @dominicshanshan I'm curious about the TRT-LLM roadmap. Will the PyTorch workflow become the new standard and recommended approach for deploying TRT-LLM models in the future, or is it primarily a temporary solution to ease the challenge of keeping up with the rapid advancements in community models, architectures, and techniques?
Update: never mind, I got the answer from another issue.
@handoku Can you please share the solution if you have it, or the answer from the other issue you mentioned?
https://github.com/NVIDIA/TensorRT-LLM/issues/2870 TL;DR: only the PyTorch workflow will be under development going forward.
I’m closing this issue due to its prolonged inactivity. I hope the comments above have addressed the questions. If the issue still exists in the latest release, please open a new issue. Thank you!