Request for Reproduction Configuration of DeepSeek-R1 on H200 & B200
Hi @kaiyux,
We're curious about the details in this blog post: https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/
Specifically, could you share the configuration used to reproduce the results shown in the image below for H200 and B200?
Really appreciate the incredible work!
Best, Shirley
trtllm-serve nvidia/DeepSeek-R1-FP4 \
    --max_batch_size 256 --max_num_tokens 32768 \
    --max_seq_len 32768 --kv_cache_free_gpu_memory_fraction 0.95 \
    --host 0.0.0.0 --port 30001 --trust_remote_code \
    --backend pytorch --tp_size 8 --ep_size 8
However, on B200 the command above did not reach the 253 TPS reported in the blog post.
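For context, this is roughly how we checked per-request output throughput against the OpenAI-compatible endpoint that trtllm-serve exposes. It is only a single-request sketch; the prompt and max_tokens below are arbitrary placeholders, not the ISL/OSL and concurrency settings used for the blog's benchmark.

# Rough single-request throughput check against the running trtllm-serve instance.
# Host, port, and model name mirror the launch command above; everything else is a placeholder.
import time
import requests

URL = "http://0.0.0.0:30001/v1/completions"
payload = {
    "model": "nvidia/DeepSeek-R1-FP4",
    "prompt": "Explain the significance of FP4 quantization for LLM inference.",
    "max_tokens": 1024,  # placeholder output length, not the benchmark OSL
}

start = time.time()
resp = requests.post(URL, json=payload).json()
elapsed = time.time() - start

# The OpenAI-compatible response includes token usage, so tokens/s per request
# can be computed directly from completion_tokens and wall-clock time.
completion_tokens = resp["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s per request")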
@Edwardf0t1 You uploaded the model weights at https://huggingface.co/nvidia/DeepSeek-R1-FP4/tree/main. Would you know the deployment configuration used with trtllm-serve? Thank you :)
@kaiyux @Kefeng-Duan for visibility on this question from the community. @laikhtewari for visibility as well.
June
Hi @xwuShirley, thanks for your interest. There are some changes we haven't updated to the main branch yet; we will keep you posted.
Please refer to this comment from another community member:
- https://github.com/NVIDIA/TensorRT-LLM/issues/3058#issuecomment-2753688626
Closing based on https://github.com/NVIDIA/TensorRT-LLM/issues/2964#issuecomment-2754585600. Feel free to reopen 👍