Request for Reproduction Configuration of DeepSeek-R1 on H200 & B200
Hi @kaiyux,
We're curious about the details in this blog post: https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/
Specifically, could you share the configuration used to reproduce the results shown in the image below for H200 and B200?
Really appreciate the incredible work!
Best, Shirley
trtllm-serve nvidia/DeepSeek-R1-FP4 \
    --max_batch_size 256 --max_num_tokens 32768 \
    --max_seq_len 32768 --kv_cache_free_gpu_memory_fraction 0.95 \
    --host 0.0.0.0 --port 30001 --trust_remote_code \
    --backend pytorch --tp_size 8 --ep_size 8
However, on B200 the command above did not reach the 253 TPS reported in the blog post.
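For context, this is roughly how we checked per-request output throughput against the OpenAI-compatible endpoint that trtllm-serve exposes. It is only a single-request sketch; the prompt and max_tokens below are arbitrary placeholders, not the ISL/OSL and concurrency settings used for the blog's benchmark.

# Rough single-request throughput check against the running trtllm-serve instance.
# Host, port, and model name mirror the launch command above; everything else is a placeholder.
import time
import requests

URL = "http://0.0.0.0:30001/v1/completions"
payload = {
    "model": "nvidia/DeepSeek-R1-FP4",
    "prompt": "Explain the significance of FP4 quantization for LLM inference.",
    "max_tokens": 1024,  # placeholder output length, not the benchmark OSL
}

start = time.time()
resp = requests.post(URL, json=payload).json()
elapsed = time.time() - start

# The OpenAI-compatible response includes token usage, so tokens/s per request
# can be computed directly from completion_tokens and wall-clock time.
completion_tokens = resp["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s per request")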
@Edwardf0t1 You uploaded the model weights at https://huggingface.co/nvidia/DeepSeek-R1-FP4/tree/main. Would you know the deployment configuration used with trtllm-serve? Thank you :)
@kaiyux @Kefeng-Duan for visibility on this question from the community. @laikhtewari for visibility as well.
June
Hi @xwuShirley, thanks for your interest. There are some changes we haven't updated to the main branch yet; we will keep you posted.
Please refer to this comment from another community member:
- https://github.com/NVIDIA/TensorRT-LLM/issues/3058#issuecomment-2753688626
Closing based on https://github.com/NVIDIA/TensorRT-LLM/issues/2964#issuecomment-2754585600. Feel free to reopen 👍