
How to achieve 253 tok/sec with DeepSeek-R1-FP4 on 8xB200

Open jeffye-dev opened this issue 9 months ago • 15 comments

I want to reproduce the DeepSeek-R1-FP4 deployment on B200 to match the results in this blog: https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance

However, I only get about 40 output tokens per second per user, compared with the 253 mentioned in the blog, which is a huge gap. Here is my deployment method on B200:

  1. The latest official image (nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3) does not support the DeepSeek-V3 model, so I had to download the latest source code from the main branch of https://github.com/triton-inference-server/tensorrtllm_backend.git and build it from scratch with:
DOCKER_BUILDKIT=1 docker build -t tritonserver_trtllm -f dockerfile/Dockerfile.triton.trt_llm_backend .
  2. Set up the docker container and launch trtllm-serve inside the container:
echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true\n print_iter_log: true\n use_cuda_graph: true\n cuda_graph_padding_enabled: true\n cuda_graph_batch_sizes: [1, 512]" > extra-llm-api-config.yml
trtllm-serve nvidia/DeepSeek-R1-FP4 --backend pytorch --max_batch_size 512 --max_num_tokens 1560 --tp_size 8 --pp_size 1 --ep_size 8 --kv_cache_free_gpu_memory_fraction 0.90 --extra_llm_api_options ./extra-llm-api-config.yml

With the engine up, I constructed hundreds of requests with input length 1000 and output length 1000 and sent them to the engine at different batch sizes; the average output speed was about 40 tokens per second per request (an example request is sketched below).
  3. I also ran trtllm-bench following the official doc: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/deepseek_v3/README.md#running-the-benchmark . With that I only got 7 output tokens per second per user.
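For reference, a manual request of the kind described above might look like the following. This is only a sketch: the OpenAI-compatible /v1/completions endpoint and the default port 8000 are assumptions, and the original payloads are not shown in the report.

# Hypothetical example request; endpoint path and port are assumptions, not taken from the report above.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/DeepSeek-R1-FP4",
        "prompt": "<roughly 1000 tokens of prompt text>",
        "max_tokens": 1000,
        "temperature": 0.0
      }'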

(screenshot attached in the original issue)

Is my method wrong? What is the correct method and configuration to serve DeepSeek-R1-FP4 on 8xB200? Any help is appreciated!

jeffye-dev avatar Mar 25 '25 07:03 jeffye-dev

@jeffye-dev

Hi,

The 253 perf number was generated with the trtllm-bench command. Also, to achieve it, some required MRs are still being prepared for merging into the main branch.

When they are ready, concrete reproduction steps will also be shared.

cc @Kefeng-Duan for visibility on this ask from the community.

Thanks June

juney-nvidia avatar Mar 25 '25 08:03 juney-nvidia

Hi @jeffye-dev, as Juney said, some MRs are still being prepared. Besides that, our 253 TPS is for the min-latency case (batch=1):

  1. please set --max_batch_size=512
  2. please use real data, better with ISL/OSL = 1K/2K
  3. please disable enable_attention_dp
  4. please don't print iteration log
  5. please set MTP's nextn = 3
  6. please use --ep_size 4
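Putting these suggestions together, the launch from the first comment might be adjusted roughly as follows. This is only a sketch: the speculative_config / num_nextn_predict_layers keys used to express "MTP nextn = 3" are assumptions and may be named differently depending on the TensorRT-LLM version you build, so check against your source tree.

# Sketch only: the MTP-related keys below are assumptions, not confirmed in this thread.
cat > extra-llm-api-config.yml <<'EOF'
enable_attention_dp: false
pytorch_backend_config:
  enable_overlap_scheduler: true
  print_iter_log: false
  use_cuda_graph: true
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
EOF

# items 1 and 6: keep --max_batch_size 512, change to --ep_size 4
trtllm-serve nvidia/DeepSeek-R1-FP4 --backend pytorch \
  --max_batch_size 512 --max_num_tokens 1560 \
  --tp_size 8 --pp_size 1 --ep_size 4 \
  --kv_cache_free_gpu_memory_fraction 0.90 \
  --extra_llm_api_options ./extra-llm-api-config.yml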

Kefeng-Duan avatar Mar 25 '25 09:03 Kefeng-Duan

Thanks for the explanations. So I have to wait until the MRs are merged and then use the correct configuration? BTW, enable_attention_dp=false might cause GPU hangs in my case.

jeffye-dev avatar Mar 25 '25 11:03 jeffye-dev

So I have to wait until the MRs are merged and then use the correct configuration?

Yes.

BTW, enable_attention_dp=false might cause GPU hangs in my case

Please create another GitHub issue to report this explicitly, with concrete reproduction steps, and we will follow up.

Thanks June

juney-nvidia avatar Mar 25 '25 12:03 juney-nvidia

When will these MRs be merged? I'd like to try this as soon as possible. It would also help to have a document on reproducing the performance. @juney-nvidia @Kefeng-Duan

jeffye-dev avatar Mar 26 '25 02:03 jeffye-dev

Hi @jeffye-dev, as Juney said, some MRs are still being prepared. Besides that, our 253 TPS is for the min-latency case (batch=1):

  1. please set --max_batch_size=512
  2. please use real data, better with ISL/OSL = 1K/2K
  3. please disable enable_attention_dp
  4. please don't print iteration log
  5. please set MTP's nextn = 3
  6. please use --ep_size 4

@jeffye-dev

  7. please do not capture cuda_graph with such a large batch size (512); for our case, 1 is OK
  8. please set the env: sudo nvidia-smi -pm 0; sudo nvidia-smi -pm 1; sudo nvidia-smi boost-slider --vboost 4
  9. please enable PDL by: export TRTLLM_ENABLE_PDL=1
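A combined sketch of items 7-9: item 7 is a one-line change to the extra-llm-api-config.yml shown earlier (key name taken from the first comment), while items 8 and 9 are host-side commands quoted from the list above.

# item 7: in extra-llm-api-config.yml, capture CUDA graphs for batch size 1 only, e.g.
#   cuda_graph_batch_sizes: [1]    # instead of [1, 512]

# item 8: cycle persistence mode and raise the vboost slider
sudo nvidia-smi -pm 0
sudo nvidia-smi -pm 1
sudo nvidia-smi boost-slider --vboost 4

# item 9: enable PDL in the environment that launches the server/benchmark
export TRTLLM_ENABLE_PDL=1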

We have tried our best to push all the MRs plus the reproduction doc to GitHub recently. I think you can start trying it now; even without the best perf (253), you will see some good perf based on the latest code.

Thanks

Kefeng-Duan avatar Mar 26 '25 05:03 Kefeng-Duan

Thanks @Kefeng-Duan for the assistance. I made the changes accordingly and ran some tests using trtllm-bench. Now I get a very close result: 207 tok/sec/user when setting batch=1. If I set batch=10, the throughput is 127 tok/sec/user.

jeffye-dev avatar Mar 26 '25 09:03 jeffye-dev

Thanks @Kefeng-Duan for the assistance. I made the changes accordingly and ran some tests using trtllm-bench. Now I get a very close result: 207 tok/sec/user when setting batch=1. If I set batch=10, the throughput is 127 tok/sec/user.

Great to know this, @jeffye-dev . I believe you will see better perf with TensorRT-LLM in the upcoming weeks :)

Thanks June

juney-nvidia avatar Mar 26 '25 11:03 juney-nvidia

Why does 253 tok/sec require MTP to be enabled? If MTP is allowed, A100 can easily outperform B200.

Hi @ghostplant

Do you mean that on A100, with MTP, it can also achieve up to 253 tok/sec?

If this is what you mean, can you share more details of the A100 benchmark workflow? We are happy to do the performance cross-check to learn more.

Thanks June

juney-nvidia avatar Mar 26 '25 16:03 juney-nvidia

@juney-nvidia Oh, we can just enlarge MTP and assume a 100% success ratio; then it would be very close to running large-batch inference, which is already above 500 tok/sec on A100x8.

Thanks, so what is your benchmarked number without MTP on A100x8? And what is the precision?

June

juney-nvidia avatar Mar 26 '25 23:03 juney-nvidia

The bsz=1 and MTP=0 number would be far below that; that's why 100%-success MTP helps a lot. But I am not sure how this question relates to the topic: we are seeking a faster solution like "253 tok/sec with DeepSeek-R1-FP4 on 8xB200" because our own solution cannot outperform it without a large MTP and batch size.

ghostplant avatar Mar 27 '25 01:03 ghostplant

So the point is MTP=3? It would be nice to have the MTP feature in production unless MTP decreases accuracy. I cannot see the acceptance rate at runtime, so it's hard to judge how much MTP speeds up performance. Probably we need an A/B test. Metrics are important for analyzing the engine's real behavior, so I am also looking forward to metrics-related features.

jeffye-dev avatar Mar 27 '25 02:03 jeffye-dev

So the point is MTP=3? It would be nice to have the MTP feature in production unless MTP decreases accuracy. I cannot see the acceptance rate at runtime, so it's hard to judge how much MTP speeds up performance. Probably we need an A/B test. Metrics are important for analyzing the engine's real behavior, so I am also looking forward to metrics-related features.

In theory MTP will not hurt accuracy; if it does, then it is a bug. The AR (acceptance rate) should be printable with trtllm-bench; @lfr-0531 can chime in to keep me honest.

June

juney-nvidia avatar Mar 28 '25 00:03 juney-nvidia

Sorry for the late response. trtllm-bench still cannot report the correct AR right now. We will fix it in the near future.

Thanks, Fanrong

lfr-0531 avatar Mar 31 '25 09:03 lfr-0531

@ghostplant to clarify, the 253 here means 253 TPS per user, which represents a single user's experience; a larger batch hurts this. Our measurement is limited to bsz=1 and MTP=3.

Another thing is that a 100% success ratio is too ideal to be achieved; that's the reason we chose MTP=3.

Thanks, Kefeng

Kefeng-Duan avatar Mar 31 '25 10:03 Kefeng-Duan

@ghostplant

  1. the success ratio is not from 'assumption' but from 'measurement'.
  2. our TPS definition is here: https://github.com/NVIDIA/TensorRT-LLM/blob/727d78e785da96ce8ac28f920841819efaeee220/tensorrt_llm/bench/dataclasses/reporting.py#L314 (output throughput = total output (OSL) tokens / end-to-end latency). This means we don't count prefill tokens in the TPS calculation, but we do include the prefill time.

3. BTW, it looks like the 253 TPS includes prefilling. Based on the same unit, MI300x seems to get 406 TPS. Does this mean B200x8 is 40% slower than MI300x8?

Do you know the source of the 406 TPS announcement? It is unfair to compare the two without syncing up on batch/ISL/OSL and the statistics methodology.
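To make the TPS definition in point 2 above concrete with purely illustrative numbers (not a measured result): for a single user (bsz=1) with ISL/OSL = 1K/2K, if the whole request, prefill plus decode, finishes in 8 seconds end to end, the reported per-user TPS would be 2000 output tokens / 8 s = 250 tok/sec. The 1000 prompt tokens do not enter the numerator, but the time spent prefilling them is part of the 8 s denominator.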

Kefeng-Duan avatar Apr 01 '25 07:04 Kefeng-Duan

Hi, I have a question about tp_size and ep_size in trtllm: for one B200 node with 8 GPUs, if I set --tp_size 8 --ep_size 8, what does this ep_size do? Usually the GPU count should be tp * ep? I cannot find a document about this, thanks!

ltm920716 avatar Apr 10 '25 07:04 ltm920716

@ltm920716 sorry for the confusion. --tp_size 8 --ep_size 8 here means TP8 for the attention module but EP8 for the MoE module, i.e. TP1/EP8 for the MoE module.
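In other words, ep_size is applied inside the TP group for the MoE layers, so the world size stays at 8 GPUs rather than tp * ep = 64. A sketch of the mapping described above (the per-layer layout is an interpretation of that statement, not an official mapping document):

#   --tp_size 8 --ep_size 8 on a single 8-GPU node
#     attention layers : TP = 8          (attention sharded 8 ways across the GPUs)
#     MoE layers       : EP = 8, TP = 1  (each GPU holds 1/8 of the experts, unsplit)
#   world size = tp_size = 8 GPUs, not tp_size * ep_size = 64
trtllm-serve nvidia/DeepSeek-R1-FP4 --backend pytorch --tp_size 8 --pp_size 1 --ep_size 8 \
  --extra_llm_api_options ./extra-llm-api-config.yml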

Kefeng-Duan avatar Apr 10 '25 07:04 Kefeng-Duan

@jeffye-dev @ghostplant the reproduction document is almost ready here: https://github.com/NVIDIA/TensorRT-LLM/pull/3232/files

Kefeng-Duan avatar Apr 10 '25 07:04 Kefeng-Duan

@ltm920716 sorry for the confusion. --tp_size 8 --ep_size 8 here means TP8 for the attention module but EP8 for the MoE module, i.e. TP1/EP8 for the MoE module.

Thanks @Kefeng-Duan, so does tp=1 or tp=8 make much difference in throughput?

ltm920716 avatar Apr 10 '25 07:04 ltm920716

@ltm920716 if you care about throughput, I recommend enabling attention DP; when it is enabled, the attention module will be DP8/TP1, and the MoE module EP8/TP1.
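For reference, this is the same setting already used in the first comment's YAML; a minimal throughput-oriented sketch of it:

# Throughput-oriented sketch: with attention DP enabled, attention runs as DP8/TP1 and MoE as EP8/TP1
cat > extra-llm-api-config.yml <<'EOF'
enable_attention_dp: true
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: true
EOF
trtllm-serve nvidia/DeepSeek-R1-FP4 --backend pytorch \
  --max_batch_size 512 --max_num_tokens 1560 \
  --tp_size 8 --pp_size 1 --ep_size 8 \
  --kv_cache_free_gpu_memory_fraction 0.90 \
  --extra_llm_api_options ./extra-llm-api-config.yml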

Kefeng-Duan avatar Apr 10 '25 07:04 Kefeng-Duan

I noticed that the max_num_tokens in the example is quite low. Due to the lack of chunked prefill, my understanding is that the entire context must fit within max_num_tokens. For example, if I have a use case that requires 50k context, I would need max_num_tokens = 50000. Is that right? Has anyone been successful at getting large values of max_num_tokens to work on FP8?

Another thing that makes optimizing memory usage difficult is that each trtllm rank seems to allocate about 500MB on GPU 0 (you can see this behavior in nvidia-smi below). At tp=8 that means roughly 8x500MB of excess memory sits on GPU 0. Combined with other memory spikes during loading, it is difficult for me to even get the model to start or accept any requests at a moderately large max_num_tokens without running out of memory. Thanks.

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2216226      C   python3                                       590MiB |
|    0   N/A  N/A   2216619      C   /usr/bin/python3                              590MiB |
|    0   N/A  N/A   2216620      C   /usr/bin/python3                              590MiB |
|    0   N/A  N/A   2216621      C   /usr/bin/python3                              590MiB |
|    0   N/A  N/A   2216622      C   /usr/bin/python3                              590MiB |
|    0   N/A  N/A   2216623      C   /usr/bin/python3                              540MiB |
|    0   N/A  N/A   2216624      C   /usr/bin/python3                              520MiB |
|    0   N/A  N/A   2216625      C   /usr/bin/python3                              590MiB |
|    0   N/A  N/A   2216626      C   /usr/bin/python3                              590MiB |
|    1   N/A  N/A   2216620      C   /usr/bin/python3                              520MiB |
|    2   N/A  N/A   2216621      C   /usr/bin/python3                              520MiB |
|    3   N/A  N/A   2216622      C   /usr/bin/python3                              520MiB |
|    4   N/A  N/A   2216623      C   /usr/bin/python3                              520MiB |
|    5   N/A  N/A   2216624      C   /usr/bin/python3                              520MiB |
|    6   N/A  N/A   2216625      C   /usr/bin/python3                              520MiB |
|    7   N/A  N/A   2216626      C   /usr/bin/python3                              520MiB |
+-----------------------------------------------------------------------------------------+

This should probably be its own issue, but I was curious whether anyone else has run into this when loading DeepSeek and if there is a mitigation. On H200, the weights take a sizable fraction of the whole memory, so I need every spare gigabyte I can get for the KV cache / max_num_tokens.

pathorn avatar Apr 10 '25 08:04 pathorn

Hi team, can someone please share the built docker images for the ease of reproducing? I am also trying to reproduce on h200. Really appreciate it!

yubofredwang avatar Apr 21 '25 21:04 yubofredwang

Testing with a 1k input sequence length is unrealistic; chat, RAG, agent, and basically any real use case will hit the max_num_tokens limit. And it appears that max_num_tokens has a significant impact on latency and throughput.

WingEdge777 avatar Aug 14 '25 12:08 WingEdge777

@WingEdge777 did you set stop_token_ids?

geraldstanje1 avatar Dec 08 '25 18:12 geraldstanje1

@yubofredwang , I hope you’ve found the information you were looking for, but I’m sharing a more recent one: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release

karljang avatar Dec 12 '25 22:12 karljang