
How to identify the rest token latency?


System Info

  • CPU: INTEL RPL
  • GPU Name: NVIDIA GeForce RTX 3090
  • TensorRT-LLM: tensorrt_llm==0.11.0.dev2024060400
  • Container Used: Yes and reproduced in Conda as well
  • Driver Version: 555.42.02
  • CUDA Version: 12.5
  • OS: Ubuntu 24.04
  • Docker Img: nvidia/cuda:12.5.0-devel-ubuntu22.04

Who can help?

@kaiyux

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Run `benchmark.py` with `--input_output_len "1024,512"` and again with `--input_output_len "1024,1"` (see the sketch below).
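Presumably an invocation along these lines (the path and flags are reconstructed from the report below and may differ across versions, so check `benchmark.py --help`):

```
python benchmarks/python/benchmark.py \
    -m chatglm3_6b \
    --batch_size "1" \
    --input_output_len "1024,512"
```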

Expected behavior

as expected

actual behavior

Hi, here is the output of `--input_output_len "1024,512"`:

```
[BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 512 gpu_peak_mem(gb) 13.99 build_time(s) 11.05 tokens_per_sec 62.04 percentile95(ms) 8261.234 percentile99(ms) 8261.234 latency(ms) 8252.869 compute_cap sm86 quantization QuantMode.0 generation_time(ms) 8062.798 total_generated_tokens 511.0 generation_tokens_per_second 63.377
```

and here is the output of `--input_output_len "1024,1"`:

```
[BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 1 gpu_peak_mem(gb) 12.742 build_time(s) 0 tokens_per_sec 5.22 percentile95(ms) 192.648 percentile99(ms) 192.727 latency(ms) 191.697 compute_cap sm86 quantization QuantMode.0 generation_time(ms) 0.013 total_generated_tokens 0.0 generation_tokens_per_second 0.0
```

How can we get the rest (second and later) token latency? Is it `generation_time / total_generated_tokens = 8062.798 / 511 ≈ 15.78 ms`?
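In code, the calculation I have in mind is roughly this (my own sketch; I'm assuming `generation_time` covers everything after the first token):

```python
# Sketch of the proposed calculation, using the fields from the
# "1024,512" report above (field meanings assumed from their names).
generation_time_ms = 8062.798    # generation_time(ms) from the report
total_generated_tokens = 511     # total_generated_tokens from the report

rest_token_latency_ms = generation_time_ms / total_generated_tokens
print(f"rest-token latency: {rest_token_latency_ms:.2f} ms")  # ~15.78 ms
```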

Thanks! BR

additional notes

No

RobinJYM avatar Jun 11 '24 02:06 RobinJYM

Hi @kaiyux, would you please take a look at this question?

nv-guomingz avatar Jun 11 '24 05:06 nv-guomingz

Any insights?

RobinJYM avatar Jun 13 '24 01:06 RobinJYM

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Jul 13 '24 01:07 github-actions[bot]

Hi @RobinJYM, `generation_time` here means the latency of the generation stage, so if I understand the question correctly and you want the latency of the rest of the tokens (everything apart from the first token), you can simply use `generation_time` from the report.
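As a rough sanity check (my own sketch, using the numbers from your report and assuming `latency` covers the context plus generation stages):

```python
# If generation_time is the generation-stage latency only, then
# latency - generation_time should approximate the first-token latency,
# i.e. the latency of the "1024,1" run.
latency_ms = 8252.869           # latency(ms), --input_output_len "1024,512"
generation_time_ms = 8062.798   # generation_time(ms), same run

first_token_ms = latency_ms - generation_time_ms
print(f"first-token latency: {first_token_ms:.2f} ms")  # ~190.07 ms
# Compare: latency(ms) of the --input_output_len "1024,1" run is 191.697,
# which is indeed close, so generation_time / total_generated_tokens
# (~15.78 ms) is a reasonable per-token figure for the rest tokens.
```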

BTW, please note that gptSessionBenchmark is deprecated because we no longer recommend benchmarking static batching. Please use trtllm-bench or gptManagerBenchmark instead. We're actively working on the trtllm-bench command to make it stable and ready for reproducing performance numbers.

Please refer to perf-overview.md and cpp benchmark for more details. Thanks a lot for the support.

kaiyux avatar Nov 14 '24 06:11 kaiyux

Hi @RobinJYM, do you still have any further issues or questions? If not, we'll close this soon.

nv-guomingz avatar Nov 14 '24 08:11 nv-guomingz