
perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench

suyoggupta opened this pull request 9 months ago • 18 comments

  1. This PR enables the integration of trtllm-bench with AutoDeploy.
  2. Adds a feature to the AutoDeploy inference optimizer that inflates the KV caches to the available GPU memory, which helps improve token throughput (see the sketch below).
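
For context on item 2, here is a minimal sketch of what "inflate the KV caches to the available GPU memory" can look like. The function name, the model-shape defaults, and the 0.9 free-memory fraction are illustrative assumptions, not the actual AutoDeploy implementation:

```python
import torch

def estimate_kv_cache_blocks(
    free_mem_fraction: float = 0.9,   # fraction of free GPU memory to claim (assumed)
    num_layers: int = 32,             # illustrative Llama-3.1-8B-like shapes
    num_kv_heads: int = 8,
    head_dim: int = 128,
    tokens_per_block: int = 64,       # paged-KV block granularity (assumed)
    dtype_bytes: int = 2,             # fp16/bf16
) -> int:
    """Estimate how many paged KV-cache blocks fit in the free GPU memory."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    budget = int(free_bytes * free_mem_fraction)

    # One block stores K and V for every layer: 2 * layers * kv_heads * head_dim
    # values per token, times tokens_per_block tokens, times bytes per value.
    bytes_per_block = (
        2 * num_layers * num_kv_heads * head_dim * tokens_per_block * dtype_bytes
    )
    return budget // bytes_per_block
```

The idea is that the executor allocates roughly as many blocks as fit, rather than a small fixed default, so the throughput benchmark can keep more requests in flight.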

The next step is to close the perf gap between the AutoDeploy and PyTorch backends (the gap is quantified right after the numbers below). Current results: max throughput for Llama 3.1 8B, ISL/OSL = 128/128, FP16, on H100, with AutoDeploy using the FlashInfer attention backend:

Request Throughput (req/sec):                     67.6804
Total Output Throughput (tokens/sec):             8663.0915
Per User Output Throughput (tokens/sec/user):     9.0542
Per GPU Output Throughput (tokens/sec/gpu):       8663.0915
Total Latency (ms):                               14775.3258
Average request latency (ms):                     14141.4894

PyTorch:

Request Throughput (req/sec):                     93.2541
Total Output Throughput (tokens/sec):             11936.5301
Per User Output Throughput (tokens/sec/user):     12.6582
Per GPU Output Throughput (tokens/sec/gpu):       11936.5301
Total Latency (ms):                               10723.3844
Average request latency (ms):                     10115.3223
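
For a quick read on the size of the gap (computed from the two runs above, not part of the benchmark output):

```python
# Throughput gap computed from the numbers reported above.
autodeploy_tps = 8663.0915
pytorch_tps = 11936.5301

ratio = autodeploy_tps / pytorch_tps        # ~0.726
gap_pct = (1.0 - ratio) * 100.0             # ~27.4% lower total output throughput
print(f"AutoDeploy reaches {ratio:.1%} of the PyTorch backend ({gap_pct:.1f}% gap)")
```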

suyoggupta avatar Mar 24 '25 21:03 suyoggupta

cc @kaiyux to make sure he is aware of the addition of AutoDeploy as another backend of trtllm-bench.

Thanks June

juney-nvidia avatar Mar 24 '25 22:03 juney-nvidia

/bot run

suyoggupta avatar Mar 25 '25 02:03 suyoggupta

PR_Github #355 [ run ] triggered by Bot

niukuo avatar Mar 25 '25 02:03 niukuo

PR_Github #355 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #324 completed with status: 'FAILURE'

niukuo avatar Mar 25 '25 02:03 niukuo

/bot run

suyoggupta avatar Mar 25 '25 05:03 suyoggupta

PR_Github #376 [ run ] triggered by Bot

niukuo avatar Mar 25 '25 05:03 niukuo

PR_Github #376 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #339 completed with status: 'FAILURE'

niukuo avatar Mar 25 '25 06:03 niukuo

The trtllm-bench part looks good to me.

@suyoggupta Is it possible to split the PR so that it only includes the changes that "enable the integration of trtllm-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).

kaiyux avatar Mar 25 '25 06:03 kaiyux

> The trtllm-bench part looks good to me.
>
> @suyoggupta Is it possible to split the PR so that it only includes the changes that "enable the integration of trtllm-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).

@kaiyux: The changes in AutoDeploy are needed to ensure the throughput benchmark in trtllm-bench can run without OOM. They ensure that the AutoDeploy executor allocates a large enough KV cache, as required by the throughput test. So unfortunately, these changes have to be bundled together.
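
As a rough illustration of "large enough as required by the throughput test": the cache has to hold ISL + OSL tokens for every request the scheduler keeps in flight. The batch size, block size, and allocated-block count below are hypothetical placeholders, not values from this PR:

```python
# Rough sanity check that the KV cache can hold all in-flight requests.
isl, osl = 128, 128            # sequence lengths from the benchmark setup above
max_batch_size = 1024          # hypothetical scheduler concurrency
tokens_per_block = 64          # hypothetical paged-KV block size

required_blocks = -(-(max_batch_size * (isl + osl)) // tokens_per_block)  # ceil div

# `allocated_blocks` would come from the memory-based sizing described in the
# PR description; a fixed, too-small default here is what makes the run fail.
allocated_blocks = 8192        # placeholder value for illustration
assert allocated_blocks >= required_blocks, (
    "KV cache too small for the throughput benchmark"
)
```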

suyoggupta avatar Mar 25 '25 15:03 suyoggupta

/bot run

suyoggupta avatar Mar 25 '25 17:03 suyoggupta

PR_Github #455 [ run ] triggered by Bot

niukuo avatar Mar 25 '25 17:03 niukuo

PR_Github #455 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #390 completed with status: 'FAILURE'

niukuo avatar Mar 25 '25 18:03 niukuo

/bot run

suyoggupta avatar Mar 25 '25 20:03 suyoggupta

PR_Github #465 [ run ] triggered by Bot

niukuo avatar Mar 25 '25 20:03 niukuo

PR_Github #465 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #398 completed with status: 'FAILURE'

niukuo avatar Mar 25 '25 22:03 niukuo

/bot run

suyoggupta avatar Mar 25 '25 22:03 suyoggupta

PR_Github #475 [ run ] triggered by Bot

niukuo avatar Mar 25 '25 23:03 niukuo

PR_Github #475 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #408 completed with status: 'FAILURE'

niukuo avatar Mar 26 '25 00:03 niukuo

/bot run

suyoggupta avatar Mar 26 '25 02:03 suyoggupta

PR_Github #496 [ run ] triggered by Bot

niukuo avatar Mar 26 '25 02:03 niukuo

PR_Github #496 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #428 completed with status: 'FAILURE'

niukuo avatar Mar 26 '25 04:03 niukuo

/bot run

suyoggupta avatar Mar 26 '25 04:03 suyoggupta

PR_Github #514 [ run ] triggered by Bot

niukuo avatar Mar 26 '25 04:03 niukuo

> The trtllm-bench part looks good to me.
>
> @suyoggupta Is it possible to split the PR so that it only includes the changes that "enable the integration of trtllm-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).

> @kaiyux: The changes in AutoDeploy are needed to ensure the throughput benchmark in trtllm-bench can run without OOM. They ensure that the AutoDeploy executor allocates a large enough KV cache, as required by the throughput test. So unfortunately, these changes have to be bundled together.

If that's the case, maybe we should merge those changes into the core library first? There shouldn't be features in the core library that are "required by" the bench scripts; does that make sense?

Though I agree that it saves some overhead to avoid submitting another PR, I just wanted to clarify and align on the principle here. Thanks.

kaiyux avatar Mar 26 '25 06:03 kaiyux

PR_Github #514 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #439 completed with status: 'SUCCESS'

niukuo avatar Mar 26 '25 07:03 niukuo

/bot run

kaiyux avatar Mar 26 '25 08:03 kaiyux

PR_Github #548 [ run ] triggered by Bot

niukuo avatar Mar 26 '25 08:03 niukuo

PR_Github #548 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #466 completed with status: 'SUCCESS'

niukuo avatar Mar 26 '25 11:03 niukuo

/bot run

suyoggupta avatar Mar 26 '25 17:03 suyoggupta

PR_Github #612 [ run ] triggered by Bot

tensorrt-cicd avatar Mar 26 '25 17:03 tensorrt-cicd