perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench
- This PR enables the integration of TRTLLM-bench with AutoDeploy.
- Adds a feature to the AutoDeploy inference optimizer that inflates the KV caches to fill the available GPU memory, which helps improve token throughput (a rough sketch of the idea follows below).
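For illustration, here is a minimal sketch of free-memory-based KV-cache sizing. The helper name and parameters are hypothetical and not the actual AutoDeploy API; it only shows the general idea of deriving the cache size from the free GPU memory reported by the runtime:

```python
import torch

def estimate_kv_cache_pages(bytes_per_page: int, free_mem_fraction: float = 0.9) -> int:
    """Illustrative only: size the KV cache from currently free GPU memory.

    bytes_per_page: bytes needed for one KV-cache page (all layers, K and V).
    free_mem_fraction: share of free memory handed to the cache, leaving
        headroom for activations and workspace.
    """
    free_bytes, _total_bytes = torch.cuda.mem_get_info()  # (free, total) for the current device
    budget = int(free_bytes * free_mem_fraction)
    return max(budget // bytes_per_page, 1)
```

Growing the cache like this after the model weights are loaded lets the throughput benchmark keep many requests in flight instead of being capped by a small default cache.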
The next step is to close the performance gap between the AutoDeploy and PyTorch backends. Current results: max throughput for Llama 3.1 8B, ISL/OSL = 128/128, FP16, on H100; AutoDeploy uses the FlashInfer attention backend.
| Metric | AutoDeploy (FlashInfer) | PyTorch |
|---|---|---|
| Request Throughput (req/sec) | 67.6804 | 93.2541 |
| Total Output Throughput (tokens/sec) | 8663.0915 | 11936.5301 |
| Per User Output Throughput (tokens/sec/user) | 9.0542 | 12.6582 |
| Per GPU Output Throughput (tokens/sec/gpu) | 8663.0915 | 11936.5301 |
| Total Latency (ms) | 14775.3258 | 10723.3844 |
| Average request latency (ms) | 14141.4894 | 10115.3223 |
cc @kaiyux to make sure he is aware of the addition of AutoDeploy as another backend for trtllm-bench.
Thanks June
/bot run
PR_Github #355 [ run ] triggered by Bot
PR_Github #355 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #324 completed with status: 'FAILURE'
/bot run
PR_Github #376 [ run ] triggered by Bot
PR_Github #376 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #339 completed with status: 'FAILURE'
The trtllm-bench part looks good to me.
@suyoggupta Is it possible to split the PR so that it only includes the changes that "enable the integration of TRTLLM-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).
@kaiyux: The changes in AutoDeploy are needed to ensure the throughput benchmark in trtllm-bench can run without OOM. Those changes ensure that the AutoDeploy executor allocates a KV cache large enough for the throughput test. So unfortunately, these changes have to be bundled together.
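For rough context on the memory involved (back-of-the-envelope numbers assuming 32 layers, 8 KV heads, and head dim 128 for Llama 3.1 8B in FP16; these figures are illustrative, not measured in this PR):

```python
# Approximate KV-cache footprint for Llama 3.1 8B in FP16 (assumed model config).
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
print(kv_bytes_per_token)          # 131072 bytes, i.e. ~128 KiB per cached token

# At ISL/OSL = 128/128, each in-flight request holds up to 256 tokens in the cache.
per_request_mib = 256 * kv_bytes_per_token / 2**20
print(per_request_mib)             # ~32 MiB per in-flight request
```

Sustaining hundreds of concurrent requests therefore needs tens of GiB of KV cache, which is why the executor has to grow the cache toward the available GPU memory rather than rely on a small fixed default.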
/bot run
PR_Github #455 [ run ] triggered by Bot
PR_Github #455 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #390 completed with status: 'FAILURE'
/bot run
PR_Github #465 [ run ] triggered by Bot
PR_Github #465 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #398 completed with status: 'FAILURE'
/bot run
PR_Github #475 [ run ] triggered by Bot
PR_Github #475 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #408 completed with status: 'FAILURE'
/bot run
PR_Github #496 [ run ] triggered by Bot
PR_Github #496 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #428 completed with status: 'FAILURE'
/bot run
PR_Github #514 [ run ] triggered by Bot
If that's the case, maybe we should merge those changes into the core library first? There shouldn't be core features that are "required by" the bench scripts, does that make sense?
Though I agree it saves some overhead to avoid submitting another PR; I just wanted to clarify and align on the principle here. Thanks.
PR_Github #514 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #439 completed with status: 'SUCCESS'
/bot run
PR_Github #548 [ run ] triggered by Bot
PR_Github #548 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #466 completed with status: 'SUCCESS'
/bot run
PR_Github #612 [ run ] triggered by Bot