Why does a larger KV-cache memory allocation decrease inference performance for short prompts?
System Info
NVIDIA-H100
Who can help?
@kaiyux
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```bash
trtllm-build --checkpoint_dir ./model/llama3_32k/fp16/4-gpu \
    --gpt_attention_plugin float16 \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir engines/fp16/4-gpu
```
Then follow the LLaMA example to run the Triton server (https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md) and test inference performance with different KV_CACHE_FREE_GPU_MEM_FRACTION values, as sketched below.
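For reference, a minimal sketch of how the fraction can be varied between runs, assuming the default inflight_batcher_llm model layout, the tools/fill_template.py helper, and scripts/launch_triton_server.py from this repo; the batch size, decoupled mode, and batching strategy values here are placeholders, not the exact settings used:

```bash
# Set the fraction of free GPU memory reserved for the KV cache in the
# tensorrt_llm model config (only kv_cache_free_gpu_mem_fraction is the knob
# under test; the other values are illustrative placeholders).
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    "triton_max_batch_size:64,decoupled_mode:True,engine_dir:engines/fp16/4-gpu,batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.5"

# Launch the 4-GPU server, then repeat the whole test with e.g. 0.5 vs 0.9.
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=all_models/inflight_batcher_llm
```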
I notice that a larger KV-cache fraction improves inference performance for long prompts, but it decreases performance for short prompts; it should not behave that way (see the rough timing sketch below).
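A rough sketch of how the short- vs. long-prompt comparison can be timed; it assumes the ensemble model and the Triton HTTP generate endpoint on port 8000 as set up in the LLaMA guide, GNU time, and placeholder prompts and token counts:

```bash
# Time one short-prompt and one long-prompt request; repeat with
# KV_CACHE_FREE_GPU_MEM_FRACTION set to e.g. 0.5 and then 0.9.
SHORT_PROMPT="What is machine learning?"
LONG_PROMPT="$(python3 -c "print('Summarize the following text. ' + 'lorem ipsum ' * 2000)")"

for PROMPT in "$SHORT_PROMPT" "$LONG_PROMPT"; do
  # GNU time prints wall-clock seconds for each request.
  /usr/bin/time -f "%e s" curl -s -o /dev/null -X POST localhost:8000/v2/models/ensemble/generate \
    -d "{\"text_input\": \"$PROMPT\", \"max_tokens\": 128, \"bad_words\": \"\", \"stop_words\": \"\"}"
done
```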
Expected behavior
With larger KV-cache memory, inference performance should improve for long prompts, and short-prompt performance should not be affected.
actual behavior
With larger KV-cache memory, inference performance improves for long prompts but decreases for short prompts.
additional notes
The build and run steps follow https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md