Why does a larger KV-cache memory allocation decrease inference performance for short prompts?
System Info
NVIDIA-H100
Who can help?
@kaiyux
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```bash
trtllm-build --checkpoint_dir ./model/llama3_32k/fp16/4-gpu \
    --gpt_attention_plugin float16 \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir engines/fp16/4-gpu
```
Then follow the LLaMA example to run the Triton server (https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md) and test inference performance with different KV_CACHE_FREE_GPU_MEM_FRACTION values, as sketched below.
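For reference, a minimal sketch of how the fraction can be varied between runs, assuming the default inflight_batcher_llm model layout, the tools/fill_template.py helper, and scripts/launch_triton_server.py from this repo; the batch size, decoupled mode, and batching strategy values here are placeholders, not the exact settings used:

```bash
# Set the fraction of free GPU memory reserved for the KV cache in the
# tensorrt_llm model config (only kv_cache_free_gpu_mem_fraction is the knob
# under test; the other values are illustrative placeholders).
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    "triton_max_batch_size:64,decoupled_mode:True,engine_dir:engines/fp16/4-gpu,batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.5"

# Launch the 4-GPU server, then repeat the whole test with e.g. 0.5 vs 0.9.
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=all_models/inflight_batcher_llm
```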
I notice that a larger KV-cache fraction improves inference performance for long prompts, but it decreases performance for short prompts; it should not behave that way (see the rough timing sketch below).
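A rough sketch of how the short- vs. long-prompt comparison can be timed; it assumes the ensemble model and the Triton HTTP generate endpoint on port 8000 as set up in the LLaMA guide, GNU time, and placeholder prompts and token counts:

```bash
# Time one short-prompt and one long-prompt request; repeat with
# KV_CACHE_FREE_GPU_MEM_FRACTION set to e.g. 0.5 and then 0.9.
SHORT_PROMPT="What is machine learning?"
LONG_PROMPT="$(python3 -c "print('Summarize the following text. ' + 'lorem ipsum ' * 2000)")"

for PROMPT in "$SHORT_PROMPT" "$LONG_PROMPT"; do
  # GNU time prints wall-clock seconds for each request.
  /usr/bin/time -f "%e s" curl -s -o /dev/null -X POST localhost:8000/v2/models/ensemble/generate \
    -d "{\"text_input\": \"$PROMPT\", \"max_tokens\": 128, \"bad_words\": \"\", \"stop_words\": \"\"}"
done
```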
Expected behavior
With larger KV-cache memory, inference performance should improve for long prompts, and short-prompt performance should not be affected.
actual behavior
With larger KV-cache memory, inference performance improves for long prompts but decreases for short prompts.
additional notes
The build and run steps follow https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md