
How do I specify `max_tokens_in_paged_kv_cache` property during trtllm generation?

Open · vnkc1 opened this issue on Apr 26, 2024

System Info

n/a

Who can help?

@byshiue

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

n/a

Expected behavior

n/a

actual behavior

n/a

additional notes

I want to control the maximum number of tokens in the paged KV cache during generation when using the C++ backend (either via the High-Level API or ModelRunnerCpp). How do I do this?

vnkc1 · Apr 26, 2024

For the Triton backend, it is controlled by https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.9.0/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L332

byshiue · Apr 30, 2024
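For reference, the linked `config.pbtxt` line exposes the cap as a Triton model parameter; in v0.9.0 the stanza looks roughly like this (the placeholder is filled in when the model repository is templated, e.g. with `tools/fill_template.py`):

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "${max_tokens_in_paged_kv_cache}"
  }
}
```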

Thanks. Is there a way to specify it without using Triton (i.e., with either the High-Level API or ModelRunnerCpp)?

ghost · May 1, 2024
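One non-Triton route, sketched below under the assumption of a newer TensorRT-LLM release: the high-level `LLM` API accepts a `KvCacheConfig`, whose `max_tokens` and `free_gpu_memory_fraction` fields bound the paged KV cache. Import paths and field names have moved between versions, so treat the details here as assumptions to be checked against the installed release.

```python
# Sketch only: import path, keyword names, and output layout follow newer
# TensorRT-LLM releases; verify against the version you have installed.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    max_tokens=4096,               # upper bound on tokens held in the paged KV cache
    free_gpu_memory_fraction=0.5,  # fraction of free GPU memory the cache may claim
)

llm = LLM(
    model="/path/to/model_or_engine",  # hypothetical path
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```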

Could you share which example you are using?

byshiue · May 9, 2024

I want to generate context logits efficiently; is there a way to minimize or turn off the KV cache? It is not needed, since I am not doing any generation.

ghost · May 9, 2024

I also want to generate context logits efficiently. Can I turn off the KV cache and set kv_cache_free_gpu_mem_fraction to a minimum value?

Z-NAVY · Sep 26, 2024
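For the context-logits use case, a sketch of the ModelRunnerCpp route, assuming the engine was built with `trtllm-build --gather_context_logits`: keep the paged KV cache small at runtime rather than disabling it, since for a paged-KV engine the cache generally cannot be switched off outright. The KV-cache keyword arguments and the `"context_logits"` output key below are assumptions based on newer releases; check the signature of `ModelRunnerCpp.from_dir` and `generate` in the installed version.

```python
# Sketch only: the KV-cache kwargs and the output-dict layout are assumptions;
# check ModelRunnerCpp.from_dir / generate in your tensorrt_llm version.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(
    engine_dir="/path/to/engine",            # engine built with --gather_context_logits
    kv_cache_free_gpu_memory_fraction=0.05,  # keep the paged KV cache small
    max_tokens_in_paged_kv_cache=1024,       # assumed kwarg: hard cap on cached tokens
)

batch_input_ids = [torch.tensor([1, 2, 3, 4], dtype=torch.int32)]  # dummy prompt ids

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=1,   # generate (almost) nothing; only the prefill pass is needed
    return_dict=True,   # dict output carries "context_logits" when they were gathered
)
context_logits = outputs["context_logits"]
```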