How do I specify `max_tokens_in_paged_kv_cache` property during trtllm generation?
System Info
n/a
Who can help?
@byshiue
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
n/a
Expected behavior
n/a
actual behavior
n/a
additional notes
I want to control the maximum number of tokens in the paged KV cache during generation using the C++ backend (either via the High-Level API or ModelRunnerCpp). How do I do this?
For the Triton backend, it is controlled by https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.9.0/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L332
Thanks. Is there a way to specify it without using Triton (i.e. using either the High-Level API or ModelRunnerCpp)?
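For concreteness, this is roughly what I am hoping to write. This is a sketch only: the module path `tensorrt_llm.bindings.executor` and the parameter names `max_tokens`, `free_gpu_memory_fraction`, and `kv_cache_config` are my assumptions about the executor bindings and may not match the actual API in a given release.

```python
# Sketch only, not verified against a specific release.
from tensorrt_llm.bindings import executor as trtllm

# Assumed: KvCacheConfig.max_tokens is the analogue of
# max_tokens_in_paged_kv_cache in the Triton config, and
# free_gpu_memory_fraction is the analogue of kv_cache_free_gpu_mem_fraction.
kv_cache_config = trtllm.KvCacheConfig(
    max_tokens=4096,
    free_gpu_memory_fraction=0.5,
)

executor_config = trtllm.ExecutorConfig(
    max_beam_width=1,
    kv_cache_config=kv_cache_config,
)

executor = trtllm.Executor(
    "/path/to/engine_dir",          # hypothetical engine path
    trtllm.ModelType.DECODER_ONLY,
    executor_config,
)
```

If ModelRunnerCpp exposes an equivalent knob in `from_dir`, that would work for me as well.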
Could you share which example you are using?
I want to generate context logits efficiently; is there a way to minimize or disable the KV cache? It is not needed, since I am not doing any generation.
I also want to generate context logits efficiently. Can I shut off the KV cache and set kv_cache_free_gpu_mem_fraction to a minimum value?
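To make the question concrete, here is a minimal sketch of what I have in mind with ModelRunnerCpp. The `kv_cache_free_gpu_memory_fraction` argument to `from_dir` and the `context_logits` key in the output dict are assumptions about the Python runtime API, and the engine presumably has to be built with `--gather_context_logits`; please correct me if the knobs are named differently.

```python
# Sketch only: argument names are assumptions about the ModelRunnerCpp
# interface and should be checked against the installed version.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(
    engine_dir="/path/to/engine_dir",       # hypothetical engine path
    rank=0,
    # Assumed knob: keep the paged KV cache as small as the runtime allows,
    # since only the context (prefill) pass is needed for context logits.
    kv_cache_free_gpu_memory_fraction=0.01,
)

batch_input_ids = [torch.tensor([1, 2, 3, 4], dtype=torch.int32)]
outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=1,     # no real decoding; only the context pass matters
    end_id=2,             # placeholder EOS/pad ids from the tokenizer
    pad_id=2,
    return_dict=True,
)

# Assumption: context logits appear under this key when the engine was
# built with --gather_context_logits.
context_logits = outputs.get("context_logits")
```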