
How do I specify `max_tokens_in_paged_kv_cache` property during trtllm generation?

Open · vnkc1 opened this issue on Apr 26, 2024

System Info

n/a

Who can help?

@byshiue

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

n/a

Expected behavior

n/a

actual behavior

n/a

additional notes

I want to control the maximum number of tokens in the paged KV cache during generation when using the C++ backend (either via the High-Level API or ModelRunnerCpp). How do I do this?

vnkc1 · Apr 26, 2024

For the Triton backend, it is controlled by https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.9.0/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L332

byshiue · Apr 30, 2024
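For reference, the linked `config.pbtxt` line exposes the cap as a Triton model parameter; in v0.9.0 the stanza looks roughly like this (the placeholder is filled in when the model repository is templated, e.g. with `tools/fill_template.py`):

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "${max_tokens_in_paged_kv_cache}"
  }
}
```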

Thanks. Is there a way to specify it without using Triton (i.e., with either the High-Level API or ModelRunnerCpp)?

ghost · May 1, 2024
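One non-Triton route, sketched below under the assumption of a newer TensorRT-LLM release: the high-level `LLM` API accepts a `KvCacheConfig`, whose `max_tokens` and `free_gpu_memory_fraction` fields bound the paged KV cache. Import paths and field names have moved between versions, so treat the details here as assumptions to be checked against the installed release.

```python
# Sketch only: import path, keyword names, and output layout follow newer
# TensorRT-LLM releases; verify against the version you have installed.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    max_tokens=4096,               # upper bound on tokens held in the paged KV cache
    free_gpu_memory_fraction=0.5,  # fraction of free GPU memory the cache may claim
)

llm = LLM(
    model="/path/to/model_or_engine",  # hypothetical path
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```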

Could you share which example you are using?

byshiue · May 9, 2024

I want to generate context logits efficiently; is there a way to minimize or turn off the KV cache? It is not needed, since I am not doing any generation.

ghost · May 9, 2024

I also want to generate context logits efficiently. Can I turn off the KV cache and set kv_cache_free_gpu_mem_fraction to a minimum value?

Z-NAVY · Sep 26, 2024
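For the context-logits use case, a sketch of the ModelRunnerCpp route, assuming the engine was built with `trtllm-build --gather_context_logits`: keep the paged KV cache small at runtime rather than disabling it, since for a paged-KV engine the cache generally cannot be switched off outright. The KV-cache keyword arguments and the `"context_logits"` output key below are assumptions based on newer releases; check the signature of `ModelRunnerCpp.from_dir` and `generate` in the installed version.

```python
# Sketch only: the KV-cache kwargs and the output-dict layout are assumptions;
# check ModelRunnerCpp.from_dir / generate in your tensorrt_llm version.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(
    engine_dir="/path/to/engine",            # engine built with --gather_context_logits
    kv_cache_free_gpu_memory_fraction=0.05,  # keep the paged KV cache small
    max_tokens_in_paged_kv_cache=1024,       # assumed kwarg: hard cap on cached tokens
)

batch_input_ids = [torch.tensor([1, 2, 3, 4], dtype=torch.int32)]  # dummy prompt ids

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=1,   # generate (almost) nothing; only the prefill pass is needed
    return_dict=True,   # dict output carries "context_logits" when they were gathered
)
context_logits = outputs["context_logits"]
```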