TensorRT-LLM
How to disable KV cache for LLM
I quantized the Qwen2-0.5B model, which is only about 800 MB. However, inference requires roughly 6 GB of GPU memory, likely due to the KV cache. Can I disable the KV cache for the LLM to reduce the GPU memory required for inference?
I noticed that I can choose to disable the KV cache when converting weights, but doing so then throws an error during build and run.
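For context, here is a rough back-of-the-envelope estimate of the per-token KV cache cost, using the Qwen2-0.5B config values I assume here (24 layers, 2 KV heads from GQA, head_dim 64, fp16 cache). It suggests the cache for a single modest-length sequence should be far below 6 GB, so I suspect the runtime pre-allocates a large KV cache pool by default:

```python
# Rough KV cache size estimate for Qwen2-0.5B (values assumed from the HF config).
num_layers = 24      # num_hidden_layers
num_kv_heads = 2     # num_key_value_heads (GQA)
head_dim = 64        # hidden_size (896) / num_attention_heads (14)
bytes_per_elem = 2   # fp16 cache

# Both K and V are cached, hence the leading factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"{bytes_per_token / 1024:.1f} KiB per token")  # ~12 KiB

seq_len = 4096
print(f"{bytes_per_token * seq_len / 1024**2:.0f} MiB for a 4096-token sequence")  # ~48 MiB
```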
There is a similar issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1422. We are actively working on it.
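In the meantime, a possible workaround is to cap how much GPU memory the paged KV cache pool is allowed to claim rather than disabling the cache entirely, since the runtime otherwise reserves a large fraction of free GPU memory for that pool by default. A minimal sketch, assuming a TensorRT-LLM version that ships the high-level LLM API; names such as `KvCacheConfig` and `free_gpu_memory_fraction`, and the local model path, may differ across releases and setups:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Limit the paged KV cache pool to ~10% of free GPU memory instead of the
# much larger default fraction.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.1)

# Hypothetical path to the quantized Qwen2-0.5B checkpoint.
llm = LLM(model="./qwen2-0.5b-quantized", kv_cache_config=kv_cache_config)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Reducing the engine's maximum batch size and sequence length at build time can further shrink the memory needed at runtime.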