TensorRT-LLM
How to disable KV cache for LLM
I quantized the Qwen2-0.5B model, which is only about 800 MB. However, inference requires roughly 6 GB of GPU memory, likely due to the KV cache. Can I disable the KV cache for the LLM to reduce the GPU memory required for inference?
I noticed that I can choose to disable the KV cache when converting weights, but doing so then throws an error during build and run.
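For context, here is a rough back-of-the-envelope estimate of the per-token KV cache cost, using the Qwen2-0.5B config values I assume here (24 layers, 2 KV heads from GQA, head_dim 64, fp16 cache). It suggests the cache for a single modest-length sequence should be far below 6 GB, so I suspect the runtime pre-allocates a large KV cache pool by default:

```python
# Rough KV cache size estimate for Qwen2-0.5B (values assumed from the HF config).
num_layers = 24      # num_hidden_layers
num_kv_heads = 2     # num_key_value_heads (GQA)
head_dim = 64        # hidden_size (896) / num_attention_heads (14)
bytes_per_elem = 2   # fp16 cache

# Both K and V are cached, hence the leading factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"{bytes_per_token / 1024:.1f} KiB per token")  # ~12 KiB

seq_len = 4096
print(f"{bytes_per_token * seq_len / 1024**2:.0f} MiB for a 4096-token sequence")  # ~48 MiB
```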
There is a similar issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1422. We are actively working on it.
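In the meantime, a possible workaround is to cap how much GPU memory the paged KV cache pool is allowed to claim rather than disabling the cache entirely, since the runtime otherwise reserves a large fraction of free GPU memory for that pool by default. A minimal sketch, assuming a TensorRT-LLM version that ships the high-level LLM API; names such as `KvCacheConfig` and `free_gpu_memory_fraction`, and the local model path, may differ across releases and setups:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Limit the paged KV cache pool to ~10% of free GPU memory instead of the
# much larger default fraction.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.1)

# Hypothetical path to the quantized Qwen2-0.5B checkpoint.
llm = LLM(model="./qwen2-0.5b-quantized", kv_cache_config=kv_cache_config)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Reducing the engine's maximum batch size and sequence length at build time can further shrink the memory needed at runtime.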