
Add Support for KV Cache Quantization

Open Blaizzy opened this issue 9 months ago • 0 comments

Implement quantization techniques for the key-value (KV) cache to reduce memory footprint and potentially improve inference speed.

Motivation: The KV cache can consume significant memory, especially for long contexts. Quantizing it would reduce memory requirements while maintaining acceptable quality, enabling more efficient processing of longer conversations or contexts.

Implementation Notes:

- Research optimal quantization methods for the KV cache (e.g., int8, int4)
- Implement quantization and dequantization functions
- Evaluate the performance and quality impact of various quantization strategies
- Ensure compatibility with the persistent prompt cache feature
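As a rough illustration of the first two notes, the core of KV cache quantization is pairing each low-bit integer tensor with the scales needed to reconstruct it. The sketch below shows symmetric per-row int8 quantize/dequantize helpers in NumPy; the function names and layout are hypothetical, not part of mlx-vlm (an MLX implementation would likely use group-wise scales and pack int4 values, which this sketch omits).

```python
import numpy as np

def quantize_kv(x, bits=8):
    """Symmetric per-row quantization of a KV cache tensor.

    x: float array, e.g. shape (seq_len, n_heads, head_dim).
    Returns integer codes plus per-row scales for dequantization.
    Hypothetical helper for illustration only.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for int8, 7 for int4
    # One scale per row of the last axis (per head_dim vector).
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    """Reconstruct approximate float values from codes and scales."""
    return q.astype(np.float32) * scale

# Usage: quantize a mock (seq_len, n_heads, head_dim) key cache.
keys = np.random.randn(128, 8, 64).astype(np.float32)
q, scale = quantize_kv(keys)
recon = dequantize_kv(q, scale)
max_err = np.abs(keys - recon).max()  # bounded by scale / 2 per row
```

Storing `q` (int8) plus a small `scale` array replaces the float16/float32 cache, roughly halving (int8) or quartering (int4, once packed) the memory, at the cost of a dequantize step before each attention computation; evaluating that quality/speed trade-off is what the third note covers.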

Related Issues: #344

Blaizzy · May 06 '25 22:05