mlx-vlm
Add Support for KV Cache Quantization
Implement quantization techniques for the key-value (KV) cache to reduce memory footprint and potentially improve inference speed.

Motivation: The KV cache can consume significant memory, especially for long contexts. Quantization would reduce memory requirements while maintaining acceptable quality, enabling more efficient processing of longer conversations or contexts.

Implementation Notes:
- Research optimal quantization methods for the KV cache (e.g., int8, int4)
- Implement quantization and dequantization functions
- Evaluate performance and quality impact with various quantization strategies
- Ensure compatibility with the persistent prompt cache feature
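As a rough illustration of the quantize/dequantize pair mentioned above, here is a minimal sketch of symmetric per-row absmax int8 quantization, written with NumPy rather than the MLX API (the function names and shapes are assumptions for illustration, not part of mlx-vlm). An int4 variant would additionally require packing two values per byte, which is omitted here.

```python
import numpy as np

def quantize_kv(x: np.ndarray, bits: int = 8):
    """Symmetric absmax quantization along the last axis.

    Returns the int8 codes and the per-row float scale needed
    to reconstruct an approximation of x.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate float32 tensor from codes and scales."""
    return q.astype(np.float32) * scale

# Example: a fake (heads, seq_len, head_dim) slice of a KV cache.
keys = np.random.randn(4, 16, 64).astype(np.float32)
q, scale = quantize_kv(keys)
recovered = dequantize_kv(q, scale)
# Worst-case rounding error per element is half a quantization step.
max_err = np.max(np.abs(recovered - keys))
```

Storing `q` (1 byte per element) plus one scale per row instead of float16/float32 values is where the memory saving comes from; evaluating the resulting quality impact is the point of the benchmarking step above.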
Related Issues: #344