MLX ~0.29.0 features

Open uogbuji opened this issue 4 months ago • 1 comments

I worked with Claude Code on some enhancements, but don't worry. I'm doing my engineering on it as well, just to be sure 😉. Here is Claude's own summary:

  1. Prompt Caching - Re-enabled and enhanced:
     - Uncommented imports for KVCache and make_prompt_cache
     - Added max_kv_size parameter to load() method
     - Integrated cache usage in completion() with cache_prompt flag
     - Prompt cache now initializes when max_kv_size is provided
  2. 4-bit KV Quantization - Now available:
     - Added QuantizedKVCache import
     - Enhanced comments in MLX_LM_GENERATE_KWARGS documenting quantization options
     - Supports kv_bits=4 for memory-efficient inference
     - Configurable quantized_kv_start to balance quality/memory
  3. Updated Dependencies:
     - MLX upgraded to >=0.29.0
     - MLX-LM upgraded to >=0.27.0
     - Removed upper version constraints for better future compatibility
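
To make the quantization knobs from item 2 concrete, here is a minimal sketch of how they might be gathered into one keyword dict. `build_generate_kwargs` is a hypothetical helper, not part of Toolio or MLX-LM; `kv_bits`, `quantized_kv_start` and `max_kv_size` are the names used in the summary above, while `kv_group_size` and its default of 64 are assumptions on my part.

```python
from typing import Any, Dict, Optional


def build_generate_kwargs(
    kv_bits: Optional[int] = None,
    kv_group_size: int = 64,  # assumed default, not stated in the issue
    quantized_kv_start: int = 0,
    max_kv_size: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect KV-cache options into one kwargs dict."""
    # Restricting to 4 or 8 bits is a choice made for this sketch only.
    if kv_bits is not None and kv_bits not in (4, 8):
        raise ValueError("this sketch only allows kv_bits of 4 or 8")
    if quantized_kv_start < 0:
        raise ValueError("quantized_kv_start must be >= 0")

    kwargs: Dict[str, Any] = {}
    if kv_bits is not None:
        # Quantization options only make sense together, so add them as a group.
        kwargs["kv_bits"] = kv_bits
        kwargs["kv_group_size"] = kv_group_size
        kwargs["quantized_kv_start"] = quantized_kv_start
    if max_kv_size is not None:
        kwargs["max_kv_size"] = max_kv_size
    return kwargs


# Quantize the cache to 4 bits, but only after the first 512 tokens,
# keeping the early context at full precision.
example = build_generate_kwargs(kv_bits=4, quantized_kv_start=512)
```

A larger `quantized_kv_start` keeps more of the early context unquantized, trading memory for quality. Whether a dict like this can be forwarded unchanged to the generation call depends on the installed MLX-LM version, so check its signature first.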

Created a demo script (demo/mlx_enhanced_features.py) showcasing both prompt caching and 4-bit KV quantization. These changes should provide ~50% memory reduction with quantization and faster repeated queries with caching.
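
For a rough sense of where the memory saving comes from, here is a back-of-the-envelope sketch of KV cache size at 16-bit versus 4-bit. The model shape (32 layers, 8 KV heads, head dimension 128) is hypothetical, and the 32 bits of per-group overhead assumes one 16-bit scale and one 16-bit bias per group of 64 values. It is arithmetic only, not a measurement of Toolio or MLX.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Estimate KV cache size in bytes for a given storage width."""
    # Keys and values each hold n_kv_heads * head_dim numbers per token per layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return n_values * bits_per_value / 8


def quantized_bits_per_value(bits=4, group_size=64, overhead_bits_per_group=32):
    """Effective bits per value once per-group scale/bias overhead is included."""
    return bits + overhead_bits_per_group / group_size


# Hypothetical model shape and context length, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=8192)

fp16_bytes = kv_cache_bytes(bits_per_value=16, **shape)
q4_bytes = kv_cache_bytes(bits_per_value=quantized_bits_per_value(), **shape)
ratio = q4_bytes / fp16_bytes

print(f"fp16 cache:  {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {q4_bytes / 2**30:.2f} GiB")
print(f"ratio:       {ratio:.3f}")
```

This covers the cache alone, not weights or activations, so whole-process savings will differ and are worth measuring with the demo script.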

uogbuji, Sep 02 '25 18:09

Closed by #39