Toolio
MLX ~0.29.0 features
I worked with Claude Code on some enhancements, but don't worry: I'm doing my own engineering on it as well, just to be sure 😉. Here is Claude's own summary:
- Prompt Caching - Re-enabled and enhanced:
  - Uncommented imports for KVCache and make_prompt_cache
  - Added max_kv_size parameter to load() method
  - Integrated cache usage in completion() with cache_prompt flag
  - Prompt cache now initializes when max_kv_size is provided
- 4-bit KV Quantization - Now available:
  - Added QuantizedKVCache import
  - Enhanced comments in MLX_LM_GENERATE_KWARGS documenting quantization options
  - Supports kv_bits=4 for memory-efficient inference
  - Configurable quantized_kv_start to balance quality/memory
- Updated Dependencies:
  - MLX upgraded to >=0.29.0
  - MLX-LM upgraded to >=0.27.0
  - Removed upper version constraints for better future compatibility
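For anyone who wants to try the two features from the list above outside Toolio, here is a rough sketch that calls mlx_lm directly. It is not Toolio's own wiring: the helper names `kv_quantization_kwargs` and `demo` are made up for illustration, `kv_group_size=64` is an assumed default, and the model path is whatever you have locally. The `mlx_lm` imports are deferred into `demo` so the helper can be used on machines without MLX installed.

```python
def kv_quantization_kwargs(kv_bits=4, kv_group_size=64, quantized_kv_start=0):
    """Build the KV-quantization keyword arguments for mlx_lm generation.

    Hypothetical helper. Returns an empty dict when kv_bits is None,
    which leaves the KV cache unquantized.
    """
    if kv_bits is None:
        return {}
    return {
        "kv_bits": kv_bits,
        "kv_group_size": kv_group_size,
        # Number of tokens to process before the cache starts being quantized
        "quantized_kv_start": quantized_kv_start,
    }


def demo(model_path, system_prompt, question):
    """Sketch only: needs Apple silicon with mlx and mlx-lm installed."""
    from mlx_lm import load, generate
    from mlx_lm.models.cache import make_prompt_cache

    model, tokenizer = load(model_path)

    # Prompt caching: the cache object keeps the KV state between calls,
    # so the second call only has to process the newly supplied tokens.
    cache = make_prompt_cache(model)
    generate(model, tokenizer, prompt=system_prompt, max_tokens=1,
             prompt_cache=cache)
    cached_answer = generate(model, tokenizer, prompt=question,
                             max_tokens=128, prompt_cache=cache)

    # 4-bit KV quantization, shown as a separate call with a fresh cache.
    quantized_answer = generate(model, tokenizer,
                                prompt=system_prompt + question,
                                max_tokens=128, **kv_quantization_kwargs())
    return cached_answer, quantized_answer
```

The two features are kept in separate calls here on purpose; whether a size-limited (max_kv_size) cache can also be quantized depends on the mlx-lm version, so that combination is worth testing before relying on it.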
Created a demo script (demo/mlx_enhanced_features.py) showcasing both prompt caching and 4-bit quantization features. These improvements should provide ~50% memory reduction with quantization and faster repeated queries with caching.
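To put a number on the memory side, here is a back-of-the-envelope estimate of KV cache size. It is a sketch under stated assumptions, not a measurement: it assumes group-wise quantization that stores one 16-bit scale and one 16-bit bias per group of 64 elements, and the model dimensions in the example are hypothetical. It covers only the KV cache; model weights are unaffected by kv_bits, so whole-process savings will be smaller than the cache-only ratio.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bits=16, group_size=64, meta_bits=16):
    """Estimate KV cache size in bytes (hypothetical helper).

    Assumes one scale and one bias of meta_bits each per quantization
    group when bits < 16; with bits >= 16 the cache is stored as-is.
    """
    # Factor of 2: one tensor for keys and one for values per layer
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    if bits >= 16:
        return elements * bits // 8
    # Per-element cost: payload bits plus amortized scale and bias
    bits_per_element = bits + 2 * meta_bits / group_size
    return int(elements * bits_per_element / 8)


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 8192 tokens
fp16 = kv_cache_bytes(32, 8, 128, 8192)
q4 = kv_cache_bytes(32, 8, 128, 8192, bits=4)
print(f"fp16: {fp16 / 2**20:.0f} MiB, 4-bit: {q4 / 2**20:.0f} MiB, "
      f"ratio: {q4 / fp16:.3f}")
```

Under these assumptions the 4-bit cache costs 4.5 bits per element instead of 16. How that translates into overall memory reduction depends on how much of the footprint is cache versus weights, which varies with model size and context length.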
Closed by #39