Add MXFP4 Quantization Support
This PR introduces MXFP4 quantization support to the mlx-vlm library, extending the existing quantization capabilities.
Summary of Changes
- New quantization mode: Added a `--q-mode` command-line option supporting both `affine` (traditional) and `mxfp4` (new) quantization modes; see the usage sketch after this list
- MXFP4-specific constraints: MXFP4 mode enforces `group_size=32` and `bits=4`, the fixed parameters the MXFP4 format requires
- Enhanced quantization function: Updated the quantization workflow to support the new mode parameter and improved quantization predicates
- Testing: Added new test cases to verify MXFP4 quantization functionality
- Robustness improvements: Enhanced tokenizer handling and improved the logic for processing inputs
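For context, here is a minimal sketch of how the new flag might be wired into the conversion script's argument parser. The surrounding flags mirror the existing script and are assumptions for illustration, not the PR's exact code:

```python
# Sketch only: the actual flag wiring in mlx-vlm's convert script may differ.
import argparse

parser = argparse.ArgumentParser(description="Convert and quantize a model.")
parser.add_argument("-q", "--quantize", action="store_true", help="Quantize the model.")
parser.add_argument("--q-group-size", type=int, default=64, help="Quantization group size.")
parser.add_argument("--q-bits", type=int, default=4, help="Bits per weight.")
parser.add_argument(
    "--q-mode",
    type=str,
    choices=["affine", "mxfp4"],
    default="affine",
    help="Quantization mode: 'affine' (traditional) or 'mxfp4'.",
)
args = parser.parse_args()
```

A conversion would then look something like `python -m mlx_vlm.convert --hf-path <model> -q --q-mode mxfp4`.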
Technical Details
- Added a new `--q-mode` argument to the conversion script with choices `affine` (default) and `mxfp4`
- When MXFP4 mode is selected, the code automatically enforces the required parameters (`group_size=32`, `bits=4`); this is sketched after the list
- Modified the quantization utilities to properly support different quantization modes while maintaining backward compatibility
- Improved the `get_class_predicate` function to handle various quantization scenarios
- Added robustness improvements for tokenizer inputs and audio processing
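To make the enforcement and predicate changes concrete, here is a hedged sketch of how the quantization step could look. `quantize_model`, its signature, and the predicate body are illustrative assumptions, and the `mode` keyword assumes a recent mlx version whose `nn.quantize` accepts it:

```python
# Illustrative sketch; names and structure in the actual PR may differ.
import mlx.nn as nn

def quantize_model(model, config, q_group_size=64, q_bits=4, q_mode="affine"):
    """Quantize model weights, enforcing MXFP4's fixed parameters."""
    if q_mode == "mxfp4":
        # MXFP4 is defined for 4-bit values in groups of 32, so any
        # user-supplied group size / bit width is overridden here.
        q_group_size, q_bits = 32, 4

    def class_predicate(path, module):
        # Quantize only layers that support it and whose last dimension
        # divides evenly into the group size.
        return hasattr(module, "to_quantized") and (
            module.weight.shape[-1] % q_group_size == 0
        )

    nn.quantize(
        model,
        group_size=q_group_size,
        bits=q_bits,
        mode=q_mode,  # assumes an mlx version where nn.quantize takes `mode`
        class_predicate=class_predicate,
    )
    config["quantization"] = {
        "group_size": q_group_size,
        "bits": q_bits,
        "mode": q_mode,
    }
    return model, config
```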
Testing
- Added tests to verify both `affine` and `mxfp4` quantization modes
- Verified that `mxfp4` mode correctly enforces the required parameters (see the test sketch after this list)
- Ensured existing functionality remains intact
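Along those lines, a test for the parameter enforcement could look like the following pytest-style sketch, reusing the hypothetical `quantize_model` from above (the PR's actual tests may be structured differently):

```python
# Hedged test sketch; not the PR's actual test code.
import mlx.nn as nn

def test_mxfp4_enforces_group_size_and_bits():
    model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))
    # Pass deliberately wrong values; mxfp4 mode should override them.
    _, config = quantize_model(
        model, {}, q_group_size=128, q_bits=8, q_mode="mxfp4"
    )
    assert config["quantization"] == {"group_size": 32, "bits": 4, "mode": "mxfp4"}
    # The quantized layers should reflect the enforced parameters as well.
    for layer in model.layers:
        assert isinstance(layer, nn.QuantizedLinear)
        assert layer.group_size == 32 and layer.bits == 4
```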
This feature allows users to apply MXFP4 quantization, which can reduce memory usage in constrained environments while maintaining model performance.
MXFP4 is awesome, at least on text 😄 @Blaizzy Can you run workflows? They don't run automatically.
@Blaizzy This PR is now up to date.
Thanks @zhutao100!
Quick question: since we have mlx-lm as a dependency, isn't there a simpler approach to doing this?
For instance, check how mixed quant works.