Add MXFP4 Quantization Support
This PR introduces MXFP4 quantization support to the mlx-vlm library, extending the existing quantization capabilities.
Summary of Changes
- New quantization mode: Added a `--q-mode` command-line option supporting both `affine` (traditional) and `mxfp4` (new) quantization modes; see the usage sketch after this list
- MXFP4-specific constraints: MXFP4 mode enforces `group_size=32` and `bits=4`, the fixed parameters the MXFP4 format requires
- Enhanced quantization function: Updated the quantization workflow to support the new mode parameter and improved quantization predicates
- Testing: Added new test cases to verify MXFP4 quantization functionality
- Robustness improvements: Enhanced tokenizer handling and improved the logic for processing inputs
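For context, here is a minimal sketch of how the new flag might be wired into the conversion script's argument parser. The surrounding flags mirror the existing script and are assumptions for illustration, not the PR's exact code:

```python
# Sketch only: the actual flag wiring in mlx-vlm's convert script may differ.
import argparse

parser = argparse.ArgumentParser(description="Convert and quantize a model.")
parser.add_argument("-q", "--quantize", action="store_true", help="Quantize the model.")
parser.add_argument("--q-group-size", type=int, default=64, help="Quantization group size.")
parser.add_argument("--q-bits", type=int, default=4, help="Bits per weight.")
parser.add_argument(
    "--q-mode",
    type=str,
    choices=["affine", "mxfp4"],
    default="affine",
    help="Quantization mode: 'affine' (traditional) or 'mxfp4'.",
)
args = parser.parse_args()
```

A conversion would then look something like `python -m mlx_vlm.convert --hf-path <model> -q --q-mode mxfp4`.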
Technical Details
- Added a new `--q-mode` argument to the conversion script with choices `affine` (default) and `mxfp4`
- When MXFP4 mode is selected, the code automatically enforces the required parameters (`group_size=32`, `bits=4`); this is sketched after the list
- Modified the quantization utilities to properly support different quantization modes while maintaining backward compatibility
- Improved the `get_class_predicate` function to handle various quantization scenarios
- Added robustness improvements for tokenizer inputs and audio processing
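To make the enforcement and predicate changes concrete, here is a hedged sketch of how the quantization step could look. `quantize_model`, its signature, and the predicate body are illustrative assumptions, and the `mode` keyword assumes a recent mlx version whose `nn.quantize` accepts it:

```python
# Illustrative sketch; names and structure in the actual PR may differ.
import mlx.nn as nn

def quantize_model(model, config, q_group_size=64, q_bits=4, q_mode="affine"):
    """Quantize model weights, enforcing MXFP4's fixed parameters."""
    if q_mode == "mxfp4":
        # MXFP4 is defined for 4-bit values in groups of 32, so any
        # user-supplied group size / bit width is overridden here.
        q_group_size, q_bits = 32, 4

    def class_predicate(path, module):
        # Quantize only layers that support it and whose last dimension
        # divides evenly into the group size.
        return hasattr(module, "to_quantized") and (
            module.weight.shape[-1] % q_group_size == 0
        )

    nn.quantize(
        model,
        group_size=q_group_size,
        bits=q_bits,
        mode=q_mode,  # assumes an mlx version where nn.quantize takes `mode`
        class_predicate=class_predicate,
    )
    config["quantization"] = {
        "group_size": q_group_size,
        "bits": q_bits,
        "mode": q_mode,
    }
    return model, config
```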
Testing
- Added tests to verify both `affine` and `mxfp4` quantization modes
- Verified that `mxfp4` mode correctly enforces the required parameters (see the test sketch after this list)
- Ensured existing functionality remains intact
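Along those lines, a test for the parameter enforcement could look like the following pytest-style sketch, reusing the hypothetical `quantize_model` from above (the PR's actual tests may be structured differently):

```python
# Hedged test sketch; not the PR's actual test code.
import mlx.nn as nn

def test_mxfp4_enforces_group_size_and_bits():
    model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))
    # Pass deliberately wrong values; mxfp4 mode should override them.
    _, config = quantize_model(
        model, {}, q_group_size=128, q_bits=8, q_mode="mxfp4"
    )
    assert config["quantization"] == {"group_size": 32, "bits": 4, "mode": "mxfp4"}
    # The quantized layers should reflect the enforced parameters as well.
    for layer in model.layers:
        assert isinstance(layer, nn.QuantizedLinear)
        assert layer.group_size == 32 and layer.bits == 4
```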
This feature allows users to apply MXFP4 quantization, which can reduce memory usage in constrained environments while maintaining model performance.
MXFP4 is awesome, at least on text 😄 @Blaizzy Can you run workflows? They don't run automatically.
@Blaizzy This PR is now up to date.
Thanks @zhutao100!
Quick question: since we have mlx-lm as a dependency, isn't there a simpler approach to doing this?
For instance, check how mixed quant works.