
Add MXFP4 Quantization Support

Open · zhutao100 opened this pull request 3 months ago

This PR introduces MXFP4 quantization support to the mlx-vlm library, extending the existing quantization capabilities.

Summary of Changes

  • New quantization mode: Added --q-mode command-line option with support for both affine (traditional) and mxfp4 (new) quantization modes
  • MXFP4-specific constraints: MXFP4 mode enforces group_size=32 and bits=4 to ensure proper operation (see the sketch after this list)
  • Enhanced quantization function: Updated the quantization workflow to support the new mode parameter and improved quantization predicates
  • Testing: Added new test cases to verify MXFP4 quantization functionality
  • Robustness improvements: Enhanced tokenizer handling and improved the logic for processing inputs
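
A minimal sketch of how the new flag and its constraint enforcement might be wired. Only `--q-mode`, its `affine`/`mxfp4` choices, and the enforced `group_size=32` / `bits=4` come from this PR; the parser code itself is illustrative, not the actual diff:

```python
import argparse

# Illustrative wiring for the new quantization-mode flag.
parser = argparse.ArgumentParser(description="Convert and quantize a model")
parser.add_argument(
    "--q-mode",
    choices=["affine", "mxfp4"],
    default="affine",
    help="Quantization mode (affine is the traditional default)",
)
parser.add_argument("--q-group-size", type=int, default=64)
parser.add_argument("--q-bits", type=int, default=4)
args = parser.parse_args()

if args.q_mode == "mxfp4":
    # MXFP4 only operates on groups of 32 4-bit elements,
    # so override whatever the user passed.
    args.q_group_size = 32
    args.q_bits = 4
```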

Technical Details

  • Added a new --q-mode argument to the conversion script with choices between affine (default) and mxfp4
  • When MXFP4 mode is selected, the code automatically enforces the required parameters (group_size=32, bits=4)
  • Modified the quantization utilities to properly support different quantization modes while maintaining backward compatibility
  • Improved the get_class_predicate function to handle various quantization scenarios (a predicate sketch follows this list)
  • Added robustness improvements for tokenizer inputs and audio processing
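
get_class_predicate is named in the PR, but its body is not shown here; the following is a hypothetical mode-aware predicate in the style MLX uses for nn.quantize (a callable that receives a module path and the module and decides whether to quantize it):

```python
import mlx.nn as nn

def get_class_predicate(mode: str = "affine"):
    """Return a predicate deciding which modules get quantized.

    Hypothetical sketch; the real logic in the PR may differ.
    """
    def predicate(path: str, module: nn.Module) -> bool:
        # Only modules that implement to_quantized can be quantized at all.
        if not hasattr(module, "to_quantized"):
            return False
        if mode == "mxfp4":
            # MXFP4 shares one scale per 32 elements, so the weight's
            # last dimension must be divisible by the group size.
            return hasattr(module, "weight") and module.weight.shape[-1] % 32 == 0
        return True

    return predicate
```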

Testing

  • Added tests to verify both affine and mxfp4 quantization modes
  • Verified that mxfp4 mode correctly enforces the required parameters (illustrated in the test sketch after this list)
  • Ensured existing functionality remains intact
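
As an illustration of the parameter-enforcement check, here is a self-contained pytest-style sketch against a hypothetical helper (the real tests exercise mlx-vlm's actual conversion code, not this stand-in):

```python
# Hypothetical stand-in for the enforcement described above.
def resolve_quant_params(q_mode: str, group_size: int, bits: int):
    if q_mode == "mxfp4":
        return 32, 4  # MXFP4 mandates these values
    return group_size, bits

def test_mxfp4_enforces_parameters():
    # Whatever the user passes, mxfp4 pins group_size=32 and bits=4.
    assert resolve_quant_params("mxfp4", 64, 8) == (32, 4)

def test_affine_keeps_user_parameters():
    assert resolve_quant_params("affine", 64, 8) == (64, 8)
```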

This feature allows users to leverage MXFP4 quantization, which can substantially reduce memory footprint in memory-constrained environments while largely preserving model quality.
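
For intuition on the memory claim: MXFP4 (per the OCP Microscaling spec) stores 4-bit FP4 (E2M1) elements with one shared 8-bit exponent scale per group of 32, so the effective footprint is easy to work out (back-of-the-envelope arithmetic, not code from the PR):

```python
# Effective bits per weight for MXFP4: 4-bit elements plus one shared
# 8-bit scale amortized over each 32-element group.
elem_bits, scale_bits, group_size = 4, 8, 32
effective_bits = elem_bits + scale_bits / group_size  # 4.25 bits/weight
fp16_bits = 16
print(f"MXFP4: {effective_bits} bits/weight, "
      f"{fp16_bits / effective_bits:.2f}x smaller than fp16")
# -> MXFP4: 4.25 bits/weight, 3.76x smaller than fp16
```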

zhutao100 avatar Sep 27 '25 17:09 zhutao100

MXFP4 is awesome, at least on text 😄 @Blaizzy, can you run the workflows? They don't run automatically.

reneleonhardt avatar Oct 31 '25 16:10 reneleonhardt

@Blaizzy This PR is now up to date.

zhutao100 avatar Dec 08 '25 18:12 zhutao100

Thanks @zhutao100!

Quick question: since we have mlx-lm as a dependency, isn't there a simpler approach to doing this?

For instance, check how mixed quant works.

Blaizzy avatar Dec 08 '25 21:12 Blaizzy