This PR adds support for the Quartet QAT method.

The goal of this PR is to integrate inference and training support for the Quartet QAT method. This would allow both the forward and backward passes to run in MXFP4, enabling very fast training on Blackwell GPUs.

Currently, we're working on the kernels here, here and here (some of the libs aren't public yet). We're planning to release the first version of the kernels this week and to have performance optimized by the end of June.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Jun 09 '25 12:06 BlackSamorez

cc @mekkcyber

Jun 09 '25 14:06 Rocketknight1

Hi @BlackSamorez, I'm really looking forward to experimenting with this.

When can we expect to have the kernels public so we can begin testing, even if they are still WIP?

Jun 30 '25 21:06 kooshi

@MekkCyber Hi, thanks for reviewing this! It took us a while, but all the kernels necessary for inference have now been published, and I've updated the PR description. May I ask you to do another pass? Most of your previous comments no longer apply because of the refactoring.

Jul 14 '25 16:07 BlackSamorez

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Jul 15 '25 10:07 HuggingFaceDocBuilderDev

@SunMarc added docs, improved docstring, cleaned the code where you asked.

Jul 18 '25 14:07 BlackSamorez

Actually, give me a minute. I'm adding Triton pseudo-quantization kernels for people without Blackwell GPUs to be able to evaluate the models (although without speedups).
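For readers unfamiliar with the term, pseudo-quantization rounds values to what MXFP4 can represent while still storing and computing them in a regular floating-point type, which is why it reproduces the numerics but not the speedups. The sketch below is a minimal pure-Python illustration, not the Triton kernel added in this PR. It assumes the standard microscaling layout (blocks of 32 elements sharing one power-of-two scale, FP4 E2M1 elements) and plain round-to-nearest, and it leaves out other steps the actual method may apply. The function names are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # elements that share one power-of-two scale


def quantize_block(block):
    """Round one block to MXFP4-representable values, kept as Python floats."""
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return [0.0] * len(block)
    # floor(log2(absmax)), computed exactly: absmax = m * 2**e with 0.5 <= m < 1.
    exponent = math.frexp(absmax)[1] - 1
    # Shared scale: subtract 2, the largest exponent of an E2M1 element.
    scale = 2.0 ** (exponent - 2)
    out = []
    for v in block:
        # Nearest grid magnitude; anything above 6 saturates to 6,
        # and exact ties go to the smaller magnitude.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag, v) * scale)
    return out


def mxfp4_pseudo_quantize(values):
    """Pseudo-quantize a flat list whose length is a multiple of BLOCK."""
    if len(values) % BLOCK:
        raise ValueError(f"length must be a multiple of {BLOCK}")
    out = []
    for start in range(0, len(values), BLOCK):
        out.extend(quantize_block(values[start:start + BLOCK]))
    return out
```

Calling `mxfp4_pseudo_quantize` on a list of 32 floats returns a list of the same length in which every entry is a grid value times the block's shared scale.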

Jul 18 '25 15:07 BlackSamorez

Added pseudoquantization and updated the requirements so the method can run with it (that path doesn't require qutlass). Added pseudoquant tests. Updated the documentation.

Jul 18 '25 16:07 BlackSamorez

@SunMarc please take a look at the newly raised errors and warnings in quantizer_fp_quant.py.

Jul 18 '25 16:07 BlackSamorez

Should be good

Jul 22 '25 12:07 BlackSamorez

One last nit, the build PR documentation is not passing:

    raise RuntimeError(
    RuntimeError: The following files are not present in the table of contents:
    - quantization/fp_quant
    Add them to ../transformers/docs/source/en/_toctree.yml.

Jul 22 '25 15:07 SunMarc

Added it to toctree

Jul 22 '25 15:07 BlackSamorez

@SunMarc it hit job cancellation somehow. Might need a restart. It should be good.

Jul 23 '25 08:07 BlackSamorez

[For maintainers] Suggested jobs to run (before merge)

run-slow: fp_quant_integration

Jul 23 '25 09:07 github-actions[bot]

Merged ! Thanks for your work

Jul 23 '25 09:07 SunMarc

Hey @BlackSamorez, is there a way to make fp_quant compatible with Python 3.9? Our CI runs on this version, but fp_quant requires 3.11.

Jul 24 '25 15:07 SunMarc

I guess I'll have to remove match-case constructions and it'll work. Why run on 3.9 in 2025 though?
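For context, the `match` statement was added in Python 3.10, so any module that uses it fails to parse on 3.9. A rough sketch of the kind of rewrite involved is shown here. The function and the format names are hypothetical and are not taken from fp_quant. The original form is kept as a comment so the snippet itself still runs on 3.9.

```python
def bits_per_element(fmt: str) -> int:
    """Return the element width for a (hypothetical) format name."""
    # Python 3.10+ form using structural pattern matching:
    #
    #     match fmt:
    #         case "mxfp4" | "nvfp4":
    #             return 4
    #         case "bf16":
    #             return 16
    #         case _:
    #             raise ValueError(f"Unsupported format: {fmt!r}")
    #
    # Equivalent if/elif chain that also parses on Python 3.9:
    if fmt in ("mxfp4", "nvfp4"):
        return 4
    elif fmt == "bf16":
        return 16
    else:
        raise ValueError(f"Unsupported format: {fmt!r}")
```

Simple literal patterns like these translate one-to-one. Patterns that destructure sequences or objects need explicit `isinstance` checks and unpacking instead.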

Jul 24 '25 15:07 BlackSamorez

We want to make sure that transformers runs correctly on the minimum Python version that is still maintained. When that version reaches EOL, we will switch to the next one.

Jul 24 '25 15:07 SunMarc

[WIP] Quartet QAT support
