[WIP] Quartet QAT support
This PR adds support for the Quartet QAT method.
The goal of this PR is to integrate inference and training support for the Quartet QAT method. This would make it possible to run both the forward and backward passes in MXFP4, enabling very fast training on Blackwell GPUs.
Currently, we're working on the kernels here, here and here (some of the libs aren't public yet). We're planning to release the first version of the kernels this week and to have performance optimized by the end of June.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
cc @mekkcyber
Hi @BlackSamorez, I'm really looking forward to experimenting with this.
When can we expect to have the kernels public so we can begin testing, even if they are still WIP?
@MekkCyber Hi, thanks for reviewing this! It took us a while, but all the kernels necessary for inference have been published: I've updated the PR description. May I ask you to do another pass? Your previous comments mostly don't apply anymore because of refactoring.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@SunMarc added docs, improved docstring, cleaned the code where you asked.
Actually, give me a minute. I'm adding Triton pseudo-quantization kernels for people without Blackwell GPUs to be able to evaluate the models (although without speedups).
Added pseudo-quantization and updated the requirements so the method can run with it (qutlass isn't required in that mode). Added pseudo-quantization tests. Updated the documentation.
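For readers unfamiliar with the term, here is a minimal sketch of what pseudo-quantization (quantize, then immediately dequantize) to MXFP4 means. It is not the Triton kernel from this PR and leaves out anything Quartet-specific. It assumes the usual MX convention of 32-element blocks that share a power-of-two scale, with elements rounded to the FP4 E2M1 grid; the function names and the tie-breaking rule are illustrative choices.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # assumed MX block size: 32 elements share one scale
E2M1_EMAX = 2    # largest E2M1 exponent, since 6.0 == 1.5 * 2**2


def round_to_e2m1(x):
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Ties go to the smaller magnitude here; a real kernel may round
    differently (illustrative simplification).
    """
    magnitude = min(abs(x), E2M1_GRID[-1])
    nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
    return math.copysign(nearest, x)


def pseudo_quantize_mxfp4(values):
    """Quantize-dequantize a flat list of floats, block by block."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        max_abs = max(abs(v) for v in block)
        if max_abs == 0.0:
            out.extend(0.0 for _ in block)
            continue
        # Shared power-of-two scale chosen from the block's largest magnitude.
        shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
        scale = 2.0 ** shared_exp
        # Scale down, snap to the FP4 grid, scale back up.
        out.extend(round_to_e2m1(v / scale) * scale for v in block)
    return out
```

The output stays in ordinary floating point, so it runs on any hardware and reproduces the rounding error of the low-precision format, but it gives no speedup.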
@SunMarc please take a look at the newly raised errors and warnings in quantizer_fp_quant.py.
Should be good
One last nit, the build PR documentation is not passing:
raise RuntimeError(
RuntimeError: The following files are not present in the table of contents:
- quantization/fp_quant
Add them to ../transformers/docs/source/en/_toctree.yml.
Added it to toctree
@SunMarc it hit job cancellation somehow. Might need a restart. It should be good.
[For maintainers] Suggested jobs to run (before merge)
run-slow: fp_quant_integration
Merged! Thanks for your work.
Hey @BlackSamorez, is there a way to make fp_quant compatible with Python 3.9? Our CI runs on this version, but fp_quant requires 3.11.
I guess I'll have to remove match-case constructions and it'll work. Why run on 3.9 in 2025 though?
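For reference, a minimal sketch of what such a change could look like. The function and the dtype names below are hypothetical and not taken from fp_quant; the point is only that a match-case dispatch (Python 3.10+ syntax) can be rewritten as a plain if/elif chain that also parses on 3.9.

```python
def bits_per_element(dtype_name):
    """Python 3.9-compatible replacement for a match-case dispatch.

    Hypothetical example; the 3.10+ form would have been:

        match dtype_name:
            case "mxfp4":
                return 4
            case "bf16" | "fp16":
                return 16
            case _:
                raise ValueError(...)
    """
    if dtype_name == "mxfp4":
        return 4
    elif dtype_name in ("bf16", "fp16"):
        return 16
    else:
        raise ValueError(f"Unsupported dtype: {dtype_name!r}")
```

The match statement is a syntax error on 3.9, so it has to be removed from every module that gets imported; guarding it behind a version check at runtime is not enough.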
We want to make sure that transformers runs correctly on the oldest Python version that is still maintained. When that version reaches EOL, we switch to the next one.