[WIP] Refactor Attention Modules
## What does this PR do?
Draft PR to address some issues with the attention modules. In particular, this PR:

- Moves a number of transformer-related blocks/objects in the attention module that would have a better home under `models/transformers`.
- Moves the definition of the `Attention` module into `attention.py`, rather than having it live in `attention_processor.py`.
- We have a very large number of processors, but with #11368 we should no longer need a good chunk of them and they can be deprecated. I think with these changes we would end up with ~3 processors per model (Attn, IPAdapter, PAG).
- Makes it so that we can bump our minimum supported Torch version to `>=2.0` and use the `F.scaled_dot_product_attention` (SDPA) API for all processors (see the first sketch after this list).
- There was some discussion around naming of the processors: https://huggingface.slack.com/archives/C065E480NN9/p1737130514639479. We landed on calling the processors something like `AttnProcessorSDPA`, but with #11368 we no longer need a dedicated processor per backend, so I think it's okay to just name the class `AttnProcessor`.
- Moves processor definitions into the model files, so we don't end up with very large files containing all processors.
- Introduces `AttentionModuleMixin`, which contains all common methods related to attention operations; new attention modules would inherit from this class (sketched after this list).
- Introduces an `AttentionMixin` for models so that methods like `set_processor` are not duplicated across models, although we could probably just add this Mixin's methods to `ModelMixin` (sketched after this list).
- Uses Flux as an example to show how we can define a single processor that supports both fused and unfused QKV projections (see the last sketch after this list).
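To make the SDPA point concrete, here is a minimal standalone sketch of the call every processor would route through on Torch `>=2.0` (toy shapes, not actual processor code):

```python
import torch
import torch.nn.functional as F

# SDPA expects (batch, heads, seq_len, head_dim) tensors and automatically
# dispatches to the flash / memory-efficient / math kernels on torch>=2.0.
batch, heads, seq_len, head_dim = 2, 8, 64, 64
query, key, value = (torch.randn(batch, heads, seq_len, head_dim) for _ in range(3))

out = F.scaled_dot_product_attention(query, key, value)  # (2, 8, 64, 64)
```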
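Roughly what `AttentionModuleMixin` could host, as a simplified sketch; the method names mirror the existing `Attention` module, and the fusion details here are illustrative rather than final:

```python
import torch


class AttentionModuleMixin:
    """Sketch: common behavior shared by attention modules (simplified)."""

    def set_processor(self, processor):
        # Swap the processor at runtime, e.g. for IP-Adapter or PAG variants.
        self.processor = processor

    def get_processor(self):
        return self.processor

    @torch.no_grad()
    def fuse_projections(self):
        # Concatenate the q/k/v weights so one matmul replaces three;
        # processors can check `fused_projections` and use `to_qkv` instead.
        weight = torch.cat([self.to_q.weight, self.to_k.weight, self.to_v.weight])
        self.to_qkv = torch.nn.Linear(
            weight.shape[1], weight.shape[0], bias=False, device=weight.device, dtype=weight.dtype
        )
        self.to_qkv.weight.copy_(weight)
        self.fused_projections = True
```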
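And a sketch of the model-level `AttentionMixin`, i.e. the `set_processor` plumbing that is currently duplicated across models (assumes the class is mixed into an `nn.Module`, so `named_modules` is available):

```python
class AttentionMixin:
    """Sketch: model-level processor plumbing, defined once (simplified)."""

    @property
    def attn_processors(self):
        # Collect the processor of every attention submodule, keyed by its
        # fully qualified module name.
        return {
            f"{name}.processor": module.get_processor()
            for name, module in self.named_modules()
            if hasattr(module, "get_processor")
        }

    def set_attn_processor(self, processor):
        # Apply a single processor (or a dict keyed like `attn_processors`)
        # to every attention submodule.
        for name, module in self.named_modules():
            if hasattr(module, "set_processor"):
                if isinstance(processor, dict):
                    module.set_processor(processor[f"{name}.processor"])
                else:
                    module.set_processor(processor)
```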
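Finally, a sketch of the single-processor idea on Flux: the processor branches on whether the module has fused its projections, so a separate fused-QKV processor class is no longer needed. This is heavily simplified (no RoPE, QK-norm, or joint text/image streams), and `fused_projections`/`to_qkv` match the mixin sketch above:

```python
import torch.nn.functional as F


class FluxAttnProcessor:
    """Sketch: one processor covering fused and unfused QKV projections."""

    def __call__(self, attn, hidden_states, attention_mask=None):
        if getattr(attn, "fused_projections", False):
            # One matmul, then split back into query/key/value.
            query, key, value = attn.to_qkv(hidden_states).chunk(3, dim=-1)
        else:
            query = attn.to_q(hidden_states)
            key = attn.to_k(hidden_states)
            value = attn.to_v(hidden_states)

        batch, seq_len, _ = query.shape
        head_dim = query.shape[-1] // attn.heads
        query, key, value = (
            t.view(batch, seq_len, attn.heads, head_dim).transpose(1, 2)
            for t in (query, key, value)
        )

        out = F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)
        return out.transpose(1, 2).reshape(batch, seq_len, attn.heads * head_dim)
```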
Fixes # (issue)
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline?
- [ ] Did you read our philosophy doc (important for complex PRs)?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.