Natively support SONAR text models as M2M100 encoder and decoder models
What does this PR do?
This PR adds native support for SONAR text encoders and decoders (https://github.com/facebookresearch/SONAR).
SONAR for text is architecturally an NLLB model, but with the encoder representations mean-pooled into a single fixed-size vector before being passed to the decoder. The SONAR encoder therefore works as a sentence embedder, and thanks to pretraining on translation data, it is massively multilingual and language-agnostic. And, unlike other sentence encoders, this one has a decoder that can reconstruct the original texts from their embeddings.
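As a hedged illustration of that pooling step (not the PR's actual code; the function name and shapes here are assumptions), attention-mask-aware mean pooling over encoder states can be sketched as:

```python
# Sketch: average encoder hidden states over non-padding positions to get one
# fixed-size sentence embedding per input, as SONAR does before decoding.
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1=token, 0=pad."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # avoid division by zero
    return summed / counts

# One embedding per sentence, regardless of sequence length:
states = torch.randn(2, 7, 1024)
mask = torch.tensor([[1, 1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 1, 1]])
embeddings = mean_pool(states, mask)  # shape (2, 1024)
```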
To support such models natively, the easiest way is to create ~~NLLB~~ M2M100 model classes with encoder only or decoder only, similar to the existing classes T5EncoderModel and MT5EncoderModel.
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Details
- I add two new public model classes, `M2M100EncoderModel` and `M2M100DecoderModel`. `M2M100EncoderModel` is a typical encoder-only model (BERT-style), with an additional option of applying mean pooling to its outputs (this is how SONAR text embeddings are computed). `M2M100DecoderModel` is a module consisting of `M2M100Decoder` and an output projection layer. As input, it always expects the `encoder_outputs` argument to be present, and ignores its `input_ids`.
- Unlike `M2M100ForConditionalGeneration`, `M2M100DecoderModel` always has its decoder input embedding and decoder output projection layers tied, because this is how the SONAR decoder was originally implemented.
- I add a script for creating these models from the original `fairseq2` checkpoints (it doesn't require `fairseq2` as a dependency; instead, it just reads and reformats the torch model state dicts).
- I add specialized unit tests for the encoder-only model (implemented following `T5EncoderOnlyModelTest`, see https://github.com/huggingface/transformers/pull/8717), and for the decoder-only model (based loosely on similar ideas, but with more tweaks). 
- I add an integration test based on the checkpoints that I published to the HF Hub. It reproduces the example sentence encoding and decoding from the README in the SONAR repo: https://github.com/facebookresearch/SONAR/tree/main.
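The checkpoint-conversion idea (rename state-dict keys, no `fairseq2` dependency) can be sketched roughly as below. The key prefixes are hypothetical placeholders, not the real `fairseq2` or `transformers` parameter names; the real mapping lives in the PR's conversion script.

```python
# Sketch: convert a fairseq2-style state dict to a transformers-style layout by
# renaming key prefixes. The prefix pairs below are ILLUSTRATIVE ONLY.
RENAME_RULES = [
    ("encoder_frontend.embed.", "model.encoder.embed_tokens."),
    ("decoder_frontend.embed.", "model.decoder.embed_tokens."),
    ("final_proj.", "lm_head."),
]

def convert_state_dict(state_dict: dict) -> dict:
    converted = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in RENAME_RULES:
            if new_key.startswith(old_prefix):
                new_key = new_prefix + new_key[len(old_prefix):]
                break
        converted[new_key] = value  # value is a torch.Tensor in a real checkpoint
    return converted

# Placeholder values stand in for tensors:
fake_checkpoint = {
    "encoder_frontend.embed.weight": 0,
    "final_proj.weight": 0,
    "encoder.layers.0.self_attn.q_proj.weight": 0,  # unmatched keys pass through
}
converted = convert_state_dict(fake_checkpoint)
```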
Testing
All the unit tests I added can be run with
`python -m pytest tests/models/m2m_100/test_modeling_m2m_100.py`
The integration tests that I added are marked as slow, so they can be run with
`RUN_SLOW=1 python -m pytest tests/models/m2m_100/test_modeling_m2m_100_sonar.py::SonarIntegrationTests`
Hi @avidale, thanks for opening this PR! Let us know when it's ready for review
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey, I want to continue this work! How do I reopen it?
@avidale I can reopen for you
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Commenting to avoid it getting closed
Still planning to return to it
Hi @amyeroberts! The PR is mostly ready for review. I am still uncertain about several things, but I would appreciate it if you could comment on the current design!
One of the main questions: is it alright to always demand `encoder_outputs` in `M2M100DecoderModel`, or would a different presentation of its inputs be preferable?
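To make the question concrete, here is a hedged sketch (shapes and names are assumptions, and the class itself is the one proposed in this PR, so the call is only indicated in comments) of what "always demanding `encoder_outputs`" means for callers: the pooled SONAR embedding has to be wrapped back into a length-1 "sequence" that the decoder can cross-attend to.

```python
import torch

# A batch of 2 pooled SONAR sentence embeddings (hidden size 1024 assumed).
sentence_embeddings = torch.randn(2, 1024)

# The decoder consumes encoder states of shape (batch, src_len, dim), so the
# single pooled vector is presented as a length-1 source sequence:
encoder_states = sentence_embeddings.unsqueeze(1)  # (2, 1, 1024)

# Hypothetical call once the PR's class exists (not runnable today):
# decoder = M2M100DecoderModel.from_pretrained(...)
# out = decoder(encoder_outputs=(encoder_states,), decoder_input_ids=...)
```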
Also tagging @ArthurZucker as a potential reviewer
Tagging @ArthurZucker, @younesbelkada, and @amyeroberts once more, as potential reviewers. Could you please give some feedback?
Yes! Sorry it has been on my stack for a while, and summer vacations hit!
@ArthurZucker any other comments? What do you think about my responses?
Hi @ArthurZucker! I have responded to your last comments. Some tests are failing, but they all seem to be unrelated to my code.
Will review ASAP
@ArthurZucker do you think this PR has a chance?
What exactly am I supposed to do with modular approach here?
My guess is that you suggest creating a "modular" sonar directory that imports some code from m2m100, instead of adding more code to m2m100. Is it correct?
Yes @avidale ! This should make it easier to isolate the changes and iterate!
@avidale do you plan to update this, would really appreciate using this
Hi @danikhan632! This PR grew very stale; I think it might need a complete rewrite from scratch. I might return to it nearer the end of the year, but if you are willing to revive it sooner, I'd be happy to hand it over to you!