transformers icon indicating copy to clipboard operation
transformers copied to clipboard

Natively support SONAR text models as M2M100 encoder and decoder models

Open avidale opened this issue 1 year ago • 13 comments

What does this PR do?

This PR adds native support for SONAR text encoders are decoders (https://github.com/facebookresearch/SONAR).

SONAR for text is architecturally an NLLB model, but with the encoder representations mean-pooled into a single fixed-sized vector before passing them to the decoder. Thus, SONAR encoder works as a sentence embedder, and thanks to pretraining on translation data, it is massively multilingual and language-agnostic. And, unlike other sentence encoder, this one has a decoder that can reconstruct the original texts back from their embeddings.

To supports such models natively, the easiest way would be to create ~~NLLB~~ M2M100 model classes with encoder only or decoder only, similarly to the existing classes T5EncoderModel or MT5EncoderModel.

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Details

  • I add two new public model classes, M2M100EncoderModel and M2M100DecoderModel.
  • M2M100EncoderModel is a typical encoder-only model (BERT-style), with an additional option of applying mean pooling to its outputs (this is how SONAR text embeddings are computed)
  • M2M100DecoderModel is a module consisting of M2M100Decoder and an output projection layer. As the input, it always expects the encoder_outputs argument to be present, and ignores its input_ids.
  • Unlike M2M100ForConditionalGeneration, M2M100DecoderModel always has its decoder input embedding and decoder output projection layers tied, because this is how SONAR decoder originally was implemented.
  • I add a script for creating these models from the original fairseq2 checkpoints (it doesn't require fairseq2 as a dependency; instead, it just reads and reformats the torch model state dicts).
  • I add specialized unit tests for the encoder-only model (implemented following T5EncoderOnlyModelTest, see https://github.com/huggingface/transformers/pull/8717), and for the decoder-only model (based loosely on similar ideas, but with more tweaks).
  • I add an integration test based on the checkpoints that I published to the HF hub. They reproduce the example sentence encoding and decoding from the readme in the SONAR repo: https://github.com/facebookresearch/SONAR/tree/main.

Testing

All the unit tests I added are run by

python -m pytest tests/models/m2m_100/test_modeling_m2m_100.py

The integration tests that I added are marked as slow, so they could be run with

RUN_SLOW=1 python -m pytest tests/models/m2m_100/test_modeling_m2m_100_sonar.py::SonarIntegrationTests

avidale avatar Mar 13 '24 22:03 avidale

Hi @avidale, thanks for opening this PR! Let us know when it's ready for review

amyeroberts avatar Mar 18 '24 14:03 amyeroberts

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 13 '24 08:04 github-actions[bot]

Hey, I want to continue this work! How to I reopen it?

avidale avatar Apr 25 '24 08:04 avidale

@avidale I can reopen for you

amyeroberts avatar Apr 25 '24 09:04 amyeroberts

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 20 '24 08:05 github-actions[bot]

Commenting to avoid it getting closed

avidale avatar May 23 '24 07:05 avidale

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 11 '24 08:07 github-actions[bot]

Still planning to return to it

avidale avatar Jul 11 '24 08:07 avidale

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 05 '24 08:08 github-actions[bot]

Hi @amyeroberts! The PR is mostly ready for the review. I am still uncertain about several things, but I would appreciate if you commented on the current design!

One of the main questions: is it alright to always demand encoder_outputs in the M2M100DecoderModel, or a different presentation of its inputs is preferrable?

avidale avatar Aug 05 '24 15:08 avidale

Also tagging @ArthurZucker as a potential reviewer

avidale avatar Aug 06 '24 08:08 avidale

Tagging @ArthurZucker, @younesbelkada, and @amyeroberts once more, as potential reviewers. Could you please give some feedback?

avidale avatar Aug 22 '24 13:08 avidale

Yes! Sorry it has been on my stack for a while, and summer vacations hit!

ArthurZucker avatar Aug 26 '24 09:08 ArthurZucker

@ArthurZucker any other comments? What do you think about my responses?

avidale avatar Oct 01 '24 08:10 avidale

Hi @ArthurZucker! I have responded to your last comments. Some tests are failing, but they all seem to be unrelated to my code.

avidale avatar Nov 08 '24 13:11 avidale

Will review ASAP

ArthurZucker avatar Nov 28 '24 16:11 ArthurZucker

@ArthurZucker do you think this PR has a chance?

avidale avatar Feb 12 '25 16:02 avidale

What exactly am I supposed to do with modular approach here? My guess is that you suggest creating a "modular" sonar directory that imports some code from m2m100, instead of adding more code to m2m100. Is it correct?

avidale avatar Feb 25 '25 09:02 avidale

Yes @avidale ! This should make it easier to isolate the changes and iterate!

ArthurZucker avatar Jun 19 '25 02:06 ArthurZucker

@avidale do you plan to update this, would really appreciate using this

danikhan632 avatar Oct 01 '25 20:10 danikhan632

Hi @danikhan632! This PR grew very stale; I think it might need a complete rewriting from scratch. I might return to it nearer to the end of the year, but if you are willing to revive it sooner, I'd be happy to hand it over to you!

avidale avatar Oct 13 '25 18:10 avidale