
[FEATURE] NLLB translator support.

Open · npatsakula opened this issue · 1 comment

Hello!

Facebook No Language Left Behind description page: https://github.com/facebookresearch/fairseq/tree/nllb

Hugging Face

The models are already hosted on the Hugging Face hub:

https://huggingface.co/facebook/nllb-200-1.3B
https://huggingface.co/facebook/nllb-200-3.3B
https://huggingface.co/facebook/nllb-200-distilled-600M

Plan

- [ ] Merge NLLB support into rust-tokenizers: https://github.com/guillaume-be/rust-tokenizers/pull/76
- [ ] Copy the M2M100 model code and fix what breaks (it is almost the same model).
- [ ] Add pre-trained model URLs (see the resource sketch below).
- [ ] Expand the language list.
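For the pre-trained URL step, here is a minimal sketch of what the resource registration might look like, mirroring the existing `M2M100ModelResources` pattern; the `NLLBModelResources` struct, its constant names, and the exact URL layout are assumptions, not final:

```rust
use rust_bert::resources::RemoteResource;

/// Hypothetical resource definitions following the `M2M100ModelResources`
/// pattern; the struct name, constant names, and URLs are assumptions.
pub struct NLLBModelResources;

impl NLLBModelResources {
    /// Points at https://huggingface.co/facebook/nllb-200-distilled-600M
    /// (the `rust_model.ot` weights would still need to be uploaded there).
    pub const NLLB_200_DISTILLED_600M: (&'static str, &'static str) = (
        "nllb-200-distilled-600m/model",
        "https://huggingface.co/facebook/nllb-200-distilled-600M/resolve/main/rust_model.ot",
    );
}

fn _example() {
    // `RemoteResource::from_pretrained` accepts such a (name, URL) tuple.
    let _model = RemoteResource::from_pretrained(NLLBModelResources::NLLB_200_DISTILLED_600M);
}
```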

Help wanted

I am not fluent with the code base, so any help is welcome.

  1. NLLB uses a different language-code format: https://github.com/npatsakula/rust-tokenizers/blob/nllb_support/main/src/vocab/nllb_vocab.rs#L9. Is this a problem? (An illustration of the format difference follows after this list.)
  2. TranslationOptions requires a vocab_resource. What should it be for NLLB? The same resource as for the tokenizer?
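To make question 1 concrete, below is a small illustration of the two formats, assuming M2M100 tokens follow the padded `>>en.<<` style (see the maintainer's answer below) while NLLB uses FLORES-200 codes, i.e. an ISO 639-3 language code plus an ISO 15924 script code; the `get_nllb_language_code` helper is purely hypothetical and not part of rust-bert:

```rust
// NLLB (FLORES-200) codes combine an ISO 639-3 language code with an
// ISO 15924 script code, e.g. "eng_Latn"; M2M100 wraps shorter ISO codes
// into fixed-width tokens (the exact token shape shown is an assumption).
const M2M100_STYLE: [&str; 3] = [">>en.<<", ">>fr.<<", ">>zh.<<"];
const NLLB_STYLE: [&str; 3] = ["eng_Latn", "fra_Latn", "zho_Hans"];

// Hypothetical mapping helper, shown only to illustrate that a plain
// match from the existing `Language` enum would be enough.
fn get_nllb_language_code(language: &str) -> Option<&'static str> {
    match language {
        "English" => Some("eng_Latn"),
        "French" => Some("fra_Latn"),
        "ChineseMandarin" => Some("zho_Hans"),
        _ => None, // remaining 200+ codes to be filled in
    }
}

fn main() {
    for (m2m, nllb) in M2M100_STYLE.iter().zip(NLLB_STYLE.iter()) {
        println!("M2M100: {m2m}  NLLB: {nllb}");
    }
    assert_eq!(get_nllb_language_code("English"), Some("eng_Latn"));
}
```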

npatsakula · Aug 19 '22 14:08

Hello @npatsakula,

Thank you for working on adding support for NLLB; this will be a very useful addition to this library. I suggest using the distilled version (600M parameters) for tests and development, but it would be great to open pull requests on Hugging Face's model hub for the larger models as well.

Regarding your questions:

  1. The language-code format is not an issue. You will want to register the missing languages in the Language enum. You may need to add a new implementation on Language that returns the NLLB language code, but this is not strictly necessary: if the 200+ languages from NLLB can be mapped to ISO codes (the preferred option), you could simply extend one of the existing methods. The tokenizer and the vocab just need to be consistent with each other; those are the two sections of the code that need to agree. Please look at the M2M100 example: instead of using a model-specific language code, it was possible to fall back to the same ISO mapping. For M2M100, as you can see, the language codes are all padded to the same length (7), which makes tokenization easier.
  2. TranslationOption requires a TranslationConfig, which in turn requires a vocab_resource and optionally a vocab_merges. These are indeed the same file resources needed to instantiate the tokenizer (a sketch follows below).
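On point 2, here is a hedged sketch of what building the pipeline for NLLB might eventually look like, modeled on the existing M2M100 example from the README. `ModelType::NLLB` and all `NLLB*` resource constants are hypothetical (they do not exist yet), and the exact `TranslationConfig::new` signature may differ across versions; the key point is that `vocab_resource` is the same file used to build the tokenizer:

```rust
use rust_bert::pipelines::common::ModelType;
use rust_bert::pipelines::translation::{Language, TranslationConfig, TranslationModel};
use rust_bert::resources::RemoteResource;
use tch::Device;

fn main() -> anyhow::Result<()> {
    // All NLLB_* identifiers are hypothetical placeholders following the
    // M2M100 naming pattern.
    let model_resource = RemoteResource::from_pretrained(NLLBModelResources::NLLB_200_DISTILLED_600M);
    let config_resource = RemoteResource::from_pretrained(NLLBConfigResources::NLLB_200_DISTILLED_600M);
    // The same vocab file is used both here and to instantiate the tokenizer.
    let vocab_resource = RemoteResource::from_pretrained(NLLBVocabResources::NLLB_200_DISTILLED_600M);

    let translation_config = TranslationConfig::new(
        ModelType::NLLB, // hypothetical variant to be added
        model_resource,
        config_resource,
        vocab_resource,
        None, // SentencePiece-based tokenizer: no merges file expected
        NLLBSourceLanguages::NLLB_200_DISTILLED_600M, // hypothetical
        NLLBTargetLanguages::NLLB_200_DISTILLED_600M, // hypothetical
        Device::cuda_if_available(),
    );
    let model = TranslationModel::new(translation_config)?;

    let output = model.translate(&["Hello world!"], Language::English, Language::French)?;
    println!("{}", output.join(" "));
    Ok(())
}
```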

Please let me know if you have any further questions as you work towards the implementation.

guillaume-be · Aug 19 '22 16:08