Where to find the list of source languages?

Open nithinreddyy opened this issue 2 years ago • 2 comments

I'm using the code below to translate from Romanian to English:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

article = "Şeful ONU spune că nu există o soluţie militară în Siria"
inputs = tokenizer(article, return_tensors="pt")

# The English target code is "eng_Latn" (three-letter language code plus script), not "en_Latn"
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"], max_length=30
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

Can anyone please share a link to the list of supported src_lang values, or the path to it if it is in this GitHub repo?

nithinreddyy avatar Jan 24 '23 15:01 nithinreddyy

Supported languages can be found in the paper https://arxiv.org/pdf/2207.04672.pdf

dimabendera avatar May 11 '23 16:05 dimabendera

Check out the metadata section of README.md in their Hugging Face repo:

Input and output languages are entirely customizable with BCP-47 codes used by the FLORES-200 dataset

suzinyou avatar Feb 23 '24 21:02 suzinyou
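
If you would rather read the codes off the tokenizer than look them up in the paper or README, here is a minimal sketch. It assumes the language codes are part of the tokenizer vocabulary returned by `get_vocab()` and that they all follow the FLORES-200 shape seen in the question (`ron_Latn`: three lowercase letters, an underscore, a four-letter script name). The regex is a heuristic, and `filter_language_codes` and `list_nllb_language_codes` are hypothetical helper names, not part of transformers or fairseq.

```python
import re

# Heuristic pattern for FLORES-200 style codes such as "ron_Latn" or "eng_Latn".
# Assumption: every language code in the vocabulary has this shape.
FLORES_CODE = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")


def filter_language_codes(tokens):
    """Return the sorted, de-duplicated tokens that look like FLORES-200 codes."""
    return sorted({t for t in tokens if FLORES_CODE.match(t)})


def list_nllb_language_codes(model_name="facebook/nllb-200-distilled-600M"):
    """Load the tokenizer (downloads files on first use) and list its language codes."""
    # Imported lazily so the filter above can be used without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # get_vocab() maps every token string, including added special tokens, to its id.
    return filter_language_codes(tokenizer.get_vocab().keys())
```

Calling `list_nllb_language_codes()` should then give the values that are valid for `src_lang` and for the forced BOS token. Note that the filter rejects `en_Latn`, which is why the target code has to be written as `eng_Latn`.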