fairseq
Where to find the list of source languages?
I'm using the code below to translate from Romanian to English:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# src_lang sets the source language (Romanian, Latin script)
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
article = "Şeful ONU spune că nu există o soluţie militară în Siria"
inputs = tokenizer(article, return_tensors="pt")
# The target code for English is "eng_Latn"; "en_Latn" is not in the vocabulary
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"), max_length=30
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
Can anyone please share a link to the list of valid src_lang codes, or point me to the path if it is present in this GitHub repo?
Supported languages can be found in the paper https://arxiv.org/pdf/2207.04672.pdf
Check out the metadata section of README.md in their Hugging Face repo:
Input and output languages are entirely customizable with BCP-47 codes used by the FLORES-200 dataset
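If you would rather pull the codes out of the tokenizer itself than read them off the README, something like the sketch below might work. It assumes the language codes appear in the tokenizer vocabulary as standalone tokens shaped like "ron_Latn" (three lowercase letters, an underscore, a four-letter script tag). The helper names extract_language_codes and nllb_language_codes are made up for this example and are not part of transformers. An ordinary vocabulary piece could in principle match the pattern by accident, so treat the result as a starting point and check it against the README.

```python
import re

# Shape of a FLORES-200 style code: language tag, underscore, script tag,
# e.g. "ron_Latn" or "zho_Hans".
FLORES_CODE = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")


def extract_language_codes(tokens):
    """Return the sorted, de-duplicated tokens that look like FLORES-200 codes."""
    return sorted({t for t in tokens if FLORES_CODE.match(t)})


def nllb_language_codes(model_name="facebook/nllb-200-distilled-600M"):
    """Hypothetical helper: load the tokenizer and filter its vocabulary.

    Assumption: the language codes are present in tokenizer.get_vocab().
    Needs transformers installed and network access on first use.
    """
    from transformers import AutoTokenizer  # imported lazily on purpose

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return extract_language_codes(tokenizer.get_vocab().keys())
```

Calling nllb_language_codes() should then give a list you can search for the code you need, for example to confirm that "eng_Latn" is present and "en_Latn" is not.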