
Adding an unseen language to NLLB

Open sete-nay opened this issue 3 years ago • 2 comments

Hi. I'm trying to fine-tune NLLB on a new, unseen language, following the steps from here and the README. My source language is part of NLLB-200, but the target language is not. No other language from the same family is included either, so there are no related languages I can refer to. What should I set as the target language? Can you point me to example code for adding an unseen language to NLLB?

Thank you!

DROP=0.1 python examples/nllb/modeling/train/train_script.py \
    cfg=nllb200_dense3.3B_finetune_on_fbseed \
    cfg/dataset=$DATA_CONFIG \
    cfg.dataset.lang_pairs="$SRC-$TGT" \
    cfg.fairseq_root=$(pwd) \
    cfg.output_dir=$OUTPUT_DIR \
    cfg.dropout=$DROP \
    cfg.warmup=10 \
    cfg.finetune_from_model=$MODEL_FOLDER/checkpoint.pt

sete-nay, Nov 02 '22 12:11

I second this. I am also interested in the steps required to add a new language to the NLLB model.

bt2901, Nov 12 '22 21:11

Hi! As the Fairseq code for NLLB is not very actively supported, my recipe for adding a new language to the Huggingface implementation of NLLB might be relevant: https://cointegrated.medium.com/a37fc706b865.

avidale, Feb 20 '24 21:02
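
For readers wondering what "adding a language" amounts to mechanically, here is a minimal toy sketch of one common approach: register a new language tag in the vocabulary and give it an embedding row initialised from the average of existing language tags. It is not taken from the linked recipe or from the fairseq code. The function `add_language_token`, the tag `xxx_Latn`, and the tiny two-dimensional embeddings are all hypothetical and chosen only for illustration. Plain Python lists stand in for the real tokenizer and model so that the sketch runs without downloading anything.

```python
import random


def add_language_token(vocab, embeddings, new_token, init_from=None):
    """Append new_token to vocab and a matching row to embeddings.

    vocab:      dict mapping token -> row index in embeddings
    embeddings: list of equal-length lists of floats (one row per token)
    init_from:  optional list of existing tokens; the new row is the
                element-wise mean of their rows. If omitted, the row is
                drawn from a small Gaussian instead.
    Returns the index of new_token (existing index if already present).
    """
    if new_token in vocab:
        return vocab[new_token]

    dim = len(embeddings[0])
    if init_from:
        rows = [embeddings[vocab[tok]] for tok in init_from]
        # Average column by column across the chosen rows.
        new_row = [sum(col) / len(rows) for col in zip(*rows)]
    else:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
        new_row = [rng.gauss(0.0, 0.02) for _ in range(dim)]

    vocab[new_token] = len(embeddings)
    embeddings.append(new_row)
    return vocab[new_token]


# Toy stand-ins for a real vocabulary and embedding matrix.
vocab = {"<pad>": 0, "eng_Latn": 1, "rus_Cyrl": 2}
embeddings = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]

# "xxx_Latn" is a made-up tag for the unseen target language.
new_id = add_language_token(
    vocab, embeddings, "xxx_Latn", init_from=["eng_Latn", "rus_Cyrl"]
)
print(new_id, embeddings[new_id])  # 3 [2.0, 4.0]
```

When no related language exists, as in the original question, averaging over several unrelated language tags (or falling back to random initialisation) is one plausible starting point; the new row is then learned during fine-tuning. In the Hugging Face library, growing the embedding matrix is normally done with `model.resize_token_embeddings(...)` rather than by hand, and whether the tokenizer needs extra handling for the new tag depends on the library version.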