firefox-translations-training
Training pipelines for Firefox Translations neural machine translation models
fixes #102 (tested in my training, this works). One thing that is a bit ugly: to make `marian` early-stop on the correct metric, I added it to...
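For context, Marian uses the first entry in `--valid-metrics` for early stopping, so listing the desired metric first is one way to get this behavior. A minimal sketch; the extra metrics and file paths are illustrative:

```bash
# the first metric in --valid-metrics drives early stopping;
# metric names and paths here are illustrative
marian \
  --model model.npz \
  --train-sets corpus.src corpus.trg \
  --valid-sets dev.src dev.trg \
  --valid-metrics chrf ce-mean-words bleu-detok \
  --early-stopping 20
```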
In the train file, we always save the `final` model as the `chrf` model. https://github.com/mozilla/firefox-translations-training/blob/main/pipeline/train/train.sh#L53-L54 As a result, the model ends up with the wrong name. The pipeline expects: https://github.com/mozilla/firefox-translations-training/blob/main/Snakefile#L433...
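One possible fix is to link whichever best-metric checkpoint was selected to the filename the pipeline expects. A sketch with hypothetical names, since the real target filename is defined by the Snakefile rule and `model.npz.best-chrf.npz` assumes Marian's `--keep-best` naming:

```bash
# hypothetical names: the real expected filename lives in the Snakefile,
# and the best-<metric> checkpoint name assumes --keep-best is enabled
model_dir="models/teacher"  # illustrative
best_metric="chrf"
ln -sf "model.npz.best-${best_metric}.npz" \
  "${model_dir}/final.model.npz.best-${best_metric}.npz"
```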
At the moment, the vocab is trained on both the `src` and `trg` files. Instead, I'd like to have two separate vocabularies. My use case is having two completely different...
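Since the pipeline's vocabulary is a SentencePiece model, one way to get separate vocabularies is to run `spm_train` once per side. A minimal sketch; vocab sizes and file names are illustrative:

```bash
# one SentencePiece model per side instead of a joint one;
# sizes and file names are illustrative
spm_train --input=corpus.src --model_prefix=vocab.src \
  --vocab_size=32000 --character_coverage=1.0
spm_train --input=corpus.trg --model_prefix=vocab.trg \
  --vocab_size=32000 --character_coverage=1.0
```

Marian would then need the two models passed separately, e.g. `--vocabs vocab.src.spm vocab.trg.spm`; Marian recognizes SentencePiece vocabs by the `.spm` extension, so the `.model` outputs would need renaming.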
To set a `custom-corpus`, I do something like this, which works for loading and saving the data:

> custom-corpus_/custom_corpus/fingerspelling/devtest

However, when I do something similar for `mono-corpus`:

> custom-mono_/custom_corpus/common_words/mono

it cannot...
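For reference, these importer prefixes are listed in the `datasets` section of the training config. A sketch of how the two entries above would appear; the section names are my assumption based on the repo's example config:

```yaml
datasets:
  devtest:
    - custom-corpus_/custom_corpus/fingerspelling/devtest
  mono-src:
    - custom-mono_/custom_corpus/common_words/mono
```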
The forward translation performed by the teacher has this setting:

```
max-length: 200
max-length-crop: true
```

Do not do that. This will create training data that has a long source...
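Instead of cropping, a safer option is to drop inputs that exceed the length limit before translating them, so the source and its translation stay consistent. A minimal sketch; the file names and the 200-token limit are illustrative:

```bash
# keep only lines with at most 200 whitespace-separated tokens,
# so the teacher never sees (and never crops) over-long sources
awk 'NF <= 200' mono.src > mono.filtered.src
```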
It appears that the pipeline only supports training a joint BPE model, but it is sometimes better to have separate source/target BPE vocabularies
This introduces two new options:
1. In addition to external test sets, extract held-out test sets from each parallel training corpus and evaluate models on those held-out sets as well.
...
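A held-out split can be carved out of each corpus before training while keeping the two sides aligned. A minimal sketch; file names and the 1000-line split size are illustrative:

```bash
# pair the sides, shuffle, then split off a held-out set;
# file names and the 1000-line size are illustrative
paste corpus.src corpus.trg | shuf > shuffled.tsv
head -n 1000 shuffled.tsv | cut -f1 > heldout.src
head -n 1000 shuffled.tsv | cut -f2 > heldout.trg
tail -n +1001 shuffled.tsv | cut -f1 > train.src
tail -n +1001 shuffled.tsv | cut -f2 > train.trg
```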
With a 300M dataset and 128 GB of RAM, the workaround is to shuffle the dataset after the merge step, disable `--shuffle-in-ram`, and use `--shuffle batches`.
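Concretely, that means doing the shuffle on disk once and letting Marian shuffle only batches. A sketch; file names are illustrative:

```bash
# shuffle the merged corpus on disk once, keeping the sides aligned
paste corpus.src corpus.trg | shuf > shuffled.tsv
cut -f1 shuffled.tsv > corpus.shuf.src
cut -f2 shuffled.tsv > corpus.shuf.trg

# then train without --shuffle-in-ram, shuffling only batches
marian --train-sets corpus.shuf.src corpus.shuf.trg \
  --shuffle batches  # other training options omitted
```

Note that GNU `shuf` itself holds its input in memory, so at this corpus size a disk-friendly shuffler (e.g. terashuf) may be needed for the on-disk step.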
I am continuing to test the pipeline, and I see that almost all teacher models don't continue training even after I increased the patience by setting `early-stopping: 20`. Currently, continuation happens by training...
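For background on why the patience setting matters here: if I understand Marian's resume behavior correctly, re-running with the same `--model` path restores the saved training state, including the validator's stall counts, so a run whose patience was already exhausted stops immediately unless `--early-stopping` is raised. A sketch; paths are illustrative:

```bash
# re-running with the same --model path resumes from the saved
# training state; raising --early-stopping restores some patience
marian --model models/teacher/model.npz \
  --train-sets corpus.src corpus.trg \
  --valid-sets dev.src dev.trg \
  --early-stopping 20
```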
https://github.com/mozilla/firefox-translations-training/blob/03a2ddaa3f7d9c9af3a236bb2dbb94db36c16373/pipeline/translate/translate.sh#L22 When performing backtranslation, we want slightly different settings for the decoder, since we should be doing output sampling rather than beam search. Relevant Marian setting: https://github.com/marian-nmt/marian-dev/blob/master/src/common/config_parser.cpp#L711 `-b 1 ...`
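In Marian, that combination would look roughly like this. A sketch; file names are illustrative, and the exact form of `--output-sampling` (bare flag vs. an argument) depends on the Marian version:

```bash
# sample a single hypothesis per sentence instead of beam search
marian-decoder -b 1 --output-sampling \
  -m model.npz -v vocab.spm vocab.spm \
  -i mono.trg -o backtranslated.src
```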