
Training pipelines for Firefox Translations neural machine translation models

311 firefox-translations-training issues

Fixes #102 (tested in my own training run; this works). One thing that is a bit ugly: to make `marian` early-stop on the correct metric, I added it to...

In the train script, we always save the `final` model as the `chrf` model: https://github.com/mozilla/firefox-translations-training/blob/main/pipeline/train/train.sh#L53-L54 As a result, the model gets the wrong name. The pipeline expects: https://github.com/mozilla/firefox-translations-training/blob/main/Snakefile#L433...
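A minimal sketch of the kind of fix, assuming `train.sh` gains a configurable metric (the paths, the `best_model_metric` variable, and the `final.model.npz.best.npz` name are hypothetical stand-ins, not the pipeline's actual conventions):

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical stand-ins for the train.sh context.
model_dir=$(mktemp -d)
best_model_metric="${BEST_MODEL_METRIC:-chrf}"

# Simulate the per-metric checkpoints Marian writes during validation.
touch "$model_dir/model.npz.best-chrf.npz" \
      "$model_dir/model.npz.best-ce-mean-words.npz"

# Pick the checkpoint for the configured metric instead of hard-coding chrf.
cp "$model_dir/model.npz.best-$best_model_metric.npz" \
   "$model_dir/final.model.npz.best.npz"

echo "final model: $model_dir/final.model.npz.best.npz"
```

This keeps the file the rest of the pipeline consumes decoupled from whichever validation metric drives early stopping.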

At the moment, the vocab is trained on both the `src` and `trg` files. Instead, I'd like to have two separate vocabularies. My use case is having two completely different...

To set a `custom-corpus`, I do something like this, which works for loading and saving the data:

> custom-corpus_/custom_corpus/fingerspelling/devtest

However, doing something similar for `mono-corpus`:

> custom-mono_/custom_corpus/common_words/mono

It cannot...

The forward translation performed by the teacher uses this setting:

```
max-length: 200
max-length-crop: true
```

Do not do that. This will create training data that has a long source...

bug
quality
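Assuming the intent is to drop over-long sentence pairs rather than truncate one side of them, the safer combination of the Marian options involved would look roughly like this (a sketch, not the pipeline's actual fix):

```yaml
max-length: 200
max-length-crop: false  # skip pairs longer than max-length instead of cropping them
```

With cropping disabled, Marian omits over-length sentences from the batch rather than producing truncated source/target pairs.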

It appears that the pipeline only supports training a joint BPE model, but it is sometimes better to have separate source/target BPE vocabularies.

enhancement
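Marian itself accepts one vocabulary file per stream, so separate source/target vocabularies are at least expressible at the model-config level; a sketch with hypothetical file names:

```yaml
vocabs:
  - vocab.src.spm   # SentencePiece model trained only on source-side text
  - vocab.trg.spm   # SentencePiece model trained only on target-side text
```

The open work would be teaching the pipeline's vocab-training step to produce the two files separately.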

This introduces two new options:

1. In addition to external test sets, extract held-out test sets from each parallel training corpus and evaluate models on those held-out sets as well....

With a 300M dataset and 128 GB RAM, the workaround is to shuffle the dataset after the merge step, disable `--shuffle-in-ram`, and use `--shuffle batches`.

bug
optimization
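The workaround above corresponds to roughly these Marian training settings (a sketch; flag spellings per marian-dev's config parser):

```yaml
shuffle: batches       # shuffle only the order of pre-read batches, not the whole corpus
shuffle-in-ram: false  # keep the corpus shuffle on disk instead of in memory
```

This trades shuffle quality for memory: the corpus itself must then be pre-shuffled on disk, which is why the workaround shuffles after the merge step.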

I am continuing to test the pipeline, and I see that almost all teacher models do not continue training even after I increased patience by setting `early-stopping: 20`. Currently, continuation happens by training...

bug
quality
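For reference, Marian tracks early stopping against the first metric listed in `valid-metrics`; a hedged sketch of the relevant training options (values illustrative only):

```yaml
early-stopping: 20   # stop after 20 consecutive validations without improvement
valid-metrics:       # the first listed metric drives early stopping
  - chrf
  - ce-mean-words
```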

https://github.com/mozilla/firefox-translations-training/blob/03a2ddaa3f7d9c9af3a236bb2dbb94db36c16373/pipeline/translate/translate.sh#L22 When performing backtranslation, we want slightly different decoder settings, since we should be doing output sampling rather than beam search. Relevant Marian setting: https://github.com/marian-nmt/marian-dev/blob/master/src/common/config_parser.cpp#L711 `-b 1...

quality
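The suggested decoder change can be sketched as the following Marian settings, assuming the `--output-sampling` option referenced in the config parser above (these are not the pipeline's final values):

```yaml
beam-size: 1          # -b 1: no beam search
output-sampling: true # sample from the output distribution instead of taking the argmax
```

Sampling produces more diverse, noisier backtranslations, which is generally what you want for synthetic source-side data.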