firefox-translations-training
Training pipelines for Firefox Translations neural machine translation models
fixes #102 (tested in my training, this works). One thing that is a bit ugly: to make `marian` early-stop on the correct metric, I added it to...
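For context, Marian uses the first entry in `--valid-metrics` for early stopping, so listing the desired metric first is one way to get this behavior. A minimal sketch; the extra metrics and file paths are illustrative:

```bash
# the first metric in --valid-metrics drives early stopping;
# metric names and paths here are illustrative
marian \
  --model model.npz \
  --train-sets corpus.src corpus.trg \
  --valid-sets dev.src dev.trg \
  --valid-metrics chrf ce-mean-words bleu-detok \
  --early-stopping 20
```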
In the train file, we always save the `final` model as the `chrf` model. https://github.com/mozilla/firefox-translations-training/blob/main/pipeline/train/train.sh#L53-L54 As a result, the model ends up with the wrong name. The pipeline expects: https://github.com/mozilla/firefox-translations-training/blob/main/Snakefile#L433...
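One possible fix is to link whichever best-metric checkpoint was selected to the filename the pipeline expects. A sketch with hypothetical names, since the real target filename is defined by the Snakefile rule and `model.npz.best-chrf.npz` assumes Marian's `--keep-best` naming:

```bash
# hypothetical names: the real expected filename lives in the Snakefile,
# and the best-<metric> checkpoint name assumes --keep-best is enabled
model_dir="models/teacher"  # illustrative
best_metric="chrf"
ln -sf "model.npz.best-${best_metric}.npz" \
  "${model_dir}/final.model.npz.best-${best_metric}.npz"
```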
At the moment, the vocab is trained on both the `src` and `trg` files. Instead, I'd like to have two separate vocabularies. My use case is having two completely different...
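Since the pipeline's vocabulary is a SentencePiece model, one way to get separate vocabularies is to run `spm_train` once per side. A minimal sketch; vocab sizes and file names are illustrative:

```bash
# one SentencePiece model per side instead of a joint one;
# sizes and file names are illustrative
spm_train --input=corpus.src --model_prefix=vocab.src \
  --vocab_size=32000 --character_coverage=1.0
spm_train --input=corpus.trg --model_prefix=vocab.trg \
  --vocab_size=32000 --character_coverage=1.0
```

Marian would then need the two models passed separately, e.g. `--vocabs vocab.src.spm vocab.trg.spm`; Marian recognizes SentencePiece vocabs by the `.spm` extension, so the `.model` outputs would need renaming.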
To set a `custom-corpus`, I do something like this, which works for loading and saving the data:

> custom-corpus_/custom_corpus/fingerspelling/devtest

However, when I do something similar for `mono-corpus`:

> custom-mono_/custom_corpus/common_words/mono

it cannot...
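For reference, these importer prefixes are listed in the `datasets` section of the training config. A sketch of how the two entries above would appear; the section names are my assumption based on the repo's example config:

```yaml
datasets:
  devtest:
    - custom-corpus_/custom_corpus/fingerspelling/devtest
  mono-src:
    - custom-mono_/custom_corpus/common_words/mono
```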
The forward translation performed by the teacher has this setting:

```
max-length: 200
max-length-crop: true
```

Do not do that. This will create training data that has a long source...
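Instead of cropping, a safer option is to drop inputs that exceed the length limit before translating them, so the source and its translation stay consistent. A minimal sketch; the file names and the 200-token limit are illustrative:

```bash
# keep only lines with at most 200 whitespace-separated tokens,
# so the teacher never sees (and never crops) over-long sources
awk 'NF <= 200' mono.src > mono.filtered.src
```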
It appears that the pipeline only supports training a joint BPE model, but it is sometimes better to have separate source/target BPE vocabularies
This introduces two new options:
1. In addition to external test sets, extract held-out test sets from each parallel training corpus and evaluate models on those held-out sets as well.
...
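A held-out split can be carved out of each corpus before training while keeping the two sides aligned. A minimal sketch; file names and the 1000-line split size are illustrative:

```bash
# pair the sides, shuffle, then split off a held-out set;
# file names and the 1000-line size are illustrative
paste corpus.src corpus.trg | shuf > shuffled.tsv
head -n 1000 shuffled.tsv | cut -f1 > heldout.src
head -n 1000 shuffled.tsv | cut -f2 > heldout.trg
tail -n +1001 shuffled.tsv | cut -f1 > train.src
tail -n +1001 shuffled.tsv | cut -f2 > train.trg
```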
With a 300M dataset and 128 GB of RAM, the workaround is to shuffle the dataset after the merge step, disable `--shuffle-in-ram`, and use `--shuffle batches`.
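Concretely, that means doing the shuffle on disk once and letting Marian shuffle only batches. A sketch; file names are illustrative:

```bash
# shuffle the merged corpus on disk once, keeping the sides aligned
paste corpus.src corpus.trg | shuf > shuffled.tsv
cut -f1 shuffled.tsv > corpus.shuf.src
cut -f2 shuffled.tsv > corpus.shuf.trg

# then train without --shuffle-in-ram, shuffling only batches
marian --train-sets corpus.shuf.src corpus.shuf.trg \
  --shuffle batches  # other training options omitted
```

Note that GNU `shuf` itself holds its input in memory, so at this corpus size a disk-friendly shuffler (e.g. terashuf) may be needed for the on-disk step.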
I am continuing to test the pipeline, and I see that almost all teacher models don't continue training even after I increased the patience by setting `early-stopping: 20`. Currently, continuation happens by training...
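For background on why the patience setting matters here: if I understand Marian's resume behavior correctly, re-running with the same `--model` path restores the saved training state, including the validator's stall counts, so a run whose patience was already exhausted stops immediately unless `--early-stopping` is raised. A sketch; paths are illustrative:

```bash
# re-running with the same --model path resumes from the saved
# training state; raising --early-stopping restores some patience
marian --model models/teacher/model.npz \
  --train-sets corpus.src corpus.trg \
  --valid-sets dev.src dev.trg \
  --early-stopping 20
```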
https://github.com/mozilla/firefox-translations-training/blob/03a2ddaa3f7d9c9af3a236bb2dbb94db36c16373/pipeline/translate/translate.sh#L22 When performing backtranslation, we want slightly different settings for the decoder, since we should be doing output sampling rather than beam search. Relevant Marian setting: https://github.com/marian-nmt/marian-dev/blob/master/src/common/config_parser.cpp#L711 `-b 1 ...`
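In Marian, that combination would look roughly like this. A sketch; file names are illustrative, and the exact form of `--output-sampling` (bare flag vs. an argument) depends on the Marian version:

```bash
# sample a single hypothesis per sentence instead of beam search
marian-decoder -b 1 --output-sampling \
  -m model.npz -v vocab.spm vocab.spm \
  -i mono.trg -o backtranslated.src
```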