firefox-translations-training issues

Investigate the effect of back-translations

We should figure out what proportion of back-translated data to use for teacher training. For example, based on this validation curve, 70:30 one-stage training slightly outperforms 60:40 + fine-tuning...
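To make the ratios concrete, here is a minimal sketch of how a 70:30 or 60:40 mix translates into example counts. It assumes the first number is the share of original parallel data and the second the share of back-translated data, which the issue does not state; `mix_counts` is a hypothetical helper, not part of the pipeline.

```python
def mix_counts(total: int, original_share: float) -> tuple[int, int]:
    """Split `total` training examples between original and back-translated data.

    Assumption: `original_share` is the fraction of original parallel data;
    the remainder is back-translated data.
    """
    if not 0.0 <= original_share <= 1.0:
        raise ValueError("original_share must be between 0 and 1")
    n_original = round(total * original_share)
    return n_original, total - n_original


if __name__ == "__main__":
    for share in (0.7, 0.6):
        print(share, mix_counts(1000, share))
```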

eu9ene

quality

Migrate all the run_task tests into a separate folder

They have rather unique behavior, so it would be helpful to separate them out. Something like:
* `tests/task`
* `tests/unit`

gregtatum

refactoring

Do not use WMTNews as training!

The WMTNews corpus at OPUS is just a compilation of the WMT test sets, so it must not be included in the training data:
https://github.com/mozilla/firefox-translations-training/blob/1f7ab70cd4dbb64e16bb6b38840490c2f2259cb0/configs/autogenerated/en-tr-spring-2024.yml#L79-L79
https://github.com/mozilla/firefox-translations-training/blob/1f7ab70cd4dbb64e16bb6b38840490c2f2259cb0/configs/autogenerated/en-ro-spring-2024.yml#L123
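One possible guard is to filter the corpus out when the training dataset list is assembled. The sketch below assumes dataset entries are plain strings such as `opus_WMT-News/v2019`; that format, and the helper names, are assumptions for illustration and not taken from the linked configs.

```python
import re


def is_wmt_news(entry: str) -> bool:
    """Return True if a dataset entry looks like the WMTNews corpus.

    Punctuation and case are ignored so that "WMT-News", "WMTNews" and
    "wmt_news" all match.
    """
    return "wmtnews" in re.sub(r"[^a-z0-9]", "", entry.lower())


def drop_wmt_news(datasets: list[str]) -> list[str]:
    """Keep every training dataset except WMTNews."""
    return [d for d in datasets if not is_wmt_news(d)]


if __name__ == "__main__":
    # Hypothetical entries, only for demonstration.
    print(drop_wmt_news(["opus_WMT-News/v2019", "opus_ParaCrawl/v9"]))
```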

ZJaume

More corpora specific fixes

[Here](https://github.com/ZJaume/clean/tree/master/fixes) is a compilation of fixes I made in the past that could be incorporated into the OpusCleaner configs.

ZJaume

Limit the amount of data used for distillation

1

In #771 I ran an experiment to see how the size of the distillation corpus affects the students' COMET scores. Adding more data...
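The issue does not say how a limit would be applied. One option is to draw a uniform random sample of at most N lines in a single pass (reservoir sampling), so the full corpus never has to fit in memory. `sample_lines` and its parameters are hypothetical names for this sketch.

```python
import random
from typing import Iterable


def sample_lines(lines: Iterable[str], limit: int, seed: int = 0) -> list[str]:
    """Reservoir-sample at most `limit` lines in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: list[str] = []
    for i, line in enumerate(lines):
        if i < limit:
            # Fill the reservoir with the first `limit` lines.
            reservoir.append(line)
        else:
            # Replace a random slot with probability limit / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < limit:
                reservoir[j] = line
    return reservoir


if __name__ == "__main__":
    corpus = (f"sentence {i}" for i in range(1_000))
    print(len(sample_lines(corpus, limit=100)))
```

A fixed seed keeps the sample reproducible between runs, which helps when comparing students trained on different corpus sizes.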

gregtatum

cost & perf

add testing & linting for taskcluster directory

2

At the very least: Python and YAML. We might want to consider adding some linting and checks around the `pipeline` directory at the same time.

bhearsum

taskcluster

Investigate removing teacher ensemble training

Training a second teacher improves performance only slightly. It may be more cost efficient to take the quality hit and remove it.

Comet Change | Average Type
-- | --...

gregtatum

cost & perf

experiment

Update data importer to support CJK

The goal is to treat the language code zh as Mandarin Chinese in Simplified script for now.
- convert all Chinese to Simplified (support for Traditional will be handled separately...
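As a rough illustration of the conversion step, the sketch below maps Traditional characters to Simplified with a tiny hand-written table. The table and `to_simplified` are hypothetical and cover only four characters; a real importer would need a complete conversion table, for example from the OpenCC project.

```python
# Toy mapping for illustration only; not a complete conversion table.
TRADITIONAL_TO_SIMPLIFIED = str.maketrans(
    {"漢": "汉", "語": "语", "學": "学", "體": "体"}
)


def to_simplified(text: str) -> str:
    """Replace known Traditional characters; leave everything else untouched."""
    return text.translate(TRADITIONAL_TO_SIMPLIFIED)


if __name__ == "__main__":
    print(to_simplified("漢語"))
```

Text that is already Simplified passes through unchanged, so the step can be applied to every zh corpus without first detecting the script.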

eu9ene

Update training to support CJK

1

- Output Moses-tokenized text from the alignments step (we used to remap alignments to whitespace-based tokenization to match the text)
- Use detok OpusTrainer modifiers to detokenize the text back...

eu9ene

Investigate OpusTrainer compatibility for CJK

4

As far as I understand, some modifiers are not needed (UpperCase, TitleCase), but some can still be used:
- Noise
- Inline noise
- Typos? (will the current typos library...
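The point about UpperCase and TitleCase can be checked with plain Python string methods, since CJK characters have no case. This is a small sketch with a hypothetical helper; it does not call OpusTrainer and only shows why case-changing modifiers would leave such text unchanged.

```python
def case_modifiers_are_noops(text: str) -> bool:
    """True when upper-casing and title-casing both leave `text` unchanged."""
    return text.upper() == text and text.title() == text


if __name__ == "__main__":
    for sample in ("你好世界", "こんにちは", "안녕하세요", "hello world"):
        print(sample, case_modifiers_are_noops(sample))
```

Mixed lines that contain Latin text (product names, URLs) are still affected by casing, so a check like this would have to run per line, not per language.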

eu9ene

language-coverage
