firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

Results 311 firefox-translations-training issues

We should figure out what proportion of back-translated data to use for teacher training. For example, based on this validation curve, 70:30 one-stage training slightly outperforms 60:40 + fine-tuning...

quality
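To make the ratio question in the issue above concrete, here is a minimal sketch of building a fixed-ratio mix of original and back-translated sentence pairs. It is not the pipeline's implementation: the function name `mix_corpora` and its parameters are hypothetical, and it assumes the first number in a ratio such as 70:30 is the share of original parallel data, which the snippet does not state.

```python
import random


def mix_corpora(original, back_translated, total, original_share=0.7, seed=0):
    """Sample `total` items, with `original_share` of them drawn from `original`.

    Hypothetical helper for illustration only. Assumes "70:30" means
    70% original data and 30% back-translated data.
    """
    if not 0.0 <= original_share <= 1.0:
        raise ValueError("original_share must be between 0 and 1")

    n_original = round(total * original_share)
    n_back_translated = total - n_original
    if n_original > len(original) or n_back_translated > len(back_translated):
        raise ValueError("not enough data to build a mix of the requested size")

    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = rng.sample(original, n_original) + rng.sample(
        back_translated, n_back_translated
    )
    rng.shuffle(mixed)  # interleave the two sources
    return mixed
```

Calling `mix_corpora(original, back_translated, total=10, original_share=0.6)` would give the 60:40 variant under the same assumption, so the two settings can be compared on identical source data.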

They have a rather unique behavior, and so it would be helpful to have them separated out. Something like:
* `tests/task`
* `tests/unit`

refactoring

The WMTNews corpus at OPUS is just a compilation of the WMT test sets, so it must not be included as training data:
https://github.com/mozilla/firefox-translations-training/blob/1f7ab70cd4dbb64e16bb6b38840490c2f2259cb0/configs/autogenerated/en-tr-spring-2024.yml#L79-L79
https://github.com/mozilla/firefox-translations-training/blob/1f7ab70cd4dbb64e16bb6b38840490c2f2259cb0/configs/autogenerated/en-ro-spring-2024.yml#L123
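One way to guard against this kind of leakage is to filter the list of training datasets before it is written to a config. The sketch below is only an illustration: `drop_wmt_news` is a hypothetical helper, and the dataset names in the usage note are assumed examples rather than entries copied from the linked configs.

```python
def drop_wmt_news(datasets):
    """Return `datasets` without any entry whose name mentions WMT-News.

    Hypothetical helper for illustration only. Matching ignores case and
    the "-" and "_" separators, so "WMT-News", "WMTNews" and "wmt_news"
    are all caught.
    """

    def normalized(name):
        return name.lower().replace("-", "").replace("_", "")

    return [name for name in datasets if "wmtnews" not in normalized(name)]
```

For example, `drop_wmt_news(["opus_WMT-News/v2019", "opus_NLLB/v1"])` keeps only the second entry.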

[Here](https://github.com/ZJaume/clean/tree/master/fixes) is a compilation of fixes I made in the past that could be incorporated into the OpusCleaner configs.

In #771 I ran an experiment to see how the size of the distillation corpus affects the COMET score of the students. Adding more data...

cost & perf

At the very least: Python and YAML. We might want to consider adding some linting and checks around the `pipeline` directory at the same time.

taskcluster

Training a second teacher improves performance only slightly. It may be more cost efficient to take the quality hit and remove it.

Comet Change | Average Type
-- | --...

cost & perf
experiment

The goal is to treat the language code zh as Mandarin Chinese in Simplified script for now.
- convert all Chinese to Simplified (support for Traditional will be handled separately...

- Output Moses-tokenized text from the alignments step (we used to remap alignments to whitespace-based tokenization to match the text)
- Use detok OpusTrainer modifiers to detokenize the text back...

As far as I understand, some modifiers are not needed (UpperCase, TitleCase) but some can still be used:
- Noise
- Inline noise
- Typos? (will the current typos library...

language-coverage