firefox-translations-training
Training pipelines for Firefox Translations neural machine translation models
We should figure out what proportion of back-translated data to use for teacher training. For example, based on this validation curve, 70:30 one-stage training slightly outperforms 60:40 + fine-tuning...
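As a rough illustration of what a fixed mixing ratio means in practice, here is a minimal sketch that draws a batch with a given share of original data. It assumes the first number in "70:30" is the share of original parallel data and the second is the back-translated share; the function name `sample_mixed_batch` and its parameters are hypothetical and not part of the pipeline.

```python
import random


def sample_mixed_batch(original, backtranslated, batch_size, original_share, seed=0):
    """Draw a batch in which roughly `original_share` of the items are original data.

    Hypothetical helper for illustration only; sampling is with replacement.
    """
    rng = random.Random(seed)
    # Number of items taken from each source, e.g. 7 and 3 for a 70:30 mix of 10.
    n_original = round(batch_size * original_share)
    n_backtranslated = batch_size - n_original
    batch = rng.choices(original, k=n_original)
    batch += rng.choices(backtranslated, k=n_backtranslated)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    original = ["orig-1", "orig-2", "orig-3"]
    backtranslated = ["bt-1", "bt-2"]
    print(sample_mixed_batch(original, backtranslated, batch_size=10, original_share=0.7))
```

Comparing ratios would then amount to training with different `original_share` values and reading off the validation curves.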
They have a rather unique behavior, and so it would be helpful to have them separated out. Something like:

* `tests/task`
* `tests/unit`
The WMTNews corpus at OPUS is just a compilation of the WMT test sets, so it must not be included as training data:

https://github.com/mozilla/firefox-translations-training/blob/1f7ab70cd4dbb64e16bb6b38840490c2f2259cb0/configs/autogenerated/en-tr-spring-2024.yml#L79-L79

https://github.com/mozilla/firefox-translations-training/blob/1f7ab70cd4dbb64e16bb6b38840490c2f2259cb0/configs/autogenerated/en-ro-spring-2024.yml#L123
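One way to guard against this is to filter the training dataset list before it is written to a config. The sketch below is only an illustration: the dataset names, the `EXCLUDED_CORPORA` tuple, and the function `drop_test_set_corpora` are assumptions, not the config generator's actual code.

```python
# Corpus names (lowercased, separators removed) that are known to be built
# from test sets. Hypothetical list; extend as needed.
EXCLUDED_CORPORA = ("wmtnews",)


def drop_test_set_corpora(train_datasets):
    """Return the dataset names that do not match an excluded corpus."""
    kept = []
    for name in train_datasets:
        # Normalize so "WMT-News", "WMT_News" and "WMTNews" all compare equal.
        normalized = name.lower().replace("-", "").replace("_", "")
        if any(marker in normalized for marker in EXCLUDED_CORPORA):
            continue
        kept.append(name)
    return kept


if __name__ == "__main__":
    # Example names are made up for illustration.
    datasets = ["opus_Example-Corpus/v1", "opus_WMT-News/v2019", "mtdata_Example-1-eng-tur"]
    print(drop_test_set_corpora(datasets))
```

Matching on a normalized substring is deliberately broad, so any new exclusion marker should be checked against legitimate corpus names first.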
[Here](https://github.com/ZJaume/clean/tree/master/fixes) is a compilation of fixes I made in the past that could be incorporated into the OpusCleaner configs.
In #771 I ran an experiment to see how the size of the distillation corpus affects the students' COMET scores. Adding more data...
At the very least: Python and YAML. We might want to consider adding some linting and checks around the `pipeline` directory at the same time.
Training a second teacher improves performance only slightly. It may be more cost-efficient to take the quality hit and remove it. Comet Change | Average Type -- | --...
The goal is to treat the language code `zh` as Mandarin Chinese in Simplified script for now.

- convert all Chinese to Simplified (support for Traditional will be handled separately...
- Output Moses-tokenized text from the alignments step (we used to remap alignments to whitespace-based tokenization to match the text)
- Use detok OpusTrainer modifiers to detokenize the text back...
As far as I understand, some modifiers are not needed (UpperCase, TitleCase), but some can still be used:

- Noise
- Inline noise
- Typos? (will the current typos library...