firefox-translations-training
Training pipelines for Firefox Translations neural machine translation models
If we're not already filtering the soft hyphen (0xAD) from the training data, we should be. See https://github.com/browsermt/bergamot-translator/issues/337
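A minimal sketch of what such a cleaning step could look like, written as a stdin-to-stdout filter so it could be dropped into a cleaning pipe (this is an illustration, not the repo's actual code):

```python
import sys

SOFT_HYPHEN = "\u00ad"  # 0xAD, invisible in most renderers

def strip_soft_hyphens(line: str) -> str:
    """Remove soft hyphens; they carry no translatable content."""
    return line.replace(SOFT_HYPHEN, "")

if __name__ == "__main__":
    for line in sys.stdin:
        sys.stdout.write(strip_soft_hyphens(line))
```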
This issue matters only for HPC training, where we don't want jobs to be too small, so we have to group them. It is even beneficial to have smaller...
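Snakemake's `group` directive is one existing mechanism for this: rules sharing a group name are submitted to the cluster as a single job. A sketch, with hypothetical rule names and commands:

```python
# Snakefile sketch: grouped rules avoid many tiny HPC submissions.
rule clean_mono:
    input: "data/mono/{lang}.raw.gz"
    output: "data/mono/{lang}.clean.gz"
    group: "preprocessing"          # grouped with other small steps
    shell: "pigz -dc {input} | clean-mono.sh | pigz > {output}"

rule deduplicate:
    input: "data/mono/{lang}.clean.gz"
    output: "data/mono/{lang}.dedup.gz"
    group: "preprocessing"          # runs in the same cluster job
    shell: "pigz -dc {input} | sort -u | pigz > {output}"
```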
We currently have two data backends: OpusMT and mtdata. Each seems to give the user a semi-disjoint set of datasets, and on top of that, those in mtdata are...
After training is finished, we would like to generate a report similar to this one: https://github.com/browsermt/students/blob/master/bgen/README.md. Even better, we would like to generate all the companion scripts so that the...
Many of the parallel corpora are hosted on statmt.org. As far as I can tell, the Snakemake download-data step executes all corpus downloads in parallel, which unfortunately means...
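Snakemake's `resources` mechanism can serve as a per-host throttle here. A sketch (rule name and URL pattern are illustrative):

```python
# Snakefile sketch: cap concurrent statmt.org connections.
rule download_statmt:
    output: "data/download/{dataset}.gz"
    resources:
        statmt_connections=1        # each running download holds one slot
    shell: "wget -O {output} 'https://data.statmt.org/{wildcards.dataset}.gz'"
```

Running with `snakemake --resources statmt_connections=2` would then allow at most two simultaneous downloads from statmt.org while unrelated rules proceed unthrottled.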
@kpu and I both noticed that pre-tokenized texts sometimes appear in the wild, resulting in our models learning to place spaces around punctuation marks, which just looks bad...
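A cheap heuristic for catching such lines during cleaning, sketched in Python (the pattern and threshold are assumptions to tune on real data):

```python
import re

# Space *before* sentence punctuation is a strong tokenization signal:
# natural text reads "word, word." while tokenized text reads "word , word ."
TOKENIZED_PUNCT = re.compile(r"\s[.,!?;:%)]|\(\s")

def looks_tokenized(line: str, max_hits: int = 2) -> bool:
    """Flag lines that likely come from pre-tokenized corpora."""
    return len(TOKENIZED_PUNCT.findall(line)) > max_hits

# Usage: drop suspicious pairs during data cleaning.
# keep = [p for p in pairs if not looks_tokenized(p.src) and not looks_tokenized(p.trg)]
```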
https://github.com/mozilla/firefox-translations-training/blob/main/pipeline/setup/install-deps.sh crashes if the user doesn't have root access and apt is not configured to run as a regular user. I can see that in https://github.com/mozilla/firefox-translations-training/blob/174cceaa6f70b81d4fe68b124e00e118a76084c9/Makefile#L77 this step is hardcoded. It should be...
Currently, it's difficult to reuse data between two translation directions, as the majority of the files are placed in different directories (https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/configs/config.prod.yml#L18), e.g. `exp-name/src-trg`, meaning that all datasets will be redownloaded...
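One way to make downloads shareable between `src-trg` and `trg-src` is to key the download directory on the sorted language pair. A minimal sketch (the helper name is hypothetical):

```python
import os

def shared_pair_dir(root: str, src: str, trg: str) -> str:
    """Return one download directory for both translation directions,
    e.g. en->ru and ru->en both map to <root>/en-ru."""
    pair = "-".join(sorted((src, trg)))
    return os.path.join(root, pair)

assert shared_pair_dir("data", "ru", "en") == shared_pair_dir("data", "en", "ru")
```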
1. Better integrate with the pipeline settings.
2. Automatically discover models in MODELS_DIR (a sketch follows this list).
3. Remove intermediate files.
4. Do not require restarting the script when a new model was...
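A sketch of points 2 and 4, assuming models live in directories containing a `model.npz` file (the layout and polling interval are assumptions, not the repo's actual conventions):

```python
import glob
import os
import time

MODELS_DIR = os.environ.get("MODELS_DIR", "models")

def discover_models() -> set[str]:
    """Find every model directory under MODELS_DIR."""
    return {os.path.dirname(p)
            for p in glob.glob(os.path.join(MODELS_DIR, "**", "model.npz"),
                               recursive=True)}

def watch(interval: int = 60) -> None:
    """Evaluate newly appearing models without restarting the script."""
    seen: set[str] = set()
    while True:
        for model in sorted(discover_models() - seen):
            print(f"new model found: {model}")  # hand off to evaluation here
            seen.add(model)
        time.sleep(interval)
```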
Ulrich:

> The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps. It requires...
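A sketch of that approach with the SentencePiece Python API: one would start from the `nmt_nfkc.tsv` rule table shipped in the SentencePiece repo, append a rule mapping the soft hyphen's code point to an empty target (i.e. delete it), and pass the result to training via `normalization_rule_tsv` (the file and corpus names here are illustrative):

```python
import sentencepiece as spm

# Append a delete-soft-hyphen rule to a copy of SentencePiece's shipped
# nmt_nfkc.tsv so the existing normalization steps are preserved.
with open("nmt_nfkc.tsv") as src, open("nmt_nfkc_soft_hyphen.tsv", "w") as dst:
    dst.write(src.read())
    dst.write("00AD\t\n")  # source code point TAB empty target = remove

spm.SentencePieceTrainer.train(
    input="corpus.txt",                                  # illustrative path
    model_prefix="vocab",
    vocab_size=32000,
    normalization_rule_tsv="nmt_nfkc_soft_hyphen.tsv",   # custom table
)
```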