firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

Results 311 firefox-translations-training issues

If we're not filtering soft hyphen 0xad already, we should be. https://github.com/browsermt/bergamot-translator/issues/337

quality
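
A minimal sketch of what such a filter could look like in a text-cleaning step, assuming the text has already been decoded to Unicode so that 0xad corresponds to the code point U+00AD (SOFT HYPHEN). The name `strip_soft_hyphens` is hypothetical and is not taken from the pipeline.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, the character written as 0xad in the issue


def strip_soft_hyphens(text: str) -> str:
    """Remove every soft hyphen; all other characters are left untouched."""
    return text.replace(SOFT_HYPHEN, "")


if __name__ == "__main__":
    # The soft hyphens are invisible in most renderers but present in the data.
    print(strip_soft_hyphens("trans\u00adla\u00adtion"))
```

Ordinary hyphens (U+002D) are deliberately left alone, since they carry meaning in compounds.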

This issue is important only for HPC training where we don't want jobs to be too small, so we have to group them. It is even beneficial to have smaller...

HPC

We currently have 2 data backends: OpusMT and mtdata. They both seem to get the user a semi-disjoint set of datasets, and on top of that, those in mtdata are...

enhancement

After training is finished we would like to generate a report similar to this one: https://github.com/browsermt/students/blob/master/bgen/README.md More ideally, we would like to generate all the companion scripts so that the...

enhancement

A lot of the parallel corpora are located on statmt.org. As far as I could gather, the download-data step of Snakemake executes all corpora downloads in parallel, which unfortunately means...

bug

@kpu and I both noticed that tokenized text sometimes appears in the wild, which results in our models learning to place spaces around punctuation marks, which just looks bad...

quality
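
One rough way to spot such lines is a heuristic that flags whitespace inserted before sentence punctuation. The sketch below is only an illustration under that assumption; `looks_tokenized` and its regular expression are hypothetical, and the rule would need per-language tuning (French typography, for instance, legitimately puts a space before `!`, `?`, `:` and `;`).

```python
import re

# Whitespace between a word character and a punctuation mark, where the
# punctuation is followed by whitespace or end of line: "Hello , world !"
_SPACED_PUNCT = re.compile(r"\w\s+[,.;:!?](?=\s|$)")


def looks_tokenized(line: str) -> bool:
    """Return True if the line contains punctuation detached from its word."""
    return _SPACED_PUNCT.search(line) is not None


if __name__ == "__main__":
    for sample in ("Hello , world !", "Hello, world!"):
        print(repr(sample), looks_tokenized(sample))
```

A filter built on this could either drop flagged lines or route them to a detokenizer; which is preferable depends on how much data a corpus would lose.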

https://github.com/mozilla/firefox-translations-training/blob/main/pipeline/setup/install-deps.sh crashes if the user doesn't have root access or apt is configured to be run as a user. I can see that this step is hardcoded: https://github.com/mozilla/firefox-translations-training/blob/174cceaa6f70b81d4fe68b124e00e118a76084c9/Makefile#L77 It should be...

enhancement

Currently, it's difficult to reuse data between two translation directions, as the majority of the files are placed in different directories (https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/configs/config.prod.yml#L18), e.g. `exp-name/src-trg`, meaning that all datasets will be redownloaded....

enhancement
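
One possible approach, sketched here purely as an assumption, is to key the download directory on the unordered language pair so that both directions resolve to the same location. The function `shared_data_dir` and the `data/` subdirectory are hypothetical and do not reflect the pipeline's actual layout.

```python
from pathlib import PurePosixPath


def shared_data_dir(root: str, src: str, trg: str) -> PurePosixPath:
    """Build a data directory that is identical for src-trg and trg-src."""
    # Sorting the two language codes makes the key independent of direction.
    first, second = sorted((src, trg))
    return PurePosixPath(root) / "data" / f"{first}-{second}"


if __name__ == "__main__":
    print(shared_data_dir("exp-name", "ru", "en"))
    print(shared_data_dir("exp-name", "en", "ru"))
```

Direction-specific artifacts such as trained models would still need their own `src-trg` directories; only the downloaded corpora are shared in this sketch.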

1. Better integrate with the pipeline settings
2. Automatically discover models in MODELS_DIR
3. Remove intermediate file
4. Do not require to restart the script when a new model was...

enhancement

Ulrich: >The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps. It requires...

good first issue
quality