firefox-translations-training
Training pipelines for Firefox Translations neural machine translation models
If we're not already filtering the soft hyphen (0xAD) from the training data, we should be. See https://github.com/browsermt/bergamot-translator/issues/337
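A minimal sketch of what such a cleaning step could look like, written as a stdin-to-stdout filter so it could be dropped into a cleaning pipe (this is an illustration, not the repo's actual code):

```python
import sys

SOFT_HYPHEN = "\u00ad"  # 0xAD, invisible in most renderers

def strip_soft_hyphens(line: str) -> str:
    """Remove soft hyphens; they carry no translatable content."""
    return line.replace(SOFT_HYPHEN, "")

if __name__ == "__main__":
    for line in sys.stdin:
        sys.stdout.write(strip_soft_hyphens(line))
```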
This issue matters only for HPC training, where we don't want jobs to be too small, so we have to group them. It is even beneficial to have smaller...
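Snakemake's `group` directive is one existing mechanism for this: rules sharing a group name are submitted to the cluster as a single job. A sketch, with hypothetical rule names and commands:

```python
# Snakefile sketch: grouped rules avoid many tiny HPC submissions.
rule clean_mono:
    input: "data/mono/{lang}.raw.gz"
    output: "data/mono/{lang}.clean.gz"
    group: "preprocessing"          # grouped with other small steps
    shell: "pigz -dc {input} | clean-mono.sh | pigz > {output}"

rule deduplicate:
    input: "data/mono/{lang}.clean.gz"
    output: "data/mono/{lang}.dedup.gz"
    group: "preprocessing"          # runs in the same cluster job
    shell: "pigz -dc {input} | sort -u | pigz > {output}"
```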
We currently have two data backends: OpusMT and mtdata. Each seems to give the user a semi-disjoint set of datasets, and on top of that, those in mtdata are...
After training is finished, we would like to generate a report similar to this one: https://github.com/browsermt/students/blob/master/bgen/README.md. Even better, we would like to generate all the companion scripts so that the...
Many of the parallel corpora are hosted on statmt.org. As far as I can tell, the Snakemake download-data step executes all corpus downloads in parallel, which unfortunately means...
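Snakemake's `resources` mechanism can serve as a per-host throttle here. A sketch (rule name and URL pattern are illustrative):

```python
# Snakefile sketch: cap concurrent statmt.org connections.
rule download_statmt:
    output: "data/download/{dataset}.gz"
    resources:
        statmt_connections=1        # each running download holds one slot
    shell: "wget -O {output} 'https://data.statmt.org/{wildcards.dataset}.gz'"
```

Running with `snakemake --resources statmt_connections=2` would then allow at most two simultaneous downloads from statmt.org while unrelated rules proceed unthrottled.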
@kpu and I both noticed that pre-tokenized texts sometimes appear in the wild, resulting in our models learning to place spaces around punctuation marks, which just looks bad...
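A cheap heuristic for catching such lines during cleaning, sketched in Python (the pattern and threshold are assumptions to tune on real data):

```python
import re

# Space *before* sentence punctuation is a strong tokenization signal:
# natural text reads "word, word." while tokenized text reads "word , word ."
TOKENIZED_PUNCT = re.compile(r"\s[.,!?;:%)]|\(\s")

def looks_tokenized(line: str, max_hits: int = 2) -> bool:
    """Flag lines that likely come from pre-tokenized corpora."""
    return len(TOKENIZED_PUNCT.findall(line)) > max_hits

# Usage: drop suspicious pairs during data cleaning.
# keep = [p for p in pairs if not looks_tokenized(p.src) and not looks_tokenized(p.trg)]
```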
https://github.com/mozilla/firefox-translations-training/blob/main/pipeline/setup/install-deps.sh crashes if the user doesn't have root access and apt is not configured to run as a regular user. I can see that in https://github.com/mozilla/firefox-translations-training/blob/174cceaa6f70b81d4fe68b124e00e118a76084c9/Makefile#L77 this step is hardcoded. It should be...
Currently, it's difficult to reuse data between two translation directions, as the majority of the files are placed in different directories (https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/configs/config.prod.yml#L18), e.g. `exp-name/src-trg`, meaning that all datasets will be redownloaded...
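One way to make downloads shareable between `src-trg` and `trg-src` is to key the download directory on the sorted language pair. A minimal sketch (the helper name is hypothetical):

```python
import os

def shared_pair_dir(root: str, src: str, trg: str) -> str:
    """Return one download directory for both translation directions,
    e.g. en->ru and ru->en both map to <root>/en-ru."""
    pair = "-".join(sorted((src, trg)))
    return os.path.join(root, pair)

assert shared_pair_dir("data", "ru", "en") == shared_pair_dir("data", "en", "ru")
```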
1. Better integrate with the pipeline settings.
2. Automatically discover models in MODELS_DIR (a sketch follows this list).
3. Remove intermediate files.
4. Do not require restarting the script when a new model was...
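A sketch of points 2 and 4, assuming models live in directories containing a `model.npz` file (the layout and polling interval are assumptions, not the repo's actual conventions):

```python
import glob
import os
import time

MODELS_DIR = os.environ.get("MODELS_DIR", "models")

def discover_models() -> set[str]:
    """Find every model directory under MODELS_DIR."""
    return {os.path.dirname(p)
            for p in glob.glob(os.path.join(MODELS_DIR, "**", "model.npz"),
                               recursive=True)}

def watch(interval: int = 60) -> None:
    """Evaluate newly appearing models without restarting the script."""
    seen: set[str] = set()
    while True:
        for model in sorted(discover_models() - seen):
            print(f"new model found: {model}")  # hand off to evaluation here
            seen.add(model)
        time.sleep(interval)
```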
Ulrich:

> The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps. It requires...
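A sketch of that approach with the SentencePiece Python API: one would start from the `nmt_nfkc.tsv` rule table shipped in the SentencePiece repo, append a rule mapping the soft hyphen's code point to an empty target (i.e. delete it), and pass the result to training via `normalization_rule_tsv` (the file and corpus names here are illustrative):

```python
import sentencepiece as spm

# Append a delete-soft-hyphen rule to a copy of SentencePiece's shipped
# nmt_nfkc.tsv so the existing normalization steps are preserved.
with open("nmt_nfkc.tsv") as src, open("nmt_nfkc_soft_hyphen.tsv", "w") as dst:
    dst.write(src.read())
    dst.write("00AD\t\n")  # source code point TAB empty target = remove

spm.SentencePieceTrainer.train(
    input="corpus.txt",                                  # illustrative path
    model_prefix="vocab",
    vocab_size=32000,
    normalization_rule_tsv="nmt_nfkc_soft_hyphen.tsv",   # custom table
)
```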