firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

Results 311 firefox-translations-training issues

The `mtdata_ELRC-luxembourg.lu-1-eng-fra.{en,fr}.gz` files had HTML in them. This is an attempt to strip that. This patch is still untested. I've run this script manually on the files, not through the...
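For illustration only, here is one way HTML markup could be stripped from a line of corpus text using Python's standard `html.parser` module. This is a minimal sketch, not the script from the patch: the `strip_html` function and `_TextExtractor` class are hypothetical names, and the approach (keep text nodes, drop tags, decode entities) is an assumption about what "stripping HTML" means here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &amp; into plain text.
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(line: str) -> str:
    """Return the line with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(line)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b> &amp; all</p>"))
```

In a parallel corpus, both sides of a sentence pair would need the same treatment so that the line alignment is preserved.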

Following the instructions in the README, I get to the `make dry-run` step, where `snakemake` errors out, saying it does not know mamba. To fix this, I had to additionally ```bash export PATH=$(conda...

bug

These changes allow Marian training jobs on Slurm to be interrupted without losing training progress. The script requests an early warning from Slurm a set amount of time (currently 300...

We need this to prevent further training if there is a bug. We can add an assert to the evaluation script. It will check that metrics are higher than some...

enhancement
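A check of this kind might look roughly like the sketch below. The function name `assert_metric_above`, the metric name, and the numbers are all hypothetical; the issue text does not state the actual threshold or where in the evaluation script the check would live.

```python
def assert_metric_above(name: str, value: float, minimum: float) -> None:
    """Fail loudly if an evaluation metric is not higher than a floor.

    An explicit raise is used instead of the `assert` statement so the
    check still runs when Python is started with -O.
    """
    if not value > minimum:
        raise AssertionError(
            f"{name} = {value:.2f} is not higher than the minimum "
            f"{minimum:.2f}; stopping before further training"
        )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    assert_metric_above("BLEU", 31.4, minimum=20.0)
    print("metrics look sane")
```

Raising an exception makes the evaluation step exit with a non-zero status, which a workflow manager would normally treat as a failed job and so would not schedule the downstream steps.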

mtdata sources include BCP-47 datasets whose tags have the format xxx_Yyyy_ZZ, where Yyyy and ZZ are optional. Compressed downloads from these sources include the tag in the file extension, e.g. downloading `- mtdata_Statmt-ccaligned-1-eng-zho_CN`...

bug
data
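To make the tag format concrete, the sketch below splits a tag of the form xxx_Yyyy_ZZ into its parts. It is a simplified illustration that follows only the pattern described in the issue (three lowercase letters, an optional four-letter script, an optional two-letter region); it is not a full BCP-47 parser, and `parse_tag`, `LangTag`, and `TAG_RE` are hypothetical names, not part of mtdata or the pipeline.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: xxx, optionally followed by _Yyyy and/or _ZZ.
TAG_RE = re.compile(
    r"^(?P<lang>[a-z]{3})"
    r"(?:_(?P<script>[A-Z][a-z]{3}))?"
    r"(?:_(?P<region>[A-Z]{2}))?$"
)


class LangTag(NamedTuple):
    lang: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag: str) -> LangTag:
    """Split a tag such as 'zho_CN' into language, script and region."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognised language tag: {tag!r}")
    # Groups that did not take part in the match come back as None.
    return LangTag(match["lang"], match["script"], match["region"])


if __name__ == "__main__":
    for tag in ("eng", "zho_CN", "zho_Hans_CN"):
        print(tag, "->", parse_tag(tag))
```

Code that derives file names from a dataset name could use the `lang` field alone when it needs the bare language code, rather than assuming the extension is always three letters.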

I see that bicleaner-ai takes more than 36 hours for some large datasets, even on a pretty good GPU. This really depends on the GPU model on the HPC. Maybe it's A100...

HPC

I've seen that you integrate Bicleaner in your pipeline and you choose the tool based on available languages. FYI there's the [full en-xx](https://github.com/bitextor/bicleaner-ai-data/releases/download/v1.0/full-en-xx.tgz) model that is trained with the concatenation...

quality
coverage

When training a French system, we ran into a failed "download" by sacrebleu because the training pipeline presumes it can pass en-fr to sacrebleu when the data set is only...

bug
data

Our models are trained mostly on data that has proper capitalisation, but in the wild people and websites sometimes use ALL CAPS when typing. Since our models haven't seen those...

quality

Chinese poses several unique challenges not present in other language pairs. I will start this mega-issue and update the individual points that need to happen for those languages to be...

language-coverage