firefox-translations-training
Training pipelines for Firefox Translations neural machine translation models
The `mtdata_ELRC-luxembourg.lu-1-eng-fra.{en,fr}.gz` files had HTML in them. This is an attempt to strip that. This patch is still untested. I've run this script manually on the files, not through the...
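Stripping markup from a line-oriented corpus file could be sketched roughly as below. This is only an illustration, not the script from the patch: `strip_html` and `TAG_RE` are hypothetical names, and the regex assumes simple inline tags rather than arbitrary HTML.

```python
import html
import re

# Matches anything that looks like an HTML/XML tag, e.g. <p>, </b>, <br/>.
# Assumption: tags do not span multiple lines.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(line: str) -> str:
    """Remove tags, decode entities, and collapse whitespace in one line."""
    # Replace tags with a space so adjacent words are not glued together.
    without_tags = TAG_RE.sub(" ", line)
    # Decode entities only after tag removal, so an escaped "&lt;b&gt;"
    # is kept as literal text instead of being treated as a tag.
    decoded = html.unescape(without_tags)
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(decoded.split())
```

Applying it line by line (rather than dropping lines) keeps the `en` and `fr` files aligned. Note that a sentence containing literal `<` and `>` characters would also lose the text between them, so the result should be checked on real data.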
Following the instructions in the README, I get to the `make dry-run` step, and `snakemake` fails with an error saying it does not know mamba. To fix this, I additionally had to ```bash export PATH=$(conda...
These changes allow marian training jobs on slurm to be interrupted without losing training progress. The script requests an early warning from slurm a set amount of time (currently 300...
We need this to prevent further training if there is a bug. We can add an assert to the evaluation script. It will check that metrics are higher than some...
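A check of this kind might look like the sketch below. The function name `assert_metric_above`, the metric name, and the threshold values are all hypothetical; the issue does not say which thresholds would be used.

```python
def assert_metric_above(name: str, value: float, minimum: float) -> None:
    """Raise AssertionError if a metric is not strictly above its minimum."""
    if not value > minimum:
        raise AssertionError(
            f"{name} = {value} is not higher than the required minimum "
            f"{minimum}; stopping before further training"
        )


# Hypothetical usage in an evaluation script: a BLEU score of 25.3
# against an assumed minimum of 10.0 passes silently.
assert_metric_above("bleu", 25.3, 10.0)
```

Raising an exception makes the evaluation step exit with a non-zero status, which a workflow manager such as Snakemake treats as a failed job, so downstream training steps would not start.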
mtdata sources include BCP-47 datasets whose tags have the format xxx_Yyyy_ZZ, where Yyyy and ZZ are optional. Compressed downloads from these include the tag in the extension, e.g. downloading `- mtdata_Statmt-ccaligned-1-eng-zho_CN`...
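To make the tag format concrete, here is a minimal sketch of parsing it. The name `parse_tag` and the exact regex are assumptions based only on the xxx_Yyyy_ZZ description above (three-letter language, optional four-letter script, optional two-letter region); mtdata's own parsing may accept more variants.

```python
import re
from typing import NamedTuple, Optional

# xxx_Yyyy_ZZ: language (3 lowercase letters), optional script
# (capitalised, 4 letters), optional region (2 uppercase letters).
TAG_RE = re.compile(r"^([a-z]{3})(?:_([A-Z][a-z]{3}))?(?:_([A-Z]{2}))?$")


class LangTag(NamedTuple):
    lang: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag: str) -> LangTag:
    """Split a tag such as 'zho_CN' into language, script, and region."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a recognised xxx_Yyyy_ZZ tag: {tag!r}")
    return LangTag(*match.groups())
```

With this, `parse_tag("zho_CN").lang` gives the bare language code `zho`, which is the part a pipeline would typically need when the extension carries the full tag.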
I see that bicleaner-ai takes more than 36 hours for some large datasets on a pretty good GPU. This really depends on the GPU model on the HPC. Maybe it's A100...
I've seen that you integrate Bicleaner in your pipeline and you choose the tool based on available languages. FYI there's the [full en-xx](https://github.com/bitextor/bicleaner-ai-data/releases/download/v1.0/full-en-xx.tgz) model that is trained with the concatenation...
When training a French system, we ran into a failed "download" by sacrebleu because the training pipeline presumes it can pass en-fr to sacrebleu when the data set is only...
Our models are trained mostly on data that has proper capitalisation, but in the wild people and websites sometimes use ALL CAPS when typing. Since our models haven't seen those...
Chinese poses several unique challenges not present in other language pairs. I will start this mega-issue and update the individual points that need to happen for those languages to be...