firefox-translations-training
Add community contribution guidelines
People keep asking how to help add another language.
- A good first step would be helping to research datasets. To estimate the feasibility of training, we need statistics on how much data is available, including monolingual datasets.
- Contributing datasets that are not on OPUS or mtdata. A good example is when folks provided data for Catalan, and now @gregtatum is experimenting with it.
- Helping to tune cleaning rules. We just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair, and contribute configs to the repo.
- For those looking to train a language pair themselves, helping with Snakemake maintenance would be handy.
- We might have simple issues to take care of as part of the training pipeline.
We can set up a workflow on GitHub by creating an issue for each language (ideally from a template), adding all the stats there, and discussing everything related to that language in the same place.
We should add a doc with clear guidelines on all this.
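To make the "statistics on how much data is available" point concrete, here is a minimal sketch of the kind of numbers a contributor could post in a language issue. It is not part of the training pipeline: the helper names `open_text` and `corpus_stats` are hypothetical, and it assumes the corpus is a plain or gzip-compressed UTF-8 text file with one sentence per line.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading as UTF-8."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def corpus_stats(path):
    """Count non-empty lines and whitespace-separated tokens in one corpus file.

    Assumption: one sentence per line. Whitespace tokenization is only a rough
    size estimate and undercounts for languages written without spaces.
    """
    sentences = 0
    tokens = 0
    with open_text(path) as f:
        for line in f:
            line = line.strip()
            if line:
                sentences += 1
                tokens += len(line.split())
    return {"sentences": sentences, "tokens": tokens}
```

Running `corpus_stats` on each side of a parallel corpus and on each monolingual file gives a quick size estimate that can be pasted into the issue for that language.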
> Helping tuning cleaning rules. We just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair and contribute configs to the repo.
I think something like https://github.com/hplt-project/OpusCleaner/issues/148#issuecomment-1905590936 would be ideal here.