
Add community contribution guidelines

Open · eu9ene opened this issue 1 year ago · 1 comment

People keep asking how to help add another language.

  1. A good first step would be helping to research datasets. To estimate the feasibility of training, we need statistics on how much data is available, including monolingual datasets.

  2. Contributing datasets that are not on OPUS or mtdata. A good example is when folks provided data for Catalan, and now @gregtatum is experimenting with it.

  3. Helping to tune cleaning rules. We have just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair, and contribute configs to the repo.

  4. For those looking to train a language pair themselves, help with maintaining the Snakemake pipeline would be handy.

  5. We might have simple issues to take care of as part of the training pipeline.
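
For the dataset research in item 1, here is a minimal sketch of the kind of statistics a contributor could gather and post. It assumes the corpus is a plain-text or gzip-compressed file with one sentence per line; the function name `corpus_stats` and the whitespace tokenization are illustrative choices, not part of the training pipeline.

```python
import gzip
from pathlib import Path


def corpus_stats(path):
    """Count non-empty lines and whitespace-separated tokens in a corpus file.

    Assumes one sentence per line, UTF-8 encoded, optionally gzip-compressed.
    """
    path = Path(path)
    # Pick the opener from the file extension (hypothetical convention).
    opener = gzip.open if path.suffix == ".gz" else open
    lines = 0
    tokens = 0
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            lines += 1
            tokens += len(line.split())
    return {"lines": lines, "tokens": tokens}
```

Running this on each side of a parallel dataset and on each monolingual file gives rough size numbers to report when proposing a language. Whitespace tokenization undercounts for languages written without spaces, so line counts are the safer figure to compare there.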

We can set up a workflow on GitHub by creating an issue for each language (ideally from a template), adding all the stats, and discussing anything related to the language there.

We should add a doc with clear guidelines on all this.

eu9ene, Jan 23 '24 19:01

> Helping tuning cleaning rules. We just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair and contribute configs to the repo.

I think something like https://github.com/hplt-project/OpusCleaner/issues/148#issuecomment-1905590936 would be ideal here.

marco-c, Jan 23 '24 19:01