firefox-translations-training
[meta] Train easy to segment LTR languages
In the short term, we are focusing on building up our language list by training easy-to-segment LTR languages, as they don't require changes to the training pipeline and are immediately supported in Firefox. These are broken into three groups, based on the sentence count available in the OPUS datasets.
Data Availability | Sentence Count |
---|---|
High Resource | > 80 million |
Med Resource | 20 - 80 million |
Low Resource | < 20 million |
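
As a rough illustration, the grouping above can be sketched in Python. The function name `resource_group` and the treatment of the exact boundary values (20 and 80 million both fall into "Med Resource") are assumptions made for this example; this is not code from the training pipeline.

```python
# Thresholds taken from the table above (sentence counts from OPUS).
HIGH_THRESHOLD = 80_000_000
LOW_THRESHOLD = 20_000_000


def resource_group(sentence_count: int) -> str:
    """Map a parallel sentence count to a resource group (hypothetical helper)."""
    if sentence_count > HIGH_THRESHOLD:
        return "High Resource"
    # Assumption: the "20 - 80 million" range includes both endpoints.
    if sentence_count >= LOW_THRESHOLD:
        return "Med Resource"
    return "Low Resource"
```

For example, `resource_group(100_000_000)` returns `"High Resource"` and `resource_group(5_000_000)` returns `"Low Resource"`.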
Assuming that resource availability is roughly equivalent to the quality we will be able to achieve yields the following table:
High Quality | Medium Quality | Low Quality |
---|---|---|
Russian (en-ru) | Vietnamese | Norwegian (Bokmål) |
Indonesian | Slovak | Basque |
Czech (en-cs) | Ukrainian (en-uk) | Galician |
Hungarian (en-hu) | Slovenian (en-sl) | Norwegian (Nynorsk) |
Turkish (en-tr) | Catalan (ready to ship) | |
Greek (en-el) | Lithuanian | |
Finnish (en-fi) | Croatian | |
Swedish | Serbian | |
Romanian | Latvian | |
Danish | Valenciano | |
Bosnian | | |
We will focus on potentially "high quality" languages first, and follow up with "medium quality". It's unclear how well the "low quality" languages will perform and whether they will meet our shippable criteria, but that can be evaluated.
More links
- We have a dashboard for an up-to-date list of what models we have shipped.
- To request additional languages, post a request on Mozilla Connect, or find an existing request for a language and give it a thumbs up.
Native Speakers
If you are a native (L1) speaker of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.
For our upcoming training run, this table should summarize what monolingual data is available.
Name | Difficulty | To en | From en | Newscrawl |
---|---|---|---|---|
Russian | ready to train | Released | Nightly | yes |
Indonesian | ready to train | | | yes |
Czech | ready to train | Nightly | Nightly | yes |
Hungarian | ready to train | Released | Nightly | yes |
Turkish | ready to train | | | yes |
Greek | ready to train | | | yes |
Finnish | ready to train | Released | Nightly | yes |
Romanian | ready to train | | | yes |
Ukrainian | medium resource | Released | Nightly | yes |
Lithuanian | medium resource | Nightly | | yes |
Croatian | medium resource | | | yes |
Serbian | medium resource | | | yes |
Latvian | medium resource | | | yes |
Bosnian | ready to train | | | yes |
Vietnamese | medium resource | | | no |
Swedish | ready to train | | | no |
Slovak | medium resource | | | no |
Danish | ready to train | | | no |
Slovenian | medium resource | | | no |
Valenciano | medium resource | | | no |
Macocu has monolingual data for some of these languages: https://macocu.eu/#corpora-section.