firefox-translations-training issues

Use our localization data for training

2

https://github.com/mozilla-l10n/mt-training-data Maybe we could add it to OPUS.

marco-c

data sources

Check training for CJK

3

Does it require any adjustment? Should we change any hyperparameters, etc.?

eu9ene

language-coverage

Use HPLT 2.0

https://hplt-project.org/datasets/v2.0

eu9ene

data sources

Migrate Taskcluster UI tools to this repo

3

I'm talking about the set of tools by @gregtatum (https://gregtatum.github.io/taskcluster-tools/). At least the training dashboard is translations-specific and ideally should live in this repo, following the monorepo idea. It...

eu9ene

Retrain old models with robustness fixes

1

We will need to retrain models that don't have the robustness fixes from using OpusTrainer.
- [ ] bg-en
- [ ] de-en
- [ ] en-bg
- [ ]...

gregtatum

quality

Consider statistically translating short sentences from monolingual datasets.

Short sentences are frequently removed from parallel datasets, so there aren't enough to train on. In HPLT 2.0 the data is filtered at the document level, rather than sentence level....

gregtatum

quality

data sources

Consider harvesting short sentences from parallel data

It's not guaranteed that parallel "sentences" are actually sentences. We could write a script to detect how many sentences are in each parallel datum, and then attempt to extract them...

gregtatum

quality

data sources

Consider using data augmentation to synthesize one-word translations

We already statistically generate word alignment information, so it should be possible to go through parallel datasets and generate word pairs from the most common aligned words. Since the...

gregtatum

quality

data sources

temp: switch to temporary gpu worker image that issues dnsmasq to sanity check it

bhearsum

Don't translate idioms literally

Try translating https://www.omniglot.com/language/idioms/swedish.htm to Swedish. The English idioms are translated literally. While that may be useful *for this particular page* (if one doesn't know English), for content in general when...

zcorpan

feedback
