firefox-translations-training issues

Some words should be passing through untranslated (e.g. IPA characters, emojis, etc.)

4

See also: https://github.com/mozilla/firefox-translations/issues/365 https://github.com/browsermt/bergamot-translator/issues/185 https://github.com/browsermt/bergamot-translator/issues/419 https://github.com/mozilla/firefox-translations/issues/514 https://github.com/mozilla/firefox-translations/issues/511 https://github.com/mozilla/firefox-translations/issues/442 https://github.com/mozilla/firefox-translations/issues/375 https://bugzilla.mozilla.org/show_bug.cgi?id=1862017

marco-c

quality

Investigate automatic generation of cleaning rules using OpusFilter

This would allow cleaning rules to be defined more quickly. https://helsinki-nlp.github.io/OpusFilter/automatic_configuration.html

marco-c

quality

language-coverage

[Experiment] Inline noise Feb 2024

6

- produce space-tokenized alignments for teacher and student corpus to work with Tags (inline noise) modifier
- jointly train alignments for the original and back-translated corpus to improve accuracy
-...

eu9ene

Add COMET to the evaluation steps

We already use COMET in the models repo, but don't have it in the training steps. This makes it hard to compare against our release criteria while training models.

gregtatum

evals

enable memory monitoring on CPU workers

1

We already have this for GPU workers because we've installed [the GCP Ops Agent](https://cloud.google.com/stackdriver/docs/solutions/agents) there. The CPU workers use a different image and worker type though, and this will be...

bhearsum

[Experiment] Train en-ca - Feb 2024

13

[skip ci] Continued from #284.

gregtatum

automatically upload important artifacts to a GCP bucket

16

Everything we run in Taskcluster expires eventually. Anything we care about retaining must be uploaded elsewhere, and to ensure it happens we should automate it. We have existing tools for...

bhearsum

taskcluster

tc-p1

[meta] issues before primary maintenance of taskgraph code in this repository is handed off to translations engineers

This list is for things that RelEng needs to take care of before we take a step back from active involvement in this project. ```[tasklist] - [ ] https://github.com/mozilla/firefox-translations-training/issues/487 -...

bhearsum

taskcluster

[Experiment] Train en cs - Mar 2024

1

This is a training branch for en-cs, based on #384. Its purpose is to see the results of combining NLLB and OpenSubtitles.

gregtatum

Investigate monolingual cleaning

3

Based on @marco-c's feedback we should investigate how the HPLT project cleans monolingual data and whether we should adjust our cleaning procedure. https://hplt-project.org/HPLT_D3_1___Software_for_cleaning_data_sets.pdf

eu9ene

quality
