firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

Results 311 firefox-translations-training issues

See also: https://github.com/mozilla/firefox-translations/issues/365 https://github.com/browsermt/bergamot-translator/issues/185 https://github.com/browsermt/bergamot-translator/issues/419 https://github.com/mozilla/firefox-translations/issues/514 https://github.com/mozilla/firefox-translations/issues/511 https://github.com/mozilla/firefox-translations/issues/442 https://github.com/mozilla/firefox-translations/issues/375 https://bugzilla.mozilla.org/show_bug.cgi?id=1862017

quality

This would allow cleaning rules to be defined more quickly. https://helsinki-nlp.github.io/OpusFilter/automatic_configuration.html

quality
language-coverage

- produce space-tokenized alignments for teacher and student corpus to work with Tags (inline noise) modifier
- jointly train alignments for the original and back-translated corpus to improve accuracy
-...
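
A minimal sketch of what "space-tokenized alignments" could mean in practice, assuming the existing alignments are at the subword level, are written as `src-trg` index pairs separated by spaces, and use SentencePiece-style pieces where `▁` marks the start of a word. None of these assumptions are stated in the issue, and the function names `subword_to_word_index` and `remap_alignment` are hypothetical, not part of the pipeline.

```python
WORD_START = "\u2581"  # SentencePiece word-boundary marker (assumed tokenization)


def subword_to_word_index(pieces):
    """Map each piece position to the index of the whitespace word it belongs to."""
    mapping = []
    word = -1
    for piece in pieces:
        # A piece starting with the marker opens a new word; the very first
        # piece always opens word 0 even if the marker is missing.
        if piece.startswith(WORD_START) or word < 0:
            word += 1
        mapping.append(word)
    return mapping


def remap_alignment(alignment, src_pieces, trg_pieces):
    """Convert subword-level 'i-j' links into word-level links, dropping duplicates."""
    src_map = subword_to_word_index(src_pieces)
    trg_map = subword_to_word_index(trg_pieces)
    pairs = set()
    for link in alignment.split():
        s, t = link.split("-")
        pairs.add((src_map[int(s)], trg_map[int(t)]))
    return " ".join(f"{s}-{t}" for s, t in sorted(pairs))


if __name__ == "__main__":
    src = ["\u2581Hel", "lo", "\u2581world"]
    trg = ["\u2581Ahoj", "\u2581sv\u011bt", "e"]
    print(remap_alignment("0-0 1-0 2-1 2-2", src, trg))
```

Collapsing links through a set means several subword links between the same two words become one word-level link, so the remapped indices line up with whitespace-split tokens.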

We already use COMET in the models repo, but we don't have it in the training steps. This makes it hard to compare results against our release criteria while training models.

evals

We already have this for GPU workers because we've installed [the GCP Ops Agent](https://cloud.google.com/stackdriver/docs/solutions/agents) there. The CPU workers use a different image and worker type though, and this will be...

[skip ci] Continued from #284.

Everything we run in Taskcluster expires eventually. Anything we care about retaining must be uploaded elsewhere, and to ensure it happens we should automate it. We have existing tools for...

taskcluster
tc-p1

This list is for things that RelEng needs to take care of before we take a step back from active involvement in this project.
```[tasklist]
- [ ] https://github.com/mozilla/firefox-translations-training/issues/487
-...
```

taskcluster

This is a training branch for en-cs, based on #384. Its purpose is to see the results of combining NLLB and OpenSubtitles.

Based on @marco-c's feedback we should investigate how the HPLT project cleans monolingual data and whether we should adjust our cleaning procedure. https://hplt-project.org/HPLT_D3_1___Software_for_cleaning_data_sets.pdf

quality