Greg Tatum

204 results for issues by Greg Tatum

We already use COMET in the models repo, but don't have it in the training steps. This makes it hard to cross-compare against our release criteria while training models.

evals

[skip ci] Continued from #284.

This is a training branch for en-cs, based off of #384. Its purpose is to see the results of combining NLLB + OpenSubtitles.

This is the list of tasks that we need to handle in order to ramp up our ability to train many languages. These are things that break training runs, make...

meta

It would be nice to optimize the end-to-end time. Here it is 1 hour and 25 minutes: https://share.firefox.dev/3I192Z3. The training steps are the ones that take the longest...

cost & perf

When training language pairs it can be hard to know whether alignments are generated correctly and behaving as expected. They can be visualized with a tool like [word-alignment-visualization](https://pypi.org/project/word-alignment-visualization/). During the alignment...

enhancement
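As an illustration of what an alignment check can look like (this is a stdlib sketch, not the actual API of the word-alignment-visualization package), here is a minimal renderer for Pharaoh-format alignments ("0-0 1-1 ...", the format emitted by aligners like fast_align and eflomal) as an ASCII matrix:

```python
def render_alignment(src, trg, alignment):
    """Render Pharaoh-format word alignments as an ASCII matrix.

    Rows are source tokens, columns are target tokens; an "x" marks
    an aligned pair. Illustrative sketch only.
    """
    src_toks = src.split()
    trg_toks = trg.split()
    pairs = {tuple(map(int, p.split("-"))) for p in alignment.split()}
    width = max(len(t) for t in trg_toks)
    # Header row: target tokens, right-aligned to a fixed column width.
    rows = [" " * 10 + " ".join(t.rjust(width) for t in trg_toks)]
    for i, s in enumerate(src_toks):
        cells = " ".join(("x" if (i, j) in pairs else ".").rjust(width)
                         for j in range(len(trg_toks)))
        rows.append(f"{s:>10} {cells}")
    return "\n".join(rows)

print(render_alignment("the cat", "le chat", "0-0 1-1"))
```

A diagonal of "x" marks for a pair of short, literal sentences is a quick sanity signal that the alignment indices line up with the tokenization.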

For harder-to-segment languages we have Chinese, Japanese, and Korean. We'll need to implement better tokenization and segmentation support for these languages in order to train them. This...

meta
language-coverage
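To make the segmentation problem concrete: unspaced scripts need a segmenter rather than whitespace splitting. A classic baseline is greedy longest-match against a dictionary. The sketch below uses a toy, hypothetical vocabulary; a real pipeline would use a trained segmenter:

```python
def greedy_segment(text, vocab):
    """Greedy longest-match segmentation, a classic baseline for
    unspaced scripts. `vocab` is a toy dictionary; single characters
    are always accepted as a fallback. Illustrative sketch only."""
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for length in range(min(len(text) - i, 4), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                out.append(piece)
                i += length
                break
    return out

vocab = {"你好", "世界"}  # toy dictionary for illustration
print(greedy_segment("你好世界", vocab))  # ['你好', '世界']
```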

Right now it splits on word boundaries and limits the monolingual data to fewer than 100 "words". This needs to be changed to support another segmentation...

language-coverage
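The failure mode of the current word-boundary limit can be sketched like this (the constant and function name are hypothetical stand-ins for the behavior described above, not the pipeline's actual code): for an unspaced script, `str.split()` returns one giant token, so the length filter never triggers.

```python
MAX_WORDS = 100  # hypothetical mirror of the current limit

def passes_length_filter(line):
    # Current behavior as described: split on whitespace "word"
    # boundaries and reject long lines. For unspaced scripts like
    # Chinese, split() yields a single token, so even very long
    # sentences slip through the filter.
    return len(line.split()) < MAX_WORDS

print(passes_length_filter("word " * 150))  # False: 150 tokens
print(passes_length_filter("很" * 500))      # True, despite 500 characters
```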

I haven't fully audited the code, but I suspect that the monolingual data is not being deduplicated against the parallel data. For instance, in the `ca-en` model, OpenSubtitles was used...

quality
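The missing check suspected above could look roughly like this: hash every source-side sentence of the parallel data and drop monolingual lines that collide. This is a sketch with illustrative names, not the pipeline's actual functions; hashing keeps memory bounded for large corpora.

```python
import hashlib

def dedupe_mono(mono_lines, parallel_src_lines):
    """Drop monolingual lines that already appear on the source side
    of the parallel data. Illustrative sketch only."""
    seen = {hashlib.sha1(line.strip().encode("utf-8")).digest()
            for line in parallel_src_lines}
    return [line for line in mono_lines
            if hashlib.sha1(line.strip().encode("utf-8")).digest() not in seen]

parallel = ["the cat sat", "hello there"]
mono = ["hello there", "a new sentence"]
print(dedupe_mono(mono, parallel))  # ['a new sentence']
```

Without a step like this, sentences from a corpus such as OpenSubtitles can appear in both the parallel and monolingual sides and get counted twice during training.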

In this training run I didn't apply any cleaning configuration, which broke the task-building process: https://firefox-ci-tc.services.mozilla.com/tasks/bb_He8xySyee_GnvB4Ucog Here is the config I was using: https://github.com/mozilla/firefox-translations-training/blob/train-en-ca/configs/tc.prod.yml

enhancement
taskcluster