Greg Tatum

204 results for issues by Greg Tatum

We already use COMET in the models repo, but don't have it in the training steps. This makes it hard to cross-compare against our release criteria while training models.

evals

[skip ci] Continued from #284.

This is a training branch for en-cs, based off of #384. Its purpose is to see the results of combining NLLB + OpenSubtitles.

This is the list of tasks that we need to handle in order to ramp up our ability to train many languages. These are things that break training runs, make...

meta

It would be nice to optimize the end-to-end time. Here it is 1 hour and 25 minutes: https://share.firefox.dev/3I192Z3. The training steps are the ones that take the longest...

cost & perf

When training language pairs it can be hard to know whether alignments are generated correctly and behaving as expected. They can be visualized with a tool like [word-alignment-visualization](https://pypi.org/project/word-alignment-visualization/). During the alignment...

enhancement
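As an illustration of what an alignment check can look like (this is a stdlib sketch, not the actual API of the word-alignment-visualization package), here is a minimal renderer for Pharaoh-format alignments ("0-0 1-1 ...", the format emitted by aligners like fast_align and eflomal) as an ASCII matrix:

```python
def render_alignment(src, trg, alignment):
    """Render Pharaoh-format word alignments as an ASCII matrix.

    Rows are source tokens, columns are target tokens; an "x" marks
    an aligned pair. Illustrative sketch only.
    """
    src_toks = src.split()
    trg_toks = trg.split()
    pairs = {tuple(map(int, p.split("-"))) for p in alignment.split()}
    width = max(len(t) for t in trg_toks)
    # Header row: target tokens, right-aligned to a fixed column width.
    rows = [" " * 10 + " ".join(t.rjust(width) for t in trg_toks)]
    for i, s in enumerate(src_toks):
        cells = " ".join(("x" if (i, j) in pairs else ".").rjust(width)
                         for j in range(len(trg_toks)))
        rows.append(f"{s:>10} {cells}")
    return "\n".join(rows)

print(render_alignment("the cat", "le chat", "0-0 1-1"))
```

A diagonal of "x" marks for a pair of short, literal sentences is a quick sanity signal that the alignment indices line up with the tokenization.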

For harder-to-segment languages we have Chinese, Japanese, and Korean. We'll need to implement better tokenization and segmentation support for these languages in order to train them. This...

meta
language-coverage
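To make the segmentation problem concrete: unspaced scripts need a segmenter rather than whitespace splitting. A classic baseline is greedy longest-match against a dictionary. The sketch below uses a toy, hypothetical vocabulary; a real pipeline would use a trained segmenter:

```python
def greedy_segment(text, vocab):
    """Greedy longest-match segmentation, a classic baseline for
    unspaced scripts. `vocab` is a toy dictionary; single characters
    are always accepted as a fallback. Illustrative sketch only."""
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for length in range(min(len(text) - i, 4), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                out.append(piece)
                i += length
                break
    return out

vocab = {"你好", "世界"}  # toy dictionary for illustration
print(greedy_segment("你好世界", vocab))  # ['你好', '世界']
```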

Right now it splits on word boundaries and limits the monolingual data to fewer than 100 "words". This needs to be changed to support another segmentation...

language-coverage
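The failure mode of the current word-boundary limit can be sketched like this (the constant and function name are hypothetical stand-ins for the behavior described above, not the pipeline's actual code): for an unspaced script, `str.split()` returns one giant token, so the length filter never triggers.

```python
MAX_WORDS = 100  # hypothetical mirror of the current limit

def passes_length_filter(line):
    # Current behavior as described: split on whitespace "word"
    # boundaries and reject long lines. For unspaced scripts like
    # Chinese, split() yields a single token, so even very long
    # sentences slip through the filter.
    return len(line.split()) < MAX_WORDS

print(passes_length_filter("word " * 150))  # False: 150 tokens
print(passes_length_filter("很" * 500))      # True, despite 500 characters
```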

I haven't fully audited the code, but I suspect that the monolingual data is not being deduplicated against the parallel data. For instance, in the `ca-en` model, OpenSubtitles was used...

quality
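The missing check suspected above could look roughly like this: hash every source-side sentence of the parallel data and drop monolingual lines that collide. This is a sketch with illustrative names, not the pipeline's actual functions; hashing keeps memory bounded for large corpora.

```python
import hashlib

def dedupe_mono(mono_lines, parallel_src_lines):
    """Drop monolingual lines that already appear on the source side
    of the parallel data. Illustrative sketch only."""
    seen = {hashlib.sha1(line.strip().encode("utf-8")).digest()
            for line in parallel_src_lines}
    return [line for line in mono_lines
            if hashlib.sha1(line.strip().encode("utf-8")).digest() not in seen]

parallel = ["the cat sat", "hello there"]
mono = ["hello there", "a new sentence"]
print(dedupe_mono(mono, parallel))  # ['a new sentence']
```

Without a step like this, sentences from a corpus such as OpenSubtitles can appear in both the parallel and monolingual sides and get counted twice during training.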

In this training run I didn't apply any cleaning configuration, which broke the task-building process: https://firefox-ci-tc.services.mozilla.com/tasks/bb_He8xySyee_GnvB4Ucog Here is the config I was using: https://github.com/mozilla/firefox-translations-training/blob/train-en-ca/configs/tc.prod.yml

enhancement
taskcluster