firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

311 firefox-translations-training issues

Publication from a Taskcluster group using the `--overide-runs` argument manages to delete the existing runs of a group but fails to create new ones: ``` wandb: ERROR Error while calling W&B...

weights and biases
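
For context, a minimal sketch of the delete-then-recreate flow the publication step is expected to perform, using the public wandb API; the project path, group id, and run name below are placeholders, not the pipeline's actual values.

```python
import wandb

# Placeholder project path and Taskcluster group id, for illustration only.
PROJECT_PATH = "my-entity/my-project"
GROUP = "taskcluster-group-id"

api = wandb.Api()

# Step 1: delete the existing runs of the group (this part reportedly succeeds).
for run in api.runs(PROJECT_PATH, filters={"group": GROUP}):
    run.delete()

# Step 2: recreate a run for the group (this is where the W&B error shows up).
run = wandb.init(project="my-project", entity="my-entity", group=GROUP, name="example-run")
run.log({"placeholder_metric": 0.0})
run.finish()
```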

Any filtering should happen only in the cleaning stage (eventually in OpusCleaner). The max_words filtering at import time was originally copy-pasted from a random Bergamot bash script and was not...
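
For reference, a minimal sketch of a max_words filter applied as a standalone cleaning step over a parallel corpus; the threshold and the tab-separated input format are assumptions, not the pipeline's actual configuration.

```python
import sys

# Assumed threshold; the real value would come from the cleaning configuration.
MAX_WORDS = 100

# Read tab-separated source/target pairs from stdin and drop overly long pairs.
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        continue  # skip malformed lines
    src, tgt = fields
    if len(src.split()) <= MAX_WORDS and len(tgt.split()) <= MAX_WORDS:
        sys.stdout.write(line)
```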

Nikolay: Length filtering. Since Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, since one word can be made of 1-4 Chinese characters,...

language-coverage
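
A rough sketch of a character-based fallback for Chinese, assuming illustrative values for the word limit and the average characters-per-word ratio:

```python
def too_long(sentence: str, lang: str, max_words: int = 100, chars_per_word: float = 2.0) -> bool:
    """Length filter with a character-based fallback for Chinese.

    max_words and chars_per_word are illustrative assumptions: since one
    Chinese word is roughly 1-4 characters, an average of ~2 is used here.
    """
    if lang.startswith("zh"):
        # No useful whitespace tokenization: estimate the word count from characters.
        return len(sentence) / chars_per_word > max_words
    return len(sentence.split()) > max_words
```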

An experiment for #231: da-en is one of our best models from the spring-2024 run. The teacher ensemble had a COMET score of 0.9013. The student COMET was 0.8950, with...

experiment

We can check whether the language pairs that used multilingual Bicleaner models (for which hardrules were disabled) work better than the ones trained with the regular models and enabled hard...

quality

Make sure we map language codes correctly. Maybe there are some other things to adjust.

language-coverage
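
A hedged sketch of what such a mapping could look like; the example entries are hypothetical, and the real mismatches have to be verified per dataset provider:

```python
# Hypothetical examples of code mismatches between the pipeline and dataset
# providers; the real mapping has to be checked dataset by dataset.
LANG_CODE_MAP = {
    "zh": "zh-Hans",  # a provider that distinguishes Chinese by script
    "nb": "no",       # a provider that only offers the Norwegian macro-language code
}

def map_lang_code(code: str) -> str:
    """Return the provider-side code for a pipeline language code."""
    return LANG_CODE_MAP.get(code, code)
```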

Nikolay: The Chinese alphabet should be added. In general we can use Unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters In the past I have used something...

language-coverage
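
A sketch of the Unicode-range check mentioned above, limited to the main CJK blocks in the Basic Multilingual Plane; the threshold is an illustrative assumption:

```python
# Main CJK ideograph blocks (not exhaustive; extensions B+ live in the
# supplementary planes and are omitted here).
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)

def is_cjk_char(ch: str) -> bool:
    return any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES)

def mostly_cjk(text: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of non-space characters are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(is_cjk_char(c) for c in chars) / len(chars) >= threshold
```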

A meta-issue to track retraining of the older models like Italian, Portuguese, French, German, Spanish etc. We'll need to retrain them to incorporate the latest robustness fixes. Also open-source datasets...

meta
quality

In Firefox, the wasm build of the inference engine has a quite large memory footprint. There aren't good memory tools for analyzing the wasm. Instead, we should compile it natively, and...

inference

After a quick investigation, I see that the original parallel corpus was filtered from 70M to 35M sentences. Serbian is digraphic and uses both Latin and Cyrillic scripts. I see...

bug
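
A hedged sketch of one way to check whether the filtering drops one of the two scripts disproportionately, using the basic Cyrillic Unicode block as a heuristic; the corpus path is a placeholder:

```python
from collections import Counter

def detect_script(sentence: str) -> str:
    """Classify a Serbian sentence as Cyrillic, Latin, or mixed/other.

    Uses the basic Cyrillic block (U+0400-U+04FF); a rough heuristic,
    not a full script detector.
    """
    cyr = sum(1 for c in sentence if 0x0400 <= ord(c) <= 0x04FF)
    lat = sum(1 for c in sentence if c.isascii() and c.isalpha())
    if cyr and not lat:
        return "cyrillic"
    if lat and not cyr:
        return "latin"
    return "mixed/other"

# Placeholder corpus path: compare the script distribution before and after filtering.
with open("corpus.sr", encoding="utf-8") as f:
    print(Counter(detect_script(line) for line in f))
```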