Greg Tatum

Results: 371 comments of Greg Tatum

I think this is fixed by removing the lexical shortlists.

This is another case where the ICU segmenter could be useful; see #860
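
For context, here is a minimal PyICU sketch of the kind of locale-aware segmentation the ICU segmenter provides. The Danish locale and sample text are just for illustration, not something from the issue itself.

```python
# Minimal PyICU sketch: locale-aware sentence segmentation.
# The locale and sample text are illustrative only.
from icu import BreakIterator, Locale

text = "Hej verden. Hvordan går det?"
breaker = BreakIterator.createSentenceInstance(Locale("da"))
breaker.setText(text)

start = breaker.first()
for end in breaker:  # iterating yields successive boundary offsets
    print(text[start:end].strip())
    start = end
```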

I haven't commented on this one, as I think it's worth looking at both approaches and coming to a consensus on the design. In #842 I'm rewriting the train.sh to...

I wrote a hacky truncation script to try this out and reuse the cached artifacts. https://github.com/mozilla/firefox-translations-training/tree/da-en-experiment
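
For reference, the core of that truncation is small. Here is a minimal Python sketch of the idea; the file names and the 75% fraction are illustrative, and the real script lives in the branch linked above.

```python
# Hypothetical sketch: keep the first N% of lines of a gzip-compressed corpus
# so the downstream tasks can reuse the cached artifacts unchanged.
import gzip

def truncate_corpus(src_path: str, dst_path: str, fraction: float) -> None:
    """Write the first `fraction` of lines from src_path to dst_path."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        lines = src.readlines()
    keep = int(len(lines) * fraction)
    with gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(lines[:keep])

# Example (illustrative file names):
truncate_corpus("mono.da.gz", "mono.da.75.gz", 0.75)
```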

[W&B Reports](https://wandb.ai/moz-translations/en-lt/reports/en-lt-hplt-experiment--Vmlldzo5MTkwMzMx/edit) [Training Dashboard](https://gregtatum.github.io/taskcluster-tools/src/training/?taskGroupIds=aXYThcEdSxi0JyAnwvJvYw%2CZJB0rHNATqiwZKAemQfiwQ%2CTffJZrTqQQyx3qs-edkcpA%2CSQHCSSwgQt6kjHkDK07__g%2CaF0rWZYsQdSYrO1O-MZhOQ%2CHZErNXKYRTmn2--UlVnC-Q&taskGroupNames=%5B%2225%25+%28cancelled%29%22%2C%2250%25+%28cancelled%29%22%2C%2275%25+%28cancelled%29%22%2C%22truncate.sh+fix%22%5D&showAll=true&taskGroupNames2=%7B%22SognJy1pQKG__xUQjAMqGg%22%3A%22canceled%22%2C%22ChvYUt2-QE25v4FlCx_0EQ%22%3A%22canceled%22%2C%22ZOWT1aQsQrqvHQ24x3anGA%22%3A%22truncate.sh+error%22%2C%22K9WqQqFsQUi2-6rdS0tfSQ%22%3A%2225%25%22%2C%22BCppIZyLQjS1AAiEXBLwVQ%22%3A%2250%25%22%2C%22Q49vTgTOStWGio7axf-Gfg%22%3A%2275%25%22%7D) Status: All 3 students are training

So I screwed up the truncation script, and everything trained the same. I'm re-running things. [New dashboard link](https://gregtatum.github.io/taskcluster-tools/src/training/?taskGroupIds=IhBuygMfR3yahUq0smq4fg)

- 25% [train-student](https://firefox-ci-tc.services.mozilla.com/tasks/MoJCveAoQyywBru298PuKg)
- 50% [train action](https://firefox-ci-tc.services.mozilla.com/tasks/WjdXyNiWTrCCCGBf_V_DMA)
- 75% [train action](https://firefox-ci-tc.services.mozilla.com/tasks/ZMSf5m9bSTCAiBoBIz89kA)

I'm starting another attempt on the latest main, and using `previous_group_ids`.

```
config: configs/experiments-H2-2024/da-en.yml
name: mono_75_percent
langpair: da-en
time: 2024-10-18 16:08:24.167164
train action: https://firefox-ci-tc.services.mozilla.com/tasks/W_feK0IfSNiJ7PyY0cg5rg
branch: dev-da-en-mono-reduction
hash: 52f8874c
config: configs/experiments-H2-2024/da-en.yml...
```

[train-student 75%](https://firefox-ci-tc.services.mozilla.com/tasks/Mjiu8_JzTOWdM3BzhZXVcQ) [train-student 50%](https://firefox-ci-tc.services.mozilla.com/tasks/d2E-auy2RMaRbh0BzO-wbA) [train-student 25%](https://firefox-ci-tc.services.mozilla.com/tasks/U86sUclHTEWk346706Q1bg)

I'm running another experiment with 1 million, plus 10,000 as a confidence check. The 10,000 run is training badly, as expected, so it looks like my truncation...

Here is a [W&B view for the runs](https://wandb.ai/moz-translations/da-en/workspace?nw=g5g2daxqtz8)