firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

311 firefox-translations-training issues

Publication from a Taskcluster group using the `--overide-runs` argument manages to delete the existing runs of a group but fails to create new ones: ``` wandb: ERROR Error while calling W&B...

weights and biases
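
For context, a minimal sketch of the delete-then-recreate flow the publication step is expected to perform, using the public wandb API; the project path, group id, and run name below are placeholders, not the pipeline's actual values.

```python
import wandb

# Placeholder project path and Taskcluster group id, for illustration only.
PROJECT_PATH = "my-entity/my-project"
GROUP = "taskcluster-group-id"

api = wandb.Api()

# Step 1: delete the existing runs of the group (this part reportedly succeeds).
for run in api.runs(PROJECT_PATH, filters={"group": GROUP}):
    run.delete()

# Step 2: recreate a run for the group (this is where the W&B error shows up).
run = wandb.init(project="my-project", entity="my-entity", group=GROUP, name="example-run")
run.log({"placeholder_metric": 0.0})
run.finish()
```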

Any filtering should happen only in the cleaning stage (eventually in OpusCleaner). The max_words filtering at import time was originally copy-pasted from a random Bergamot bash script and was not...
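
For reference, a minimal sketch of a max_words filter applied as a standalone cleaning step over a parallel corpus; the threshold and the tab-separated input format are assumptions, not the pipeline's actual configuration.

```python
import sys

# Assumed threshold; the real value would come from the cleaning configuration.
MAX_WORDS = 100

# Read tab-separated source/target pairs from stdin and drop overly long pairs.
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        continue  # skip malformed lines
    src, tgt = fields
    if len(src.split()) <= MAX_WORDS and len(tgt.split()) <= MAX_WORDS:
        sys.stdout.write(line)
```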

Nikolay: Length filtering. Since Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, since one word can be made of 1-4 Chinese characters,...

language-coverage
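
A rough sketch of a character-based fallback for Chinese, assuming illustrative values for the word limit and the average characters-per-word ratio:

```python
def too_long(sentence: str, lang: str, max_words: int = 100, chars_per_word: float = 2.0) -> bool:
    """Length filter with a character-based fallback for Chinese.

    max_words and chars_per_word are illustrative assumptions: since one
    Chinese word is roughly 1-4 characters, an average of ~2 is used here.
    """
    if lang.startswith("zh"):
        # No useful whitespace tokenization: estimate the word count from characters.
        return len(sentence) / chars_per_word > max_words
    return len(sentence.split()) > max_words
```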

An experiment for #231: da-en is one of our best models from the spring-2024 run. The teacher ensemble had a COMET score of 0.9013. The student COMET was 0.8950, with...

experiment

We can check whether the language pairs that used multilingual Bicleaner models (for which hardrules were disabled) work better than the ones trained with the regular models and enabled hard...

quality

Make sure we map language codes correctly. Maybe there are some other things to adjust.

language-coverage
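
A hedged sketch of what such a mapping could look like; the example entries are hypothetical, and the real mismatches have to be verified per dataset provider:

```python
# Hypothetical examples of code mismatches between the pipeline and dataset
# providers; the real mapping has to be checked dataset by dataset.
LANG_CODE_MAP = {
    "zh": "zh-Hans",  # a provider that distinguishes Chinese by script
    "nb": "no",       # a provider that only offers the Norwegian macro-language code
}

def map_lang_code(code: str) -> str:
    """Return the provider-side code for a pipeline language code."""
    return LANG_CODE_MAP.get(code, code)
```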

Nikolay: The Chinese alphabet should be added. In general we can use Unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters In the past I have used something...

language-coverage
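
A sketch of the Unicode-range check mentioned above, limited to the main CJK blocks in the Basic Multilingual Plane; the threshold is an illustrative assumption:

```python
# Main CJK ideograph blocks (not exhaustive; extensions B+ live in the
# supplementary planes and are omitted here).
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)

def is_cjk_char(ch: str) -> bool:
    return any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES)

def mostly_cjk(text: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of non-space characters are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(is_cjk_char(c) for c in chars) / len(chars) >= threshold
```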

A meta-issue to track retraining of the older models like Italian, Portuguese, French, German, Spanish etc. We'll need to retrain them to incorporate the latest robustness fixes. Also open-source datasets...

meta
quality

In Firefox, the wasm build of the inference engine has a quite large memory footprint. There aren't good memory tools for analyzing the wasm. Instead, we should compile it natively, and...

inference

After a quick investigation, I see that the original parallel corpus was filtered from 70M to 35M sentences. Serbian is digraphic and uses both Latin and Cyrillic scripts. I see...

bug
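
A hedged sketch of one way to check whether the filtering drops one of the two scripts disproportionately, using the basic Cyrillic Unicode block as a heuristic; the corpus path is a placeholder:

```python
from collections import Counter

def detect_script(sentence: str) -> str:
    """Classify a Serbian sentence as Cyrillic, Latin, or mixed/other.

    Uses the basic Cyrillic block (U+0400-U+04FF); a rough heuristic,
    not a full script detector.
    """
    cyr = sum(1 for c in sentence if 0x0400 <= ord(c) <= 0x04FF)
    lat = sum(1 for c in sentence if c.isascii() and c.isalpha())
    if cyr and not lat:
        return "cyrillic"
    if lat and not cyr:
        return "latin"
    return "mixed/other"

# Placeholder corpus path: compare the script distribution before and after filtering.
with open("corpus.sr", encoding="utf-8") as f:
    print(Counter(detect_script(line) for line in f))
```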