firefox-translations-training
Training pipelines for Firefox Translations neural machine translation models
Publication from a Taskcluster group using the `--overide-runs` argument manages to delete the existing runs of a group, but fails to create new runs: ``` wandb: ERROR Error while calling W&B...
Any filtering should happen only in the cleaning stage (eventually in OpusCleaner). The max_words filtering on importing was originally a copy-paste from some random Bergamot bash script and was not...
Nikolay: Length filtering. Because Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, as one word can be made of 1-4 Chinese characters,...
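To illustrate why word-based length filtering breaks down here, below is a minimal sketch that counts non-whitespace characters instead of whitespace-separated tokens. The function names and the `min_chars`/`max_chars` thresholds are hypothetical and are not taken from the pipeline; the actual proposal in the issue may differ.

```python
def non_space_length(sentence: str) -> int:
    """Count non-whitespace characters rather than whitespace-separated words."""
    return sum(1 for ch in sentence if not ch.isspace())


def passes_length_filter(sentence: str, min_chars: int = 1, max_chars: int = 200) -> bool:
    """Keep a sentence only if its character count is within the (assumed) bounds."""
    return min_chars <= non_space_length(sentence) <= max_chars


if __name__ == "__main__":
    zh = "我喜欢机器翻译"
    # A whitespace split sees a single "word", so a max_words filter never triggers.
    print(len(zh.split()))
    # Counting characters gives a usable length signal.
    print(non_space_length(zh))
```

A character-based threshold would still need tuning per language, since one Chinese character carries more content than one Latin letter.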
An experiment for #231. da-en is one of our best models from the spring-2024 run. The teacher ensemble had a COMET score of 0.9013. The student COMET was 0.8950, with...
We can check whether the language pairs that used multilingual Bicleaner models (for which hardrules were disabled) work better than the ones trained with the regular models and enabled hard...
Make sure we map language codes correctly. Maybe there are some other things to adjust.
Nikolay: The Chinese alphabet should be added. In general we can use Unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters In the past I have used something...
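As a rough sketch of the Unicode-range approach mentioned above, the following checks for Han ideographs with a regular expression. The chosen blocks (CJK Unified Ideographs, Extension A, Extension B, and CJK Compatibility Ideographs) are an assumption about what should be covered, and the names `HAN_RE`, `contains_han`, and `han_ratio` are hypothetical. Note that these ranges are shared with Japanese kanji, so the check alone does not distinguish Chinese from Japanese.

```python
import re

# Assumed set of Han ideograph blocks; further extensions exist and are omitted here.
HAN_RE = re.compile(
    "["
    "\u3400-\u4dbf"          # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\U00020000-\U0002a6df"  # CJK Unified Ideographs Extension B
    "]"
)


def contains_han(text: str) -> bool:
    """Return True if at least one character falls in the assumed Han ranges."""
    return HAN_RE.search(text) is not None


def han_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Han ideographs."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if HAN_RE.match(ch)) / len(chars)
```

A ratio such as `han_ratio` could serve as an alphabet check in a cleaning rule, with the threshold left as a tuning decision.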
A meta-issue to track retraining of the older models like Italian, Portuguese, French, German, Spanish etc. We'll need to retrain them to incorporate the latest robustness fixes. Also open-source datasets...
In Firefox the memory size of the inference engine is quite large in wasm. There aren't good memory tools to analyze the wasm. Instead, we should compile it natively, and...
After a quick investigation, I see that the original parallel corpus was filtered from 70M to 35M sentences. Serbian is digraphic and uses both Latin and Cyrillic scripts. I see...
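One way to check how the two scripts are distributed in a corpus is to classify each line by its dominant script. This is a minimal diagnostic sketch using only the standard library; `dominant_script` and `script_histogram` are hypothetical helpers, not part of the pipeline, and it assumes that Unicode character names are a good enough proxy for script.

```python
import unicodedata
from collections import Counter
from typing import Iterable


def dominant_script(line: str) -> str:
    """Return 'cyrillic', 'latin', 'other', or 'none' based on letter counts."""
    counts: Counter = Counter()
    for ch in line:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    if not counts:
        return "none"
    return counts.most_common(1)[0][0]


def script_histogram(lines: Iterable[str]) -> Counter:
    """Count how many lines are dominated by each script."""
    return Counter(dominant_script(line) for line in lines)
```

Running `script_histogram` on the corpus before and after filtering would show whether one script is being dropped disproportionately.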