firefox-translations-training
Training pipelines for Firefox Translations neural machine translation models
I have no idea how to fix this, any help or at least guidance is appreciated. And here is my current log for a new job. It seems to be...
Now that it properly runs on GPUs, CPU utilization is ~10%. We have 40 vCPUs now. We can experiment with it and maybe reduce to 8-16.
If practical, LLMs might be useful for a variety of tasks:
- Quality evaluation
- Data augmentation (including back translation for low-resource languages)
- Using as a teacher model...
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.15 to 1.26.18. Release notes Sourced from urllib3's releases. 1.26.18 Made body stripped from HTTP requests changing the request method to GET after HTTP 303 "See Other"...
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 2.2.3 to 2.3.8. Release notes Sourced from werkzeug's releases. 2.3.8 This is a security release for the 2.3.x feature branch. Changes: https://werkzeug.palletsprojects.com/en/2.3.x/changes/#version-2-3-8 2.3.7 This is a fix...
It would be easier to maintain the docker images for all tasks in this repo compared to updating the generic worker image elsewhere every time we need to add something....
I haven't fully audited the code, but I suspect that the monolingual data is not being deduplicated from the parallel data. For instance, in the `ca-en` model, OpenSubtitles was used...
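The overlap check described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual dedup code: the function name `dedupe_mono` and the normalization (collapse whitespace, lowercase) are assumptions for the example.

```python
import hashlib

def _key(line: str) -> str:
    """Hash a sentence after normalizing whitespace and case,
    so trivially different copies still match."""
    return hashlib.sha1(" ".join(line.split()).lower().encode("utf-8")).hexdigest()

def dedupe_mono(mono_lines, parallel_lines):
    """Drop monolingual sentences that already occur in the parallel data,
    so the same text is not seen twice during training."""
    seen = {_key(line) for line in parallel_lines}
    return [line for line in mono_lines if _key(line) not in seen]

parallel = ["Hello there.", "How are you?"]
mono = ["Hello there.", "A brand new sentence.", "How are you?"]
print(dedupe_mono(mono, parallel))  # only the genuinely new sentence survives
```

Exact-match hashing like this would catch the OpenSubtitles case above; near-duplicate matching (e.g. after punctuation stripping) would need a fuzzier key.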
People keep asking how to help add another language. 1. The first good step would be helping to research datasets. To estimate the feasibility of training, we need statistics on how...
I replaced it with the small one after this bug https://github.com/hplt-project/OpusCleaner/issues/122. We should revert it and see whether it's fixed.
We are still using 2.0 (https://github.com/mozilla/firefox-translations-training/blob/71013bcea0e4647d04d508daf45fe2a96c27ef0d/pipeline/bicleaner/requirements/bicleaner-ai.in), but the latest version is 2.3.2. 2.2.0 adds support for tokenizing by characters (for Chinese).
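The version bump would amount to updating the pin in that requirements file. A hypothetical sketch, assuming the file pins with `==` and nothing else needs to change:

```
# pipeline/bicleaner/requirements/bicleaner-ai.in
bicleaner-ai==2.3.2  # was 2.0; 2.2.0 added character-level tokenization (for Chinese)
```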