firefox-translations-training
Training pipelines for Firefox Translations neural machine translation models
I have no idea how to fix this, any help or at least guidance is appreciated. And here is my current log for a new job. It seems to be...
Now that it properly runs on GPUs, CPU utilization is ~10%. We have 40 vCPUs now. We can experiment with it and maybe reduce to 8-16.
If practical, LLMs might be useful for a variety of tasks:
- Quality evaluation
- Data augmentation (including back translation for low-resource languages)
- Using as a teacher model...
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.15 to 1.26.18. Release notes Sourced from urllib3's releases. 1.26.18 Made body stripped from HTTP requests changing the request method to GET after HTTP 303 "See Other"...
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 2.2.3 to 2.3.8. Release notes Sourced from werkzeug's releases. 2.3.8 This is a security release for the 2.3.x feature branch. Changes: https://werkzeug.palletsprojects.com/en/2.3.x/changes/#version-2-3-8 2.3.7 This is a fix...
It would be easier to maintain the docker images for all tasks in this repo compared to updating the generic worker image elsewhere every time we need to add something....
I haven't fully audited the code, but I suspect that the monolingual data is not being deduplicated from the parallel data. For instance, in the `ca-en` model, OpenSubtitles was used...
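The overlap check described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual dedup code: the function name `dedupe_mono` and the normalization (collapse whitespace, lowercase) are assumptions for the example.

```python
import hashlib

def _key(line: str) -> str:
    """Hash a sentence after normalizing whitespace and case,
    so trivially different copies still match."""
    return hashlib.sha1(" ".join(line.split()).lower().encode("utf-8")).hexdigest()

def dedupe_mono(mono_lines, parallel_lines):
    """Drop monolingual sentences that already occur in the parallel data,
    so the same text is not seen twice during training."""
    seen = {_key(line) for line in parallel_lines}
    return [line for line in mono_lines if _key(line) not in seen]

parallel = ["Hello there.", "How are you?"]
mono = ["Hello there.", "A brand new sentence.", "How are you?"]
print(dedupe_mono(mono, parallel))  # only the genuinely new sentence survives
```

Exact-match hashing like this would catch the OpenSubtitles case above; near-duplicate matching (e.g. after punctuation stripping) would need a fuzzier key.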
People keep asking how to help add another language. 1. The first good step would be helping to research datasets. To estimate the feasibility of training, we need statistics on how...
I replaced it with the small one after this bug https://github.com/hplt-project/OpusCleaner/issues/122. We should revert it and see whether it's fixed.
We are still using 2.0 (https://github.com/mozilla/firefox-translations-training/blob/71013bcea0e4647d04d508daf45fe2a96c27ef0d/pipeline/bicleaner/requirements/bicleaner-ai.in), but the latest version is 2.3.2. 2.2.0 adds support for tokenizing by characters (for Chinese).
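The version bump would amount to updating the pin in that requirements file. A hypothetical sketch, assuming the file pins with `==` and nothing else needs to change:

```
# pipeline/bicleaner/requirements/bicleaner-ai.in
bicleaner-ai==2.3.2  # was 2.0; 2.2.0 added character-level tokenization (for Chinese)
```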