It's basically models, logs and experiments from the [snakemake directory structure](https://mozilla.github.io/firefox-translations-training/snakemake.html#directory-structure):

```
gsutil ls gs://releng-translations-dev
gs://releng-translations-dev/data/
gs://releng-translations-dev/experiments/
gs://releng-translations-dev/logs/
gs://releng-translations-dev/models/
```

We use `/data` to store custom datasets, unlike for snakemake,...
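For example, uploading a custom dataset there could look like the sketch below; the file name and subdirectory are hypothetical, only the `/data` prefix comes from the layout above:

```
# Hypothetical dataset file and subdirectory; only the /data prefix is part of the layout above.
gsutil cp custom-corpus.en-ru.tsv.gz gs://releng-translations-dev/data/custom-corpus/
```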
Yes, it looks correct overall; some additions:

- let's not forget about `quantize` and `evaluate quantized`, they also produce the model and evaluation results
- instead of `retrain` let's use...
The old config is YAML, so let's upload the YAML from Taskcluster instead of JSON. Old en-ru experiment: `gs://releng-translations-dev/experiments/en-ru/ny-retraining/config.yml`

Both train.log and live.log are useful. Vocab is also needed even...
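A rough sketch of the corresponding `gsutil cp` uploads; the config path is the one quoted above, but the log and vocab destinations are illustrative guesses based on the bucket layout, not confirmed locations:

```
# Config path taken from the experiment mentioned above.
gsutil cp config.yml gs://releng-translations-dev/experiments/en-ru/ny-retraining/config.yml

# Hypothetical destinations for logs and vocab; adjust to whatever layout we agree on.
gsutil cp train.log live.log gs://releng-translations-dev/logs/en-ru/ny-retraining/
gsutil cp vocab.spm gs://releng-translations-dev/models/en-ru/ny-retraining/vocab/
```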
> > Vocab is also needed even though we now also store it as model artifacts.
>
> And it belongs in directories like `models/en-ru/retrain1_/vocab` ?

Correct

> > ```...
> > > I did include evaluate-quantized - I assumed that was what ended up in the speed directory - is that wrong?
> > >
> > > Yes,...
We should update to 3.0, so closing in favour of #528. Also, we're already using the models from HF.
Very good idea! I've been thinking about it too.
We run the pipeline per dataset, so the filtered sentences will likely be an artifact of the clean step. What would be useful is seeing what was filtered by each...
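As a rough illustration (not part of the pipeline), one way to see what a clean step dropped overall is to diff its input against its output; the file names here are hypothetical:

```
# Hypothetical files: raw parallel data for one dataset and the output of the clean step.
# `comm -23` prints lines present in the first file but not the second, i.e. the filtered-out pairs.
comm -23 <(sort original.en-ru.tsv) <(sort cleaned.en-ru.tsv) > filtered-out.en-ru.tsv
```

This only shows what was removed in total; attributing removals to each individual filter would still need support from the cleaning tool itself.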
Agreed. Just a note that we already download bicleaner-ai models from Hugging Face using the `bicleaner-ai-download` tool, which uses their library to pull the data.
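For reference, a download looks roughly like the sketch below; the argument order is an assumption and may differ between bicleaner-ai versions, so check `bicleaner-ai-download --help` for the installed release:

```
# Roughly: source language, target language, model flavour, and a local output directory.
# Argument order is an assumption; verify with --help for the installed version.
bicleaner-ai-download en ru full ./bicleaner-ai-models
```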
It's also interesting that the same job completed successfully on restart: https://firefox-ci-tc.services.mozilla.com/tasks/d95IBhOiS0OYp5LR69Sk6w/runs/0/logs/public/logs/live.log

It was likely a temporary issue with FastText model downloading. If OpusCleaner properly failed on error, Taskcluster...