Evgeny Pavlov
Let's push this out of the schedule too. This is a nice-to-have new feature that still requires redesigning the model repo and the extension to be able to reload the model...
I have the same problem on a Quadro RTX 6000 GPU (24 GB) with a workspace size of 21000 and the student model mentioned above. Teacher model training works fine. `Error: Labels not...
The models produced by the Export step are ready to distribute as part of https://github.com/mozilla/firefox-translations-models. There is a Snakemake report that includes part of this information. A report...
This must be a result of the recent mtdata update https://github.com/mozilla/firefox-translations-training/pull/60
It is supposed to be partly fixed by:
1. Snakemake caching: it is implemented, but I couldn't make it work; Snakemake somehow doesn't recognize symlinks to the cached files...
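In case it helps anyone reproduce this, here is a minimal sketch of how the caching is supposed to be wired up; the rule name, output path, and URL are hypothetical placeholders, only the `cache` directive, the `--cache` flag, and the env var are Snakemake's actual mechanism:

```python
# Snakefile sketch: Snakemake's between-workflow caching.
# It needs a shared cache location and opting in on the command line:
#   export SNAKEMAKE_OUTPUT_CACHE=/shared/snakemake-cache
#   snakemake --cache download_corpus ...

rule download_corpus:              # hypothetical rule for illustration
    output:
        "data/original/corpus.en.gz"
    cache: True                    # outputs are stored in SNAKEMAKE_OUTPUT_CACHE and reused
    shell:
        "wget -O {output} https://example.com/corpus.en.gz"
```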
Actually, the naming convention everywhere is `<dataset>.<lang>.gz`, so you can copy the `original`, `clean` and `biclean` directories between language pairs, assuming that cleaning is symmetrical and you use the same monolingual...
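To illustrate, a minimal copy script under those assumptions; the `data/<src-trg>` layout and the pair names are my guesses for the example, so check them against your config:

```python
import shutil
from pathlib import Path

DATA = Path("data")          # hypothetical data root for this example
SRC_PAIR = "en-ru"           # pair whose cleaned data already exists
DST_PAIR = "ru-en"           # reversed pair: cleaning is symmetrical, so it can be reused

# Only these directories are safe to copy; everything else is pair-specific.
for name in ("original", "clean", "biclean"):
    src_dir = DATA / SRC_PAIR / name
    dst_dir = DATA / DST_PAIR / name
    if src_dir.is_dir() and not dst_dir.exists():
        shutil.copytree(src_dir, dst_dir)
        print(f"copied {src_dir} -> {dst_dir}")
```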
`merge.bgen.gz` is an intermediate file for deduplication that doesn't affect pipeline execution; it should probably be deleted after the job is completed. The final results are `devset.bg.gz` and `devset.en.gz`, so...
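For context, the gist of that merge-then-deduplicate step as a standalone sketch; the tab separator, function shape, and input file names are assumptions for illustration, not the pipeline's exact code:

```python
import gzip

def dedupe_parallel(src_in, trg_in, src_out, trg_out, sep="\t"):
    """Deduplicate a parallel corpus: merge both sides into one line per
    sentence pair (the role of the intermediate file), drop repeats,
    and write the two sides back out separately."""
    seen = set()
    with gzip.open(src_in, "rt") as fs, gzip.open(trg_in, "rt") as ft, \
         gzip.open(src_out, "wt") as outs, gzip.open(trg_out, "wt") as outt:
        for s, t in zip(fs, ft):
            pair = s.rstrip("\n") + sep + t.rstrip("\n")
            if pair not in seen:
                seen.add(pair)
                outs.write(s)
                outt.write(t)

# e.g. dedupe_parallel("raw.bg.gz", "raw.en.gz", "devset.bg.gz", "devset.en.gz")
```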
The vocabulary is stored in a directory like `models/en-ru/test/vocab` and is named `vocab.spm`, so you can copy this directory too. You shouldn't copy any other directories besides the ones I mentioned; the results...
Ok, but we do remove long sentences in the cleaning steps: [`if src_len > MAX_LENGTH or trg_len > MAX_LENGTH:`](https://github.com/mozilla/firefox-translations-training/blob/355d9b958eb6216cd316f97c9a20368c75a6ce3b/pipeline/clean/tools/clean_parallel.py#L100).
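For reference, that check as a standalone sketch; the `MAX_LENGTH` value and the whitespace tokenization here are assumptions for the example, the linked `clean_parallel.py` is authoritative:

```python
MAX_LENGTH = 150  # assumed value for illustration; see clean_parallel.py for the real one

def keep_pair(src: str, trg: str) -> bool:
    """Drop sentence pairs where either side is too long."""
    src_len = len(src.split())
    trg_len = len(trg.split())
    if src_len > MAX_LENGTH or trg_len > MAX_LENGTH:
        return False
    return True
```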
I'm trying to understand the logic of fine-tuning on held-out datasets. I imagined that domain adaptation is when we train a model on generic datasets and then fine-tune on in-domain...
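To make the distinction concrete, here is a small sketch of the held-out split itself, i.e. carving a fine-tuning portion out of a dataset rather than using a separate in-domain corpus; the fraction and seed are arbitrary values for the example:

```python
import random

def split_heldout(lines, heldout_fraction=0.1, seed=1):
    """Split a corpus into a training portion and a held-out portion
    that training never sees, to be used later for fine-tuning."""
    rng = random.Random(seed)
    lines = list(lines)
    rng.shuffle(lines)
    cut = int(len(lines) * heldout_fraction)
    return lines[cut:], lines[:cut]   # (train, heldout)
```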