Evgeny Pavlov
Let's push this out of the schedule too. This is a nice-to-have new feature that still requires redesigning the model repo and the extension to be able to reload the model...
I have the same problem on a Quadro RTX 6000 GPU (24 GB) with a workspace size of 21000 and the student model mentioned above. Teacher model training works fine. `Error: Labels not...
The models produced by the Export step are ready to distribute as part of https://github.com/mozilla/firefox-translations-models. There is a Snakemake report that includes part of this information. A report...
This must be a result of the recent mtdata update https://github.com/mozilla/firefox-translations-training/pull/60
It is supposed to be partly fixed by:
1. Snakemake caching: it is implemented, but I couldn't make it work; Snakemake somehow doesn't recognize symlinks to the cached files...
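In case it helps anyone reproduce this, here is a minimal sketch of how the caching is supposed to be wired up; the rule name, output path, and URL are hypothetical placeholders, only the `cache` directive, the `--cache` flag, and the env var are Snakemake's actual mechanism:

```python
# Snakefile sketch: Snakemake's between-workflow caching.
# It needs a shared cache location and opting in on the command line:
#   export SNAKEMAKE_OUTPUT_CACHE=/shared/snakemake-cache
#   snakemake --cache download_corpus ...

rule download_corpus:              # hypothetical rule for illustration
    output:
        "data/original/corpus.en.gz"
    cache: True                    # outputs are stored in SNAKEMAKE_OUTPUT_CACHE and reused
    shell:
        "wget -O {output} https://example.com/corpus.en.gz"
```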
Actually, the naming convention everywhere is `<dataset>.<lang>.gz`, so you can copy the `original`, `clean` and `biclean` directories between language pairs, assuming that cleaning is symmetrical and you use the same monolingual...
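To illustrate, a minimal copy script under those assumptions; the `data/<src-trg>` layout and the pair names are my guesses for the example, so check them against your config:

```python
import shutil
from pathlib import Path

DATA = Path("data")          # hypothetical data root for this example
SRC_PAIR = "en-ru"           # pair whose cleaned data already exists
DST_PAIR = "ru-en"           # reversed pair: cleaning is symmetrical, so it can be reused

# Only these directories are safe to copy; everything else is pair-specific.
for name in ("original", "clean", "biclean"):
    src_dir = DATA / SRC_PAIR / name
    dst_dir = DATA / DST_PAIR / name
    if src_dir.is_dir() and not dst_dir.exists():
        shutil.copytree(src_dir, dst_dir)
        print(f"copied {src_dir} -> {dst_dir}")
```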
`merge.bgen.gz` is an intermediate file for deduplication that doesn't affect pipeline execution; it should probably be deleted after the job is completed. The final results are `devset.bg.gz` and `devset.en.gz`, so...
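For context, the gist of that merge-then-deduplicate step as a standalone sketch; the tab separator, function shape, and input file names are assumptions for illustration, not the pipeline's exact code:

```python
import gzip

def dedupe_parallel(src_in, trg_in, src_out, trg_out, sep="\t"):
    """Deduplicate a parallel corpus: merge both sides into one line per
    sentence pair (the role of the intermediate file), drop repeats,
    and write the two sides back out separately."""
    seen = set()
    with gzip.open(src_in, "rt") as fs, gzip.open(trg_in, "rt") as ft, \
         gzip.open(src_out, "wt") as outs, gzip.open(trg_out, "wt") as outt:
        for s, t in zip(fs, ft):
            pair = s.rstrip("\n") + sep + t.rstrip("\n")
            if pair not in seen:
                seen.add(pair)
                outs.write(s)
                outt.write(t)

# e.g. dedupe_parallel("raw.bg.gz", "raw.en.gz", "devset.bg.gz", "devset.en.gz")
```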
The vocabulary is stored in a directory like `models/en-ru/test/vocab` and is named `vocab.spm`, so you can copy this directory too. You shouldn't copy any other directories besides the ones I mentioned; the results...
Ok, but we do remove long sentences in the cleaning steps: [`if src_len > MAX_LENGTH or trg_len > MAX_LENGTH:`](https://github.com/mozilla/firefox-translations-training/blob/355d9b958eb6216cd316f97c9a20368c75a6ce3b/pipeline/clean/tools/clean_parallel.py#L100).
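For reference, that check as a standalone sketch; the `MAX_LENGTH` value and the whitespace tokenization here are assumptions for the example, the linked `clean_parallel.py` is authoritative:

```python
MAX_LENGTH = 150  # assumed value for illustration; see clean_parallel.py for the real one

def keep_pair(src: str, trg: str) -> bool:
    """Drop sentence pairs where either side is too long."""
    src_len = len(src.split())
    trg_len = len(trg.split())
    if src_len > MAX_LENGTH or trg_len > MAX_LENGTH:
        return False
    return True
```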
I'm trying to understand the logic of fine-tuning on held-out datasets. I imagined that domain adaptation is when we train a model on generic datasets and then fine-tune on in-domain...
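To make the distinction concrete, here is a small sketch of the held-out split itself, i.e. carving a fine-tuning portion out of a dataset rather than using a separate in-domain corpus; the fraction and seed are arbitrary values for the example:

```python
import random

def split_heldout(lines, heldout_fraction=0.1, seed=1):
    """Split a corpus into a training portion and a held-out portion
    that training never sees, to be used later for fine-tuning."""
    rng = random.Random(seed)
    lines = list(lines)
    rng.shuffle(lines)
    cut = int(len(lines) * heldout_fraction)
    return lines[cut:], lines[:cut]   # (train, heldout)
```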