
Train both directions at once

XapaJIaMnu opened this issue 2 years ago • 7 comments

Currently, it's difficult to reuse data between two translation directions, as the majority of the files are placed in direction-specific directories, e.g. exp-name/src-trg (see https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/configs/config.prod.yml#L18), meaning that all datasets will be redownloaded.

Furthermore, data cleaning is done by concatenating the src and trg files and is asymmetrical in places: https://github.com/mozilla/firefox-translations-training/pull/41#discussion_r775474797

In practice, preprocessing can be symmetrical, and once a good student model is trained in one direction, it can even be used to produce backtranslations for the other direction automatically (prior to quantising). By training src-trg and trg-src at the same time, we can avoid duplicating data, repeating lengthy and space-consuming preprocessing, training a second vocabulary, and training one extra model.

XapaJIaMnu avatar Dec 28 '21 21:12 XapaJIaMnu

It is supposed to be partly fixed by:

  1. Snakemake caching - it is implemented, but I couldn't make it work: Snakemake somehow doesn't recognize symlinks to the cached files, so I disabled it. See the cache: False directive for data downloading and monolingual data cleaning in the pipeline. You can try to enable it and check whether it works. I assumed that parallel data cleaning is asymmetrical.
  2. You can specify a path to a backward model in the config - for example, to a student model for the opposite direction:
experiment:
  ...
  backward-model: "..."

If we make caching work, it will cover most of the cases. It will allow reuse of downloaded and cleaned mono data for English across all language pairs, not only pairs of opposite directions. Parallel data cleaning is relatively fast compared to training and decoding, so I think it's fine to do it in both directions. If we still want to reuse the data, we can either copy it manually or try to normalize the language-pair path for this step (sort the languages) so that it always points to the same directory, which would require rethinking the directory structure.
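A minimal sketch of the path-normalization idea, assuming a POSIX shell (the function name and directory mapping are illustrative, not part of the pipeline):

# Sort the two language codes so that both directions resolve to the same
# shared directory, e.g. en-bg and bg-en both map to bg-en.
normalize_pair() {
  printf '%s\n' "$1" "$2" | sort | paste -sd- -
}

normalize_pair en bg   # prints: bg-en
normalize_pair bg en   # prints: bg-en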

eu9ene avatar Jan 04 '22 01:01 eu9ene

I think we shouldn't rely on Snakemake caching to get this to work; it should be part of the pipeline, with something like "train-reverse-model: true" appended at the end of the config. I do recognise that it's a lot of work and that this is mostly an enhancement.

Copying the files over doesn't work because they get concatenated as src-trg, which means one needs to go and manually rename everything. Not something you want to do in general.

Cheers,

Nick

XapaJIaMnu avatar Jan 04 '22 12:01 XapaJIaMnu

Actually, the naming convention everywhere is <corpus_name>.<lang>.gz, so you can copy the directories original, clean and biclean between language pairs, assuming that cleaning is symmetrical and that you use the same monolingual datasets for backward and forward translation.
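A hedged sketch of such a copy, assuming the data layout shown later in this thread (data/data/<pair>/snakemake-<pair>/...; the en-bg/bg-en paths are illustrative):

# Copy the direction-independent data directories from an existing en-bg run
# into a fresh bg-en run; file names are <corpus_name>.<lang>.gz, so no
# renaming is needed.
mkdir -p data/data/bg-en/snakemake-bg-en
for d in original clean biclean; do
  cp -r "data/data/en-bg/snakemake-en-bg/$d" data/data/bg-en/snakemake-bg-en/
done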

Where do they get concatenated as src-trg?

eu9ene avatar Jan 05 '22 20:01 eu9ene

data/data/bg-en/snakemake-bg-en/original/eval$ ls
custom-corpus_  devset.bg.gz  devset.en.gz  merge.bgen.gz

The merge file is direction-dependent.

XapaJIaMnu avatar Jan 05 '22 20:01 XapaJIaMnu

merge.bgen.gz is an intermediate file for deduplication that doesn't affect pipeline execution; it should probably be deleted after the job completes. The final results are devset.bg.gz and devset.en.gz, so it's safe to copy them.
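A small follow-up to the copy sketch above: if you copy the eval directory, you can exclude the intermediate merge file so only the final per-language files travel (rsync and its flags are standard tools; the paths are the ones from the listing above):

# Copy the eval data while skipping the direction-dependent merge artifact.
rsync -a --exclude 'merge.*.gz' \
  data/data/bg-en/snakemake-bg-en/original/eval/ \
  data/data/en-bg/snakemake-en-bg/original/eval/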

eu9ene avatar Jan 05 '22 21:01 eu9ene

I see, so in theory I could do a blanket copy of all the clean, biclean etc. directories, and the only thing that would be rebuilt is the vocabulary (since it's named vocab.$SRC$TRG.spm)?

XapaJIaMnu avatar Jan 06 '22 07:01 XapaJIaMnu

The vocabulary is stored in a directory like models/en-ru/test/vocab and named vocab.spm, so you can copy this directory too. You shouldn't copy any directories other than the ones I mentioned; their results are not usable for training in the opposite direction.
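A minimal sketch of that vocabulary reuse, assuming the layout quoted above (models/<pair>/<experiment>/vocab; the ru-en pair and the test experiment name are illustrative):

# Reuse the en-ru vocabulary for the opposite direction; vocab.spm carries no
# direction in its name, so a plain copy is enough.
mkdir -p models/ru-en/test/vocab
cp models/en-ru/test/vocab/vocab.spm models/ru-en/test/vocab/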

eu9ene avatar Jan 06 '22 23:01 eu9ene

Taskcluster caching is pretty robust these days for this type of issue.

gregtatum avatar Apr 09 '24 21:04 gregtatum