
Fine-tune teachers to parallel corpora


This introduces two new options:

  1. In addition to external test sets, extract held-out test sets from each parallel training corpus and evaluate models on those held-out sets as well. This is controlled by held-out-dev-test, held-out-dev-size, and held-out-test-size in the datasets section of the config. If using held-out sets but not teacher fine-tuning (see item 2), held-out-dev-size should be set to 0, since the held-out dev sets will not be used.

     [Image: rulegraph of the previously existing workflow]
     [Image: rulegraph with held-out test sets]

  2. Add a step that fine-tunes a teacher model to each of the parallel corpora. The parallel data is then forward-translated with the corresponding fine-tuned teacher. This is controlled by fine-tune-to-corpus in the experiment section of the config (see the config sketch after this list). If fine-tune-to-corpus: true, held-out dev and test sets must be used.

     [Image: rulegraph with teachers fine-tuned to parallel corpora]
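As a minimal sketch of how the two options fit together in a config: the option names below come from this PR's description, while the size values and the surrounding structure are illustrative assumptions, not defaults from the repository.

```yaml
# Illustrative sketch only: option names are from this PR; size values
# and surrounding keys are assumptions, not repository defaults.
experiment:
  # Fine-tune a separate teacher to each parallel corpus (item 2).
  # Requires the held-out dev and test sets enabled below.
  fine-tune-to-corpus: true

datasets:
  # Extract held-out dev/test sets from each parallel training corpus (item 1).
  held-out-dev-test: true
  held-out-dev-size: 1000    # may be 0 only when fine-tune-to-corpus is false
  held-out-test-size: 1000
```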

If held-out-dev-test and fine-tune-to-corpus are set to false or not set in the config, the workflow should be exactly the same as it was without my changes.

I have not yet run a sufficiently large experiment with the new workflow, so I can't say how much it influences multi-domain performance in real-life scenarios.

These changes can be kept in a separate branch if needed.

lisskor · Apr 12 '22

I'm trying to understand the logic of fine-tuning on held-out datasets. I imagined domain adaptation as training a model on generic datasets and then fine-tuning it on in-domain ones (for example, medical data). Here we train on all provided datasets including back-translated data, then fine-tune on parallel data only, then split all parallel datasets into train/dev/test, fine-tune on the train part of each provided parallel dataset separately, decode each dataset with the corresponding teacher, and finally merge it all together for student training. I hope I got that right.

So there is no option to train on generic parallel datasets and fine-tune on in-domain ones. Am I misunderstanding how it's intended to be used? Is it just meant to improve overall student quality by having decoding done by specialized teachers, with nothing to do with domain adaptation?

Another thing that comes to mind: usually we train on a large number of the available datasets, and I suspect fine-tuning a teacher for each of them might be quite impractical (it would take a very long time). So I would propose introducing another section in the config with a list of datasets to fine-tune on; a hypothetical sketch follows below. That would cover the generic domain-adaptation use case and be applicable in real-life scenarios. Please let me know if I got it all wrong =)
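Something along these lines, as a purely hypothetical illustration; the fine-tune section name, its layout, and the dataset identifiers are invented for this example, not taken from the actual config schema.

```yaml
# Hypothetical illustration of the proposal; the fine-tune section name,
# its layout, and the dataset identifiers are invented for this example.
datasets:
  train:
    - opus_ParaCrawl/v9
    - opus_OpenSubtitles/v2018
    - custom-corpus_medical
  fine-tune:
    # Only these corpora would get specialized teachers; everything else
    # would be decoded with the generic teacher.
    - custom-corpus_medical
```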

eu9ene · Apr 22 '22

Please sync with main; it includes a simple CI now.

eu9ene · Apr 22 '22

Yes, the goal here is to do decoding with specialized teachers and improve the resulting student's performance on the corresponding datasets. We would still like to end up with a single student model. You got it all correct (except that the held-out train/dev/test sets are actually created before any training, so the teacher models do not see the test data either).

As for introducing manually defined groups of corpora, that's exactly what I was planning to add as a next step!

lisskor · May 11 '22

This work was part of project Bergamot; we can revisit it later if we want to implement this approach as part of the TaskCluster pipeline.

eu9ene · Sep 20 '23