Limit the amount of data used for distillation

Open gregtatum opened this issue 3 months ago • 1 comments

In #771 I ran an experiment to see how the size of the distillation corpus affects the students' COMET scores. Adding more data to this step did not change the COMET score beyond the standard deviation of student training runs (±0.12 COMET).

Synthesizing the training pairs from the monolingual data is one of the more expensive parts of the pipeline, so we should limit the amount of data we throw at it.

For this work we need to:

  1. Determine the cut-off threshold.
  2. Determine how we mix the source side of the parallel corpus with the source monolingual data.

1. Threshold cut-off

In our 1:1, @eu9ene proposed 50 million, which feels like a reasonable initial threshold to me. He mentioned that we shouldn't rely entirely on the evaluation metrics, since more data diversity could produce a better general model for translating the web. There is a risk that our evaluation data is not diverse enough to capture this, so we should be conservative in how much we cut off.

I think we can probably go even lower if we want to, as the results were the same for 30M in da-en. I still have an experiment running with 1M and 10,000 to further test the limits here.

We should verify that these results still hold for a Balto-Slavic language pair, like en-lt.

2. How to mix

I'm not sure how we want to mix our data, or whether @eu9ene has thoughts here. We could collect all of our source-side parallel data and all of the available monolingual data, and then mix and truncate it. This is what I was doing in my experiment.

It's likely that we'll have more parallel source data than the 50 million cut-off for many languages.
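
As a rough illustration of the "mix and truncate" option above, here is a minimal sketch. It is not the pipeline's actual code: the function name `mix_and_truncate`, the `max_lines` parameter, and counting the cut-off in lines are all assumptions made for the example, and it shuffles everything in memory, which a real corpus of this size may not allow.

```python
import random
from typing import Iterable, List


def mix_and_truncate(
    parallel_source: Iterable[str],
    mono_source: Iterable[str],
    max_lines: int = 50_000_000,  # assumed unit: lines; the issue only says "50 million"
    seed: int = 0,
) -> List[str]:
    """Pool both sources, shuffle them together, and keep at most max_lines."""
    # Hypothetical helper: pool the source side of the parallel corpus
    # with the source-language monolingual data.
    combined = list(parallel_source) + list(mono_source)

    # A fixed seed keeps the mix reproducible between runs.
    rng = random.Random(seed)
    rng.shuffle(combined)

    # Truncate to the cut-off; shorter inputs are returned whole.
    return combined[:max_lines]


if __name__ == "__main__":
    parallel = ["parallel sentence 1", "parallel sentence 2", "parallel sentence 3"]
    mono = ["mono sentence 1", "mono sentence 2"]
    for line in mix_and_truncate(parallel, mono, max_lines=4, seed=1):
        print(line)
```

Because the two sources are pooled before shuffling, each contributes roughly in proportion to its size. If the parallel data alone exceeds the cut-off, the monolingual data still gets a proportional share, but that share may be small.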

gregtatum, Oct 29 '24 15:10