
Out of memory on shuffling huge datasets

Open eu9ene opened this issue 3 years ago • 5 comments

A 300M dataset exhausts 128 GB of RAM during shuffling.

The workaround is to shuffle the dataset after the merge step, disable `--shuffle-in-ram`, and use `--shuffle batches`.
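A minimal sketch of the corresponding training config (these are real Marian option names, but the excerpt is illustrative, not taken from this repo's configs):

```yaml
# Hypothetical excerpt from a Marian training config.
# The corpus is pre-shuffled on disk after the merge step,
# so Marian itself only needs to shuffle the batch order.
shuffle: batches        # permute batch order only, not the data itself
shuffle-in-ram: false   # do not load the whole corpus into memory to shuffle
```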

eu9ene avatar Aug 26 '21 00:08 eu9ene

This might be a bug in Marian: memory shouldn't grow once `--shuffle-in-ram` is removed and the `--shuffle data` mode is used instead. It was discussed in https://github.com/mozilla/firefox-translations-training/pull/70#discussion_r800975032

eu9ene avatar Feb 10 '22 00:02 eu9ene

Training teachers with `--shuffle batches` leads to degraded training curves like the one below. Maybe other factors are at play here.

[Screenshot: training curves, Jun 10 2022]

eu9ene avatar Jun 10 '22 19:06 eu9ene

Related Marian issue: https://github.com/marian-nmt/marian-dev/issues/148

eu9ene avatar Jun 10 '22 23:06 eu9ene

`--sqlite` should help, but I've found it slow in practice.

XapaJIaMnu avatar Jun 11 '22 08:06 XapaJIaMnu

I suspect the out-of-memory behaviour, even when `--shuffle-in-ram` is not used, comes from here:

https://github.com/marian-nmt/marian-dev/blob/042ed8f2e23557d0cdb956aea7d79be8c817e0b0/src/data/corpus.cpp#L227-L241

Assuming that's actually the cause, we could replace it with a two-pass shuffle:

  1. Read the unshuffled dataset and write each line to one of N temp files, chosen at random per line. How large N needs to be can probably be determined from the size of the input file and the memory available for shuffling; this may be trickier to estimate if the input is gzipped.
  2. Shuffle each temp file as is done now: read it into memory and run `std::shuffle`.
  3. Concatenate the temp files into the final shuffled output, or implement a reader class that takes ownership of the temp files and reads them consecutively as if they were one.
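The steps above can be sketched as follows. This is a hypothetical standalone implementation, not Marian code; the function name `twoPassShuffle` and its parameters are made up for illustration. Peak memory is roughly the size of the largest shard rather than the whole corpus.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Two-pass external shuffle: holds at most ~|input|/numShards lines
// in memory at once, using temp files on disk for the rest.
void twoPassShuffle(const std::string& input, const std::string& output,
                    size_t numShards, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<size_t> pick(0, numShards - 1);

  // Pass 1: scatter each line into a randomly chosen shard file.
  std::vector<std::ofstream> shards;
  std::vector<std::string> shardNames;
  for (size_t i = 0; i < numShards; ++i) {
    shardNames.push_back(output + ".shard" + std::to_string(i));
    shards.emplace_back(shardNames.back());
  }
  std::ifstream in(input);
  std::string line;
  while (std::getline(in, line))
    shards[pick(rng)] << line << '\n';
  for (auto& s : shards) s.close();

  // Pass 2: load each shard, shuffle it in memory, append to the output,
  // and delete the temp file.
  std::ofstream out(output);
  for (const auto& name : shardNames) {
    std::vector<std::string> buf;
    std::ifstream shard(name);
    while (std::getline(shard, line)) buf.push_back(line);
    std::shuffle(buf.begin(), buf.end(), rng);
    for (const auto& l : buf) out << l << '\n';
    std::remove(name.c_str());
  }
}
```

Step 3's "reader class" variant would skip the final concatenation and stream the shards directly instead.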

Edit: or do it like this

Edit: as for why `--shuffle batches` performs worse: in the training loop the corpus is re-shuffled repeatedly (the `batchGenerator->prepare()` call). I don't know how often this happens in practice, but I can imagine that without that data-level shuffle the order isn't random enough.
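A toy sketch (not Marian code; names and batch size are made up) of why batch-level shuffling is less random: lines that were adjacent in the corpus always land in the same batch, no matter how the batches themselves are permuted.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Toy model of "--shuffle batches": split the corpus into fixed-size
// batches, then permute only the order of the batches. The contents of
// each batch stay in original corpus order.
std::vector<std::vector<int>> shuffleBatches(const std::vector<int>& data,
                                             size_t batchSize, unsigned seed) {
  std::vector<std::vector<int>> batches;
  for (size_t i = 0; i < data.size(); i += batchSize)
    batches.emplace_back(data.begin() + i,
                         data.begin() + std::min(i + batchSize, data.size()));
  std::mt19937 rng(seed);
  std::shuffle(batches.begin(), batches.end(), rng);
  return batches;
}
```

Every batch remains a contiguous run of the original corpus, so any local ordering bias in the data survives every epoch.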

jelmervdl avatar Jun 11 '22 17:06 jelmervdl

I haven't seen this for some time, and I assume it was fixed by switching to OpusTrainer.

eu9ene avatar May 08 '24 23:05 eu9ene