
Out of memory on shuffling huge datasets

Open eu9ene opened this issue 3 years ago • 5 comments

A 300M dataset exhausts 128 GB of RAM during shuffling.

The workaround is to shuffle the dataset after the merge step, disable `--shuffle-in-ram`, and use `--shuffle batches`.
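A minimal sketch of the corresponding training config (these are real Marian option names, but the excerpt is illustrative, not taken from this repo's configs):

```yaml
# Hypothetical excerpt from a Marian training config.
# The corpus is pre-shuffled on disk after the merge step,
# so Marian itself only needs to shuffle the batch order.
shuffle: batches        # permute batch order only, not the data itself
shuffle-in-ram: false   # do not load the whole corpus into memory to shuffle
```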

eu9ene avatar Aug 26 '21 00:08 eu9ene

This might be a bug in Marian: memory shouldn't grow once `--shuffle-in-ram` is removed and the `--shuffle data` mode is used instead. It was discussed in https://github.com/mozilla/firefox-translations-training/pull/70#discussion_r800975032

eu9ene avatar Feb 10 '22 00:02 eu9ene

Training teachers with `--shuffle batches` leads to degraded training curves like the one below. Maybe other factors are at play here.

[Screenshot: training curves, Jun 10 2022]

eu9ene avatar Jun 10 '22 19:06 eu9ene

Related Marian issue: https://github.com/marian-nmt/marian-dev/issues/148

eu9ene avatar Jun 10 '22 23:06 eu9ene

`--sqlite` should help, but I've found it slow in practice.

XapaJIaMnu avatar Jun 11 '22 08:06 XapaJIaMnu

I suspect the out-of-memory behaviour, even when `--shuffle-in-ram` is not used, comes from here:

https://github.com/marian-nmt/marian-dev/blob/042ed8f2e23557d0cdb956aea7d79be8c817e0b0/src/data/corpus.cpp#L227-L241

Assuming that's actually the cause, we could replace it with a two-pass shuffle:

  1. Read the unshuffled dataset and write each line to one of N temp files, chosen at random per line. How large N needs to be can probably be determined from the size of the input file and the memory available for shuffling; this may be trickier to estimate if the input is gzipped.
  2. Shuffle each temp file as is done now: read it into memory and run `std::shuffle`.
  3. Concatenate the temp files into the final shuffled output, or implement a reader class that takes ownership of the temp files and reads them consecutively as if they were one.
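The steps above can be sketched as follows. This is a hypothetical standalone implementation, not Marian code; the function name `twoPassShuffle` and its parameters are made up for illustration. Peak memory is roughly the size of the largest shard rather than the whole corpus.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Two-pass external shuffle: holds at most ~|input|/numShards lines
// in memory at once, using temp files on disk for the rest.
void twoPassShuffle(const std::string& input, const std::string& output,
                    size_t numShards, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<size_t> pick(0, numShards - 1);

  // Pass 1: scatter each line into a randomly chosen shard file.
  std::vector<std::ofstream> shards;
  std::vector<std::string> shardNames;
  for (size_t i = 0; i < numShards; ++i) {
    shardNames.push_back(output + ".shard" + std::to_string(i));
    shards.emplace_back(shardNames.back());
  }
  std::ifstream in(input);
  std::string line;
  while (std::getline(in, line))
    shards[pick(rng)] << line << '\n';
  for (auto& s : shards) s.close();

  // Pass 2: load each shard, shuffle it in memory, append to the output,
  // and delete the temp file.
  std::ofstream out(output);
  for (const auto& name : shardNames) {
    std::vector<std::string> buf;
    std::ifstream shard(name);
    while (std::getline(shard, line)) buf.push_back(line);
    std::shuffle(buf.begin(), buf.end(), rng);
    for (const auto& l : buf) out << l << '\n';
    std::remove(name.c_str());
  }
}
```

Step 3's "reader class" variant would skip the final concatenation and stream the shards directly instead.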

Edit: or do it like this

Edit: as for why `--shuffle batches` performs worse: in the training loop the corpus is re-shuffled repeatedly (the `batchGenerator->prepare()` call). I don't know how often this happens in practice, but I can imagine that without that data-level shuffle the order isn't random enough.
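A toy sketch (not Marian code; names and batch size are made up) of why batch-level shuffling is less random: lines that were adjacent in the corpus always land in the same batch, no matter how the batches themselves are permuted.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Toy model of "--shuffle batches": split the corpus into fixed-size
// batches, then permute only the order of the batches. The contents of
// each batch stay in original corpus order.
std::vector<std::vector<int>> shuffleBatches(const std::vector<int>& data,
                                             size_t batchSize, unsigned seed) {
  std::vector<std::vector<int>> batches;
  for (size_t i = 0; i < data.size(); i += batchSize)
    batches.emplace_back(data.begin() + i,
                         data.begin() + std::min(i + batchSize, data.size()));
  std::mt19937 rng(seed);
  std::shuffle(batches.begin(), batches.end(), rng);
  return batches;
}
```

Every batch remains a contiguous run of the original corpus, so any local ordering bias in the data survives every epoch.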

jelmervdl avatar Jun 11 '22 17:06 jelmervdl

I haven't seen this for some time, and I assume it was fixed by switching to OpusTrainer.

eu9ene avatar May 08 '24 23:05 eu9ene