firefox-translations-training
Out of memory on shuffling huge datasets
300M dataset, 128 GB RAM
The workaround is to shuffle the dataset after the merge step, disable `--shuffle-in-ram`, and use `--shuffle batches` (see the sketch below for the external shuffle part).

This might be a bug in Marian: memory shouldn't grow once `--shuffle-in-ram` is removed, and we should be able to use the `--shuffle data` mode instead. It was discussed in https://github.com/mozilla/firefox-translations-training/pull/70#discussion_r800975032
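As a rough illustration of the "shuffle after the merge step" part, here is a minimal Python sketch of an external shuffle that never holds the full corpus in memory. The file names, the offset-based approach, and the seed are my own assumptions, not something the pipeline actually does:

```python
import random
from array import array

def shuffle_by_offsets(src_path: str, dst_path: str, seed: int = 1111) -> None:
    """Shuffle a large plain-text file while keeping only line offsets in memory."""
    offsets = array("q")  # 8 bytes per line instead of the line text itself
    pos = 0
    with open(src_path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)

    random.Random(seed).shuffle(offsets)

    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for off in offsets:
            src.seek(off)
            line = src.readline()
            # Make sure a final line without "\n" does not merge with its neighbour.
            dst.write(line if line.endswith(b"\n") else line + b"\n")

if __name__ == "__main__":
    # Hypothetical file names; the merged corpus is assumed to be uncompressed.
    shuffle_by_offsets("corpus.merged.txt", "corpus.shuffled.txt")
```

In practice something like GNU `shuf` (or the two-pass approach discussed below) is probably faster, since random seeks over a ~300M-line file are slow; the point is only that the text never has to sit in RAM, just one 8-byte offset per line (roughly 2.4 GB for 300M lines).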
Training teachers with `--shuffle batches` leads to such training curves. Maybe other factors are at play here.
Related Marian issue: https://github.com/marian-nmt/marian-dev/issues/148
`--sqlite` should help, but I've found it slow in practice.
I suspect that running out of memory, even when `--shuffle-in-ram` is not used, comes from here:
https://github.com/marian-nmt/marian-dev/blob/042ed8f2e23557d0cdb956aea7d79be8c817e0b0/src/data/corpus.cpp#L227-L241
Assuming that's actually the cause, we could replace it with a two-pass shuffle (see the sketch after this list):
- Read the unshuffled dataset and write each line to one of N temp files, chosen at random for each line. How large N needs to be can probably be determined from the size of the input file and the amount of memory available for shuffling. This might be trickier to estimate if the input file is gzipped.
- Shuffle each of the temp files as is done now: read it into memory, do `std::shuffle`.
- Concatenate the temp files into the shuffled temp files.

Or implement some reader class that takes ownership of the temp files and reads from them consecutively as if they were one.
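A minimal Python sketch of that two-pass idea, just to illustrate the shape of it (the real change would live in Marian's C++ corpus code; the file names, chunk count, and seed here are my own assumptions):

```python
import os
import random
import tempfile

def two_pass_shuffle(src_path: str, dst_path: str, num_chunks: int = 100, seed: int = 1111) -> None:
    """Shuffle a large text file using roughly 1/num_chunks of it in memory at a time."""
    rng = random.Random(seed)

    # Pass 1: scatter each line into one of N temp files, chosen at random per line.
    tmp_dir = tempfile.mkdtemp(prefix="shuffle_")
    chunk_paths = [os.path.join(tmp_dir, f"chunk_{i}.txt") for i in range(num_chunks)]
    chunks = [open(p, "w", encoding="utf-8") for p in chunk_paths]
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            # Normalise a possible final line without "\n" so chunks stay line-aligned.
            chunks[rng.randrange(num_chunks)].write(line if line.endswith("\n") else line + "\n")
    for f in chunks:
        f.close()

    # Pass 2: shuffle each temp file in memory and append it to the output.
    with open(dst_path, "w", encoding="utf-8") as dst:
        for path in chunk_paths:
            with open(path, encoding="utf-8") as f:
                lines = f.readlines()  # only ~1/num_chunks of the corpus at once
            rng.shuffle(lines)
            dst.writelines(lines)
            os.remove(path)
    os.rmdir(tmp_dir)

if __name__ == "__main__":
    two_pass_shuffle("corpus.merged.txt", "corpus.shuffled.txt")
```

For a parallel corpus, the source and target sides would have to be scattered and shuffled with the same random decisions so that sentence pairs stay aligned; inside Marian the equivalent would operate on the already-paired sentence tuples.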
Edit: or do it like this
Edit: for why `--shuffle batches` performs worse: in the training loop the corpus is shuffled repeatedly (the `batchGenerator->prepare()` call). I don't know how often this happens in practice, but I can imagine that without that shuffle the order isn't random enough.
I haven't seen this for some time, and I assume it's fixed by using OpusTrainer.