
Chunked data loading for large datasets

Open zhonge opened this issue 4 years ago • 6 comments

The default behavior is to load the whole dataset into memory for training; however, for particularly large datasets that don't fit in memory, an option (--lazy) is currently provided to load images from disk on the fly during training. This is a very bad filesystem access pattern, and the latency of disk access can be a severe bottleneck in some cases.
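For illustration, here is a minimal sketch of what the --lazy access pattern amounts to (not cryoDRGN's actual implementation; the raw float32 particle file and class name are hypothetical):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyParticles(Dataset):
    """Sketch of on-the-fly loading: one small seek + read per image.
    With shuffled minibatches this becomes millions of scattered random
    reads per epoch, whereas in-memory loading is one sequential read."""

    def __init__(self, path, n_images, D):
        self.path = path
        self.n_images = n_images
        self.D = D
        self.itemsize = D * D * 4  # bytes per float32 image

    def __len__(self):
        return self.n_images

    def __getitem__(self, i):
        # Seek to image i and read exactly one image from disk.
        with open(self.path, "rb") as f:
            f.seek(i * self.itemsize)
            buf = np.frombuffer(f.read(self.itemsize), dtype=np.float32)
        return torch.from_numpy(buf.reshape(self.D, self.D).copy())
```

Each shuffled minibatch of size B then costs B separate seeks instead of B contiguous array slices, which is why disk latency dominates.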

Probably what makes more sense is to preprocess the data into chunks (how big?) and train on each chunk sequentially. Slightly less randomness in the mini-batches, but assuming there's no order in the dataset, this likely doesn't matter. The FFT + normalization could also be folded into this preprocessing step. The main downside I see is additional disk space usage for storing the chunks.
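A rough sketch of that idea (the chunk format, file names, and train_step / batch_size are made up for illustration; this is not cryoDRGN code):

```python
import numpy as np

def preprocess_into_chunks(images, chunk_size, out_prefix):
    """Split a particle stack into chunks, doing the FFT + normalization
    once up front so that training only streams preprocessed chunks."""
    paths = []
    for start in range(0, len(images), chunk_size):
        chunk = images[start:start + chunk_size]
        # Precompute the centered 2D FFT of each image.
        ft = np.fft.fftshift(
            np.fft.fft2(np.fft.ifftshift(chunk, axes=(-2, -1))),
            axes=(-2, -1),
        )
        # Normalize to zero mean / unit std across the chunk.
        ft = (ft - ft.mean()) / ft.std()
        path = f"{out_prefix}.{start // chunk_size:04d}.npy"
        np.save(path, ft.astype(np.complex64))
        paths.append(path)
    return paths

def train_on_chunks(paths, batch_size, train_step):
    """Train sequentially over the chunks, shuffling only within each chunk."""
    for path in paths:
        chunk = np.load(path)  # one large sequential read per chunk
        perm = np.random.permutation(len(chunk))
        for start in range(0, len(chunk), batch_size):
            train_step(chunk[perm[start:start + batch_size]])
```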

zhonge avatar Aug 07 '20 15:08 zhonge

Hello @zhonge

I wonder if you have any updates regarding this issue. We are working on a very large dataset (~2.5M images) which can't fit in RAM, and using the --lazy flag makes the training too slow to be useful.

Thanks in advance, Alex

aleksspasic avatar Jan 07 '21 22:01 aleksspasic

Thanks for the heads up. I can prioritize this feature.

zhonge avatar Jan 12 '21 01:01 zhonge

Just as an additional data point -- for a 1.4M particle dataset (D=128) I'm trying out, the training time per epoch goes from 43 min with the whole dataset loaded into memory to 5:50 hr with the --lazy flag.

Todo: Look into the access patterns for processing large datasets in RELION/cryoSPARC/etc.

zhonge avatar Apr 21 '21 19:04 zhonge

@aleksspasic have you tried temporarily copying the particle stack to an SSD (assuming you have one) and running the cryoDRGN training while reading particles from there? Optimizing disk access patterns will only get you so far if the data is on regular hard drives.

Guillawme avatar Apr 22 '21 08:04 Guillawme

I added a new script, cryodrgn preprocess, which preprocesses images before training and significantly reduces the memory requirement of cryodrgn train_vae. This is now available at the top of the tree (commit d4b21957beb7d92952bbd2cdfe34ab9401113e6e). I'm going to beta test this a little further before officially releasing it.

Some brief documentation here (linked to in the tutorial): https://www.notion.so/cryodrgn-preprocess-d84a9d9df8634a6a8bfd32d6b5e737ef

zhonge avatar Jul 10 '21 19:07 zhonge

@vineetbansal, we should think about how to implement chunked data loading instead of the current options of either 1) loading the whole dataset into memory or 2) accessing each image on the fly.

One issue is that images are usually ordered (e.g. by defocus). One option is to shuffle the entire dataset, but this seems extremely suboptimal for many reasons... Another option could be to randomly sample a couple of smaller chunks, load them all, and then train on random minibatches within the combined chunks.
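A rough sketch of what that second option could look like (all names are hypothetical; chunk_paths is assumed to point at preprocessed chunk files on disk, and train_step is an assumed training callback):

```python
import numpy as np

def chunked_epoch(chunk_paths, chunks_in_memory, batch_size, train_step):
    """One epoch: each 'mega-batch' is a few randomly chosen chunks loaded
    into RAM with large sequential reads; minibatches are then shuffled
    within the combined chunks only. Randomness is limited to whatever
    fits in memory at once, which should be acceptable if images within
    a chunk are not strongly ordered."""
    order = np.random.permutation(len(chunk_paths))
    for i in range(0, len(order), chunks_in_memory):
        # Load a handful of chunks at once.
        group = [np.load(chunk_paths[j]) for j in order[i:i + chunks_in_memory]]
        data = np.concatenate(group)
        # Shuffle minibatches within the combined chunks.
        perm = np.random.permutation(len(data))
        for start in range(0, len(data), batch_size):
            train_step(data[perm[start:start + batch_size]])
```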

zhonge avatar Nov 15 '22 15:11 zhonge