
Select particle subset during preprocess?

Open olibclarke opened this issue 1 year ago • 7 comments

Hi,

I have a very large dataset of a membrane protein (~1.2M particles), refined in cryoSPARC. I would like to test cryoDRGN on this dataset, but ideally on a subset first, before trying the entire dataset.

Currently, it seems like I need to preprocess or downsample the entire dataset before I can select a subset.

Would it be possible to add an option to preprocess or downsample that processes just a random selection of, say, 100k particles? Preprocessing such a large dataset takes a very long time (many hours), so it would be convenient to be able to test out different settings on a smaller subset first.

Cheers Oli

olibclarke avatar Mar 25 '23 17:03 olibclarke

That's a good suggestion. It should be very straightforward to add an --ind flag to cryodrgn downsample or cryodrgn preprocess.

zhonge avatar Mar 25 '23 17:03 zhonge

I added an --ind flag to cryodrgn downsample. It looks like there is already an --ind flag for cryodrgn preprocess.

You can generate a random 100k selection with the command:

(cryodrgn) $ cryodrgn_utils select_random -h
usage: cryodrgn_utils select_random [-h] -o O [-n N] [-s S] [--frac FRAC] [--seed SEED] N

Select a random subset of particles

positional arguments:
  N            Total number of particles

options:
  -h, --help   show this help message and exit
  -o O         Output selection (.pkl)
  -n N         Number of particles to select
  -s S         Optionally save out inverted selection (.pkl)
  --frac FRAC  Optionally specify fraction of particles to select
  --seed SEED  Random seed (default: 0)

For example (assuming there are exactly 1.2M particles in your dataset):

(cryodrgn) $ cryodrgn_utils select_random 1200000 -n 100000 -o ind100k.pkl
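For intuition, the selection this writes out is just a pickled array of particle indices. A minimal NumPy sketch of the equivalent logic (my assumption about what select_random does internally, not its actual implementation):

```python
import pickle
import numpy as np

N_TOTAL = 1_200_000   # total number of particles in the dataset
N_SELECT = 100_000    # size of the random subset

rng = np.random.default_rng(seed=0)
# Draw a duplicate-free random subset of particle indices and sort it.
ind = np.sort(rng.choice(N_TOTAL, size=N_SELECT, replace=False))

# Save the selection in the .pkl format the downstream commands expect.
with open("ind100k.pkl", "wb") as f:
    pickle.dump(ind, f)
```

Loading the resulting .pkl back with pickle.load is also a quick way to sanity-check how many indices were selected.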

zhonge avatar Mar 25 '23 23:03 zhonge

Thanks Ellen - sorry, I overlooked that there was already one for preprocess; you're right, it does exist!

olibclarke avatar Mar 25 '23 23:03 olibclarke

Hi Ellen, I tried this but when I proceed to train_vae using the selected subset, I run into an error:

(cryodrgn) user@ubuntu:~/processing/cryosparc_projects/francesca/P40/J1649$ cryodrgn train_vae cryodrgn_data/preprocessed/80/cryodrgn_particles.80.0.ft.mrcs --ctf cryodrgn_ctf.pkl --poses cryodrgn_poses.pkl --zdim 8 -n 50 -o test/00_vae_80 --preprocessed --ind ind100k.pkl > test.log
Traceback (most recent call last):
  File "/home/user/software/miniconda3/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/__main__.py", line 72, in main
    args.func(args)
  File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 621, in main
    data = dataset.PreprocessedMRCData(
  File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/dataset.py", line 246, in __init__
    particles = particles[ind]
IndexError: index 100006 is out of bounds for axis 0 with size 100000

Am I doing something wrong? The poses and ctf params were generated from the full dataset, but that shouldn't be a problem, right?

ind100k.pkl was generated like so:

cryodrgn_utils select_random 1180486 -n 100000 -o ind100k.pkl

And this was the preprocess command:

cryodrgn preprocess J1649_007_particles.cs -D 80 -o cryodrgn_data/preprocessed/80/cryodrgn_particles.80.mrcs --datadir ../ --ind ind100k.pkl

olibclarke avatar Mar 26 '23 00:03 olibclarke

Ah, I forgot to mention. Since the particle stack is now a subset, you'll have to filter the ctf.pkl and pose.pkl by the selection as well before giving it to cryodrgn train_vae. You can use cryodrgn_utils filter_pkl, e.g.:

(cryodrgn) $ cryodrgn_utils filter_pkl ctf.pkl --ind ind100k.pkl -o ctf.100k.pkl
(cryodrgn) $ cryodrgn_utils filter_pkl pose.pkl --ind ind100k.pkl -o pose.100k.pkl

The --ind selection given to train_vae is applied to each of the inputs (particles, poses, and CTF parameters) without any assumption about whether they have already been filtered, so all the inputs have to match. In your case, it's trying to filter the already-filtered cryodrgn_particles.80.mrcs, which is why it runs into an out-of-bounds error.
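The mismatch can be reproduced in a few lines of NumPy (an illustration of the indexing error, not cryoDRGN's actual code):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# A random 100k selection out of ~1.2M particles, like select_random produces.
ind = np.sort(rng.choice(1_200_000, size=100_000, replace=False))

full_stack = np.zeros((1_200_000, 1))   # stand-in for the full particle stack
subset = full_stack[ind]                # preprocess --ind already applied this

# Passing --ind to train_vae applies the same selection again, now to an
# array of length 100k, so any index >= 100k is out of bounds.
try:
    subset[ind]
except IndexError as err:
    print(err)   # e.g. "index ... is out of bounds for axis 0 with size 100000"
```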

zhonge avatar Mar 26 '23 02:03 zhonge

Ok, I tried this, but I still get the same error using the filtered ctf and pose files...

cryodrgn train_vae cryodrgn_data/preprocessed/80/cryodrgn_particles.80.0.ft.mrcs --ctf cryodrgn_ctf_100.pkl --poses cryodrgn_poses_100.pkl --zdim 8 -n 50 -o test/00_vae_80 --preprocessed --ind ind100k.pkl > test.log
Traceback (most recent call last):
  File "/home/user/software/miniconda3/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/__main__.py", line 72, in main
    args.func(args)
  File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 621, in main
    data = dataset.PreprocessedMRCData(
  File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/dataset.py", line 246, in __init__
    particles = particles[ind]
IndexError: index 100006 is out of bounds for axis 0 with size 100000

EDIT: never mind, I'm an idiot, of course we don't need the --ind flag for train_vae anymore, because we have pre-filtered our particle set. I do get a warning stating that apex.amp is deprecated:

/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use [PyTorch AMP](https://pytorch.org/docs/stable/amp.html)
  warnings.warn(msg, DeprecatedFeatureWarning)

But I guess this can safely be ignored.

olibclarke avatar Mar 26 '23 13:03 olibclarke

Great, glad to hear that it worked!

Thanks for reporting the AMP warning; good to know about the deprecation date. The latest version of cryoDRGN uses the recommended PyTorch AMP library, but we still support (and default to) apex.amp if it is installed.
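For reference, the PyTorch-native pattern that the deprecation warning points to looks roughly like this (a generic sketch of an AMP training step, not cryoDRGN's actual training loop):

```python
import torch

# Toy model and optimizer standing in for the real ones.
model = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

use_cuda = torch.cuda.is_available()
# GradScaler guards against underflow of fp16 gradients; it is a no-op on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(4, 8)
# autocast runs the forward pass in mixed precision where supported.
with torch.autocast(device_type="cuda" if use_cuda else "cpu", enabled=use_cuda):
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()   # plain backward() when scaling is disabled
scaler.step(opt)
scaler.update()
```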

Btw, I just tagged the latest version (with the --ind argument to downsample) as the official v2.2.0 release (things have been stable for a while), so you'll have a version number associated with the version of the code you're using.

zhonge avatar Mar 26 '23 13:03 zhonge