cryodrgn
Select particle subset during preprocess?
Hi,
I have a very large dataset of a membrane protein (1200k particles), refined in cryoSPARC. I would like to test cryodrgn on this dataset, but ideally first on a subset before trying the entire dataset.
Currently, it seems I need to preprocess or downsample the entire dataset before I can select a subset.
Would it be possible to add an option to preprocess or downsample that processes only a random selection of, say, 100k particles? Preprocessing such a large dataset takes a very long time (many hours), so it would be convenient to test out different things on a smaller subset first.
Cheers, Oli
That's a good suggestion. It should be very straightforward to add an --ind flag to cryodrgn downsample or cryodrgn preprocess.
I added an --ind flag to cryodrgn downsample. It looks like there is already an --ind flag for cryodrgn preprocess.
You can generate a random 100k selection with the command:
(cryodrgn) $ cryodrgn_utils select_random -h
usage: cryodrgn_utils select_random [-h] -o O [-n N] [-s S] [--frac FRAC] [--seed SEED] N

Select a random subset of particles

positional arguments:
  N              Total number of particles

options:
  -h, --help     show this help message and exit
  -o O           Output selection (.pkl)
  -n N           Number of particles to select
  -s S           Optionally save out inverted selection (.pkl)
  --frac FRAC    Optionally specify fraction of particles to select
  --seed SEED    Random seed (default: 0)
For example (assuming there are exactly 1.2M particles in your dataset):
(cryodrgn) $ cryodrgn_utils select_random 1200000 -n 100000 -o ind100k.pkl
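Conceptually, the selection file is just a pickled array of particle indices. Here is a minimal Python sketch of the idea, a hypothetical stand-in for select_random using numpy and pickle directly (the real tool's output format may differ):

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for cryodrgn_utils select_random: draw a random,
# sorted subset of particle indices and pickle it. Illustration only; not
# the actual cryodrgn implementation.
def select_random(n_total, n_select, out_path, seed=0):
    rng = np.random.default_rng(seed)
    ind = np.sort(rng.choice(n_total, size=n_select, replace=False))
    with open(out_path, "wb") as f:
        pickle.dump(ind, f)
    return ind

out_path = os.path.join(tempfile.gettempdir(), "ind100k.pkl")
ind = select_random(1200000, 100000, out_path)
print(len(ind))  # 100000 unique indices, each in [0, 1200000)
```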
Thanks Ellen - sorry I overlooked that there was already one for preprocess, you're right it does exist!
Hi Ellen, I tried this, but when I proceed to train_vae using the selected subset, I run into an error:
(cryodrgn) user@ubuntu:~/processing/cryosparc_projects/francesca/P40/J1649$ cryodrgn train_vae cryodrgn_data/preprocessed/80/cryodrgn_particles.80.0.ft.mrcs --ctf cryodrgn_ctf.pkl --poses cryodrgn_poses.pkl --zdim 8 -n 50 -o test/00_vae_80 --preprocessed --ind ind100k.pkl > test.log
Traceback (most recent call last):
File "/home/user/software/miniconda3/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
sys.exit(main())
File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/__main__.py", line 72, in main
args.func(args)
File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 621, in main
data = dataset.PreprocessedMRCData(
File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/dataset.py", line 246, in __init__
particles = particles[ind]
IndexError: index 100006 is out of bounds for axis 0 with size 100000
Am I doing something wrong? The poses and ctf params were generated from the full dataset, but that shouldn't be a problem, right?
ind100k.pkl was generated like so:
cryodrgn_utils select_random 1180486 -n 100000 -o ind100k.pkl
And this was the preprocess command:
cryodrgn preprocess J1649_007_particles.cs -D 80 -o cryodrgn_data/preprocessed/80/cryodrgn_particles.80.mrcs --datadir ../ --ind ind100k.pkl
Ah, I forgot to mention: since the particle stack is now a subset, you'll have to filter the ctf.pkl and pose.pkl by the selection as well before giving them to cryodrgn train_vae. You can use cryodrgn_utils filter_pkl, e.g.:
(cryodrgn) $ cryodrgn_utils filter_pkl ctf.pkl --ind ind100k.pkl -o ctf.100k.pkl
(cryodrgn) $ cryodrgn_utils filter_pkl pose.pkl --ind ind100k.pkl -o pose.100k.pkl
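The filtering itself is conceptually just numpy fancy indexing with the pickled index array. A hedged sketch, assuming the .pkl holds a single per-particle array (the real pose .pkl holds a tuple of arrays, each of which would be filtered the same way):

```python
import numpy as np

# Sketch of what a filter_pkl-style operation does: keep only the rows of a
# per-particle array that are named in the index selection. Illustration
# only; the real cryodrgn_utils filter_pkl also handles tuple-valued pkls.
def filter_by_index(data, ind):
    return data[ind]

ctf = np.arange(1000 * 9, dtype=float).reshape(1000, 9)  # fake CTF params, 1000 particles
ind = np.array([0, 3, 7, 999])                           # fake selection
ctf_sub = filter_by_index(ctf, ind)
print(ctf_sub.shape)  # (4, 9)
```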
The --ind selection given to train_vae is applied to each of the inputs (particles, poses, and CTF data) without any assumption about whether they have already been filtered, so all the inputs have to match. In your case, it's trying to filter the already-filtered cryodrgn_particles.80.mrcs, which is why it's running into an out-of-bounds problem.
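A toy numpy example (with the particle counts from this thread) reproduces the mismatch: indices drawn against the full ~1.18M-particle stack are valid there, but not once the stack has already been cut down to 100k at preprocess time.

```python
import numpy as np

N_FULL, N_SUB = 1180486, 100000
full_stack = np.zeros(N_FULL)           # stands in for the full particle stack
ind = np.array([5, 100006, 1180000])    # selection drawn against the full stack

full_stack[ind]                         # fine: indexing the unfiltered stack

prefiltered = full_stack[:N_SUB]        # stack already subset at preprocess time
try:
    prefiltered[ind]                    # applying the same selection a second time
    raised = False
except IndexError:
    raised = True                       # index 100006 is out of bounds for size 100000
print(raised)  # True
```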
Ok, I tried this, but I still get the same error using the filtered ctf and pose files...
cryodrgn train_vae cryodrgn_data/preprocessed/80/cryodrgn_particles.80.0.ft.mrcs --ctf cryodrgn_ctf_100.pkl --poses cryodrgn_poses_100.pkl --zdim 8 -n 50 -o test/00_vae_80 --preprocessed --ind ind100k.pkl > test.log
Traceback (most recent call last):
File "/home/user/software/miniconda3/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
sys.exit(main())
File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/__main__.py", line 72, in main
args.func(args)
File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 621, in main
data = dataset.PreprocessedMRCData(
File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/dataset.py", line 246, in __init__
particles = particles[ind]
IndexError: index 100006 is out of bounds for axis 0 with size 100000
EDIT: never mind, I'm an idiot; of course we don't need the --ind flag for train_vae anymore, because we have pre-filtered our particle set. I do get a warning stating that apex.amp is deprecated:
/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use [PyTorch AMP](https://pytorch.org/docs/stable/amp.html)
warnings.warn(msg, DeprecatedFeatureWarning)
But I guess this can safely be ignored.
Great, glad to hear that it worked!
Thanks for reporting the AMP warning... good to know about the deprecation date. The latest version of cryoDRGN uses the recommended PyTorch AMP library, but we still support (and default to) apex.amp if it is installed.
Btw, I just tagged the latest version (with the --ind argument to downsample) as the official v2.2.0 release (things have been stable for a while), so you'll have a version number associated with the version of the code you're using.