
Incompatibility with RELION 5 .star files

Open michal-g opened this issue 5 months ago • 0 comments

Thanks. I reinstalled the v3.4.0 pre-release as you suggested, and it does parse the pose data from run_data.star without error.

If I try to move on to the next step:

cryodrgn train_vae Refine3D/job150/run_data_modcryodrgn.star --poses cryodrgn/pose.pkl --encode-mode tilt --dose-per-tilt 1.8 --zdim 8 -n 50 --beta 0.025 -o cryodrgn/

I get the following error:

(INFO) (train_vae.py) (27-Aug-24 14:25:07) /home/sconnell/local/miniconda3/envs/cryodrgn/bin/cryodrgn train_vae Refine3D/job150/run_data_modcryodrgn.star --poses cryodrgn/pose.pkl --encode-mode tilt --dose-per-tilt 1.8 --zdim 8 -n 50 --beta 0.025 -o cryodrgn/
(INFO) (train_vae.py) (27-Aug-24 14:25:07) cryoDRGN 3.4.0b0
(INFO) (train_vae.py) (27-Aug-24 14:25:07) Namespace(particles='/mnt/McQueen-002/sconnell/EMDB/EMPIAR-10860-Ecoli-GOSLAR/relion/Refine3D/job150/run_data_modcryodrgn.star', outdir='/mnt/McQueen-002/sconnell/EMDB/EMPIAR-10860-Ecoli-GOSLAR/relion/cryodrgn', zdim=8, poses='/mnt/McQueen-002/sconnell/EMDB/EMPIAR-10860-Ecoli-GOSLAR/relion/cryodrgn/pose.pkl', ctf=None, load=None, checkpoint=1, log_interval=1000, verbose=False, seed=95994, ind=None, invert_data=True, window=True, window_r=0.85, datadir=None, lazy=False, shuffler_size=0, num_workers=0, max_threads=16, ntilts=10, random_tilts=False, t_emb_dim=64, tlayers=3, tdim=1024, dose_per_tilt=1.8, angle_per_tilt=3, num_epochs=50, batch_size=8, wd=0, lr=0.0001, beta='0.025', beta_control=None, norm=None, amp=True, multigpu=False, do_pose_sgd=False, pretrain=1, emb_type='quat', pose_lr=0.0003, qlayers=3, qdim=1024, encode_mode='tilt', enc_mask=None, use_real=False, players=3, pdim=1024, pe_type='gaussian', feat_sigma=0.5, pe_dim=None, domain='fourier', activation='relu', func=<function main at 0x7750c1022dc0>)
(INFO) (train_vae.py) (27-Aug-24 14:25:07) Use cuda True
(INFO) (train_vae.py) (27-Aug-24 14:25:07) Loading dataset from /mnt/McQueen-002/sconnell/EMDB/EMPIAR-10860-Ecoli-GOSLAR/relion/Refine3D/job150/run_data_modcryodrgn.star
Traceback (most recent call last):
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
    sys.exit(main_commands())
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/command_line.py", line 80, in main_commands
    _get_commands(
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/command_line.py", line 75, in _get_commands
    args.func(args)
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 698, in main
    data = dataset.TiltSeriesData(  # FIXME: maybe combine with above?
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/dataset.py", line 150, in __init__
    super().__init__(tiltstar, ind=ind, **kwargs)
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/dataset.py", line 49, in __init__
    self.src = ImageSource.from_file(
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/source.py", line 141, in from_file
    return StarfileSource(
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/source.py", line 641, in __init__
    sdata[["__mrc_index", "__mrc_filename"]] = sdata["_rlnImageName"].str.split(
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/pandas/core/frame.py", line 3968, in __setitem__
    self._setitem_array(key, value)
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/pandas/core/frame.py", line 4010, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/home/sconnell/local/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/pandas/core/indexers/utils.py", line 401, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

Setting --datadir in the cryodrgn train_vae command does not help. Note that I did not downsample the data, as I am using 96x96 2D projections.

I think the issue is that in run_data.star my "particles" have _rlnImageName values like: Extract/job222/Subtomograms/ts-69/12_stack2d.mrcs. That is, each tilt series has its own directory and each particle has its own stack. I think source.py tries to get the index and filename by splitting _rlnImageName on the @ symbol, and these names do not contain one.
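To illustrate why names without an @ would produce exactly this ValueError, here is a minimal sketch using pandas only. The arguments passed to str.split are an assumption, since the traceback cuts off the real call in source.py, and the second file name is made up for the example.

```python
import pandas as pd

# Hypothetical two-row table mimicking the RELION 5 style names (no "@").
# The second name is invented for illustration.
sdata = pd.DataFrame({
    "_rlnImageName": [
        "Extract/job222/Subtomograms/ts-69/12_stack2d.mrcs",
        "Extract/job222/Subtomograms/ts-69/13_stack2d.mrcs",
    ]
})

# Assumed arguments: the traceback truncates the actual call in source.py.
parts = sdata["_rlnImageName"].str.split("@", n=1, expand=True)
n_parts = parts.shape[1]  # only one column, because no name contains "@"

try:
    # Two target columns, but only one column of split results.
    sdata[["__mrc_index", "__mrc_filename"]] = parts
    error_message = None
except ValueError as err:
    error_message = str(err)

print(n_parts, error_message)
```

With names of the form index@stack.mrcs the split would give two columns and the assignment would go through.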

Am I extracting the 2D particle tilt series images wrong in Relion5 to get these different file names? If it helps, I can share the required files. Is it better to pass the output of Relion5 through M to get it in a better format?

If I am not making a mistake somewhere, maybe it would be better to have a "pre-processing" script that brings the Relion5 files more in line with what cryoDRGN is expecting.
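As a rough idea of what such a pre-processing step might do, the sketch below expands one per-particle stack name into index@stack entries, the 1-based convention used in RELION 2D particle .star files. The helper expand_image_name and its n_images argument are hypothetical: the image count would have to come from the stack header, and whether cryoDRGN needs one row per tilt image in this form is an assumption, not something confirmed here.

```python
def expand_image_name(stack_path: str, n_images: int) -> list[str]:
    """Return one 'index@stack' entry per image in a stack (1-based indices).

    Hypothetical helper: n_images is supplied by the caller and would in
    practice be read from the header of the .mrcs stack.
    """
    if "@" in stack_path:
        raise ValueError(f"name already has an index: {stack_path}")
    if n_images < 1:
        raise ValueError("n_images must be at least 1")
    return [f"{i:06d}@{stack_path}" for i in range(1, n_images + 1)]


# Example with the stack name from the .star file and an assumed 3 tilt images.
entries = expand_image_name(
    "Extract/job222/Subtomograms/ts-69/12_stack2d.mrcs", 3
)
for entry in entries:
    print(entry)
```

The other per-particle columns (poses, CTF, and so on) would need to be repeated or adjusted for each new row, which this sketch does not attempt.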

Originally posted by @frozenfas in https://github.com/ml-struct-bio/cryodrgn/discussions/394#discussioncomment-10462883

michal-g, Sep 11 '24 03:09