cryodrgn Error when running cryodrgn analyze (maybe error when model saving?)

Describe the bug
After training with cryodrgn abinit_het xxx and then analyzing with cryodrgn analyze xxx, the following error occurs:

Traceback (most recent call last):
  File "/opt/conda/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/__main__.py", line 72, in main
    args.func(args)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 256, in main
    analyze_zN(
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 113, in analyze_zN
    vg.gen_volumes(f"{outdir}/pc{i+1}", z_pc)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 218, in gen_volumes
    analysis.gen_volumes(self.weights, self.config, zfile, outdir, **self.vol_args)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/analysis.py", line 577, in gen_volumes
    return eval_vol.main(args)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/eval_vol.py", line 170, in main
    model, lattice = HetOnlyVAE.load(cfg, args.weights, device=device)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/models.py", line 124, in load
    model.load_state_dict(ckpt["model_state_dict"])
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for HetOnlyVAE:
    Missing key(s) in state_dict: "encoder.main.0.weight", "encoder.main.0.bias", "encoder.main.2.linear.weight", "encoder.main.2.linear.bias", "encoder.main.4.linear.weight", "encoder.main.4.linear.bias", "encoder.main.6.linear.weight", "encoder.main.6.linear.bias", "encoder.main.8.weight", "encoder.main.8.bias", "decoder.rand_freqs", "decoder.decoder.main.0.weight", "decoder.decoder.main.0.bias", "decoder.decoder.main.2.linear.weight", "decoder.decoder.main.2.linear.bias", "decoder.decoder.main.4.linear.weight", "decoder.decoder.main.4.linear.bias", "decoder.decoder.main.6.linear.weight", "decoder.decoder.main.6.linear.bias", "decoder.decoder.main.8.weight", "decoder.decoder.main.8.bias".
    Unexpected key(s) in state_dict: "module.encoder.main.0.weight", "module.encoder.main.0.bias", "module.encoder.main.2.linear.weight", "module.encoder.main.2.linear.bias", "module.encoder.main.4.linear.weight", "module.encoder.main.4.linear.bias", "module.encoder.main.6.linear.weight", "module.encoder.main.6.linear.bias", "module.encoder.main.8.weight", "module.encoder.main.8.bias", "module.decoder.rand_freqs", "module.decoder.decoder.main.0.weight", "module.decoder.decoder.main.0.bias", "module.decoder.decoder.main.2.linear.weight", "module.decoder.decoder.main.2.linear.bias", "module.decoder.decoder.main.4.linear.weight", "module.decoder.decoder.main.4.linear.bias", "module.decoder.decoder.main.6.linear.weight", "module.decoder.decoder.main.6.linear.bias", "module.decoder.decoder.main.8.weight", "module.decoder.decoder.main.8.bias".

To Reproduce
Use cryodrgn analyze to analyze the output of an abinit_het training run.

Expected behavior
The "module." prefix in the state_dict keys should be handled when the model is loaded.

Mar 05 '23 05:03 xiazeqing

Can you include which version of the software you are running?

Mar 06 '23 17:03 zhonge

This seems related to: https://github.com/zhonge/cryodrgn/pull/190 @xiazeqing - perhaps the model checkpoint was saved while using the --multigpu flag? In that case, make sure that --multigpu is specified when reloading the checkpointed model too.
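A quick way to check this is to compare the two key lists in the error. A "module." prefix on every key is what one would typically see when a model was wrapped (for example in torch.nn.DataParallel) before its state dict was saved. The sketch below is plain Python and assumes only that a state dict is a mapping keyed by parameter name, as the traceback suggests. The helper names diff_keys and only_prefix_differs are hypothetical and not part of cryodrgn.

```python
from typing import Iterable, Set, Tuple

PREFIX = "module."  # prefix seen on the "unexpected" keys in the traceback


def diff_keys(expected: Iterable[str], found: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected) key sets, mirroring the error message."""
    expected_set, found_set = set(expected), set(found)
    return expected_set - found_set, found_set - expected_set


def only_prefix_differs(expected: Iterable[str], found: Iterable[str],
                        prefix: str = PREFIX) -> bool:
    """True if the found keys are exactly the expected keys with `prefix` prepended."""
    found_list = list(found)
    if not all(key.startswith(prefix) for key in found_list):
        return False
    stripped = {key[len(prefix):] for key in found_list}
    return stripped == set(expected)


# A few key names copied from the traceback (not the full list).
expected = ["encoder.main.0.weight", "encoder.main.0.bias", "decoder.rand_freqs"]
found = [PREFIX + key for key in expected]

missing, unexpected = diff_keys(expected, found)
print(sorted(missing))
print(sorted(unexpected))
print(only_prefix_differs(expected, found))
```

If only_prefix_differs returns True for the keys the model expects and the keys stored in the checkpoint, the weights themselves are probably intact and only the naming differs.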

Mar 06 '23 17:03 vineetbansal

> Can you include which version of the software you are running?

I tried both 2.0.0 and 2.1.0.

> This seems related to: #190 @xiazeqing - perhaps the model checkpoint was saved while using the --multigpu flag? In that case, make sure that --multigpu is specified when reloading the checkpointed model too.

Thanks, I'll try it later. But I have two questions:

1. Without the --multigpu flag, train_vae still works well.
2. There is no place to pass --multigpu when running inference from Python, for example with cryodrgn.analysis.eval_volumes().

Thank you!

Mar 08 '23 03:03 xiazeqing

@xiazeqing - thanks for the details. I can confirm that I'm able to reproduce the error (on the last commit in the master branch). The scenario to reproduce this seems to be:

1. Run abinit_het --multigpu ... in a situation where you do have multiple GPUs.
2. Try to run analyze on the output.

We'll investigate this deeper and have a fix in a couple of days.
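In the meantime, one possible workaround is to rewrite the checkpoint so that its keys match what the unwrapped model expects. The following is a minimal sketch, not the fix that will land in cryodrgn. It assumes the checkpoint is a dict with a "model_state_dict" entry (as the traceback shows) and that the only mismatch is the "module." prefix. The function names strip_prefix and unwrap_checkpoint are hypothetical, and the stand-in checkpoint uses lists where a real one would hold tensors.

```python
from typing import Any, Dict, Mapping

PREFIX = "module."


def strip_prefix(state_dict: Mapping[str, Any], prefix: str = PREFIX) -> Dict[str, Any]:
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def unwrap_checkpoint(ckpt: Mapping[str, Any]) -> Dict[str, Any]:
    """Shallow-copy a checkpoint dict and normalize the keys under "model_state_dict"."""
    fixed = dict(ckpt)
    fixed["model_state_dict"] = strip_prefix(ckpt["model_state_dict"])
    return fixed


# Stand-in for a loaded checkpoint; real values would be tensors.
ckpt = {
    "epoch": 0,  # hypothetical extra entry, only here to show it is preserved
    "model_state_dict": {
        "module.encoder.main.0.weight": [0.0],
        "module.decoder.rand_freqs": [1.0],
    },
}
fixed = unwrap_checkpoint(ckpt)
print(list(fixed["model_state_dict"]))
```

With a real checkpoint, one would load it with torch.load, apply unwrap_checkpoint, and write the result to a new file with torch.save before pointing analyze at it. Recent PyTorch versions also provide torch.nn.modules.utils.consume_prefix_in_state_dict_if_present, which strips a prefix from a state dict in place. Whether anything else in the checkpoint needs adjusting is not established in this thread.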

Mar 09 '23 18:03 vineetbansal