Error when running cryodrgn analyze (possibly an error when saving the model?)
Describe the bug
After training with
cryodrgn abinit_het xxx
and then running
cryodrgn analyze xxx
the following error occurs:
Traceback (most recent call last):
  File "/opt/conda/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/__main__.py", line 72, in main
    args.func(args)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 256, in main
    analyze_zN(
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 113, in analyze_zN
    vg.gen_volumes(f"{outdir}/pc{i+1}", z_pc)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 218, in gen_volumes
    analysis.gen_volumes(self.weights, self.config, zfile, outdir, **self.vol_args)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/analysis.py", line 577, in gen_volumes
    return eval_vol.main(args)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/eval_vol.py", line 170, in main
    model, lattice = HetOnlyVAE.load(cfg, args.weights, device=device)
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/models.py", line 124, in load
    model.load_state_dict(ckpt["model_state_dict"])
  File "/opt/conda/envs/cryodrgn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for HetOnlyVAE:
	Missing key(s) in state_dict: "encoder.main.0.weight", "encoder.main.0.bias", "encoder.main.2.linear.weight", "encoder.main.2.linear.bias", "encoder.main.4.linear.weight", "encoder.main.4.linear.bias", "encoder.main.6.linear.weight", "encoder.main.6.linear.bias", "encoder.main.8.weight", "encoder.main.8.bias", "decoder.rand_freqs", "decoder.decoder.main.0.weight", "decoder.decoder.main.0.bias", "decoder.decoder.main.2.linear.weight", "decoder.decoder.main.2.linear.bias", "decoder.decoder.main.4.linear.weight", "decoder.decoder.main.4.linear.bias", "decoder.decoder.main.6.linear.weight", "decoder.decoder.main.6.linear.bias", "decoder.decoder.main.8.weight", "decoder.decoder.main.8.bias".
	Unexpected key(s) in state_dict: "module.encoder.main.0.weight", "module.encoder.main.0.bias", "module.encoder.main.2.linear.weight", "module.encoder.main.2.linear.bias", "module.encoder.main.4.linear.weight", "module.encoder.main.4.linear.bias", "module.encoder.main.6.linear.weight", "module.encoder.main.6.linear.bias", "module.encoder.main.8.weight", "module.encoder.main.8.bias", "module.decoder.rand_freqs", "module.decoder.decoder.main.0.weight", "module.decoder.decoder.main.0.bias", "module.decoder.decoder.main.2.linear.weight", "module.decoder.decoder.main.2.linear.bias", "module.decoder.decoder.main.4.linear.weight", "module.decoder.decoder.main.4.linear.bias", "module.decoder.decoder.main.6.linear.weight", "module.decoder.decoder.main.6.linear.bias", "module.decoder.decoder.main.8.weight", "module.decoder.decoder.main.8.bias".
To Reproduce
Run
cryodrgn analyze
on the output of an abinit_het training run.
Expected behavior
The checkpoint should load; the "module." prefix in the state_dict keys should be handled (added or stripped as needed).
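For reference, the key mismatch can be confirmed by inspecting the saved checkpoint directly. This is a minimal sketch, assuming the checkpoint is the weights.pkl file in the training output directory (the "model_state_dict" entry is the one shown in the traceback above); the path is a placeholder to adjust for your run:

```python
import torch

# Path to the checkpoint written by abinit_het (path assumed; adjust to your run).
ckpt = torch.load("abinit_het_output/weights.pkl", map_location="cpu")

# Every key carries a "module." prefix that HetOnlyVAE.load does not expect.
for key in list(ckpt["model_state_dict"])[:5]:
    print(key)  # e.g. module.encoder.main.0.weight
```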
Can you include which version of the software you are running?
This seems related to: https://github.com/zhonge/cryodrgn/pull/190
@xiazeqing - perhaps the model checkpoint was saved while using the --multigpu flag? In that case, make sure that --multigpu is specified when reloading the checkpointed model too.
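For context, --multigpu presumably wraps the model in torch.nn.DataParallel, and DataParallel prefixes every state_dict key with "module.". A quick standalone illustration (not cryodrgn code):

```python
import torch.nn as nn

net = nn.Linear(4, 2)
print(list(net.state_dict()))      # ['weight', 'bias']

# DataParallel registers the original model as a submodule named "module",
# so every saved parameter key gains a "module." prefix.
wrapped = nn.DataParallel(net)
print(list(wrapped.state_dict()))  # ['module.weight', 'module.bias']
```

That is why a checkpoint saved with --multigpu only matches the expected keys when the model is wrapped the same way at load time.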
Can you include which version of the software you are running?
Both 2.0.0 and 2.1.0 were tried.
This seems related to: #190 @xiazeqing - perhaps the model checkpoint was saved while using the --multigpu flag? In that case, make sure that --multigpu is specified when reloading the checkpointed model too.
Thanks, I'll try it later. But I have two questions:
- Without the --multigpu flag, train_vae still works well.
- There is no way to pass --multigpu when inferring volumes in Python, for example with cryodrgn.analysis.eval_volumes().
Thank you!
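Until a proper fix lands, one possible workaround is to strip the prefix once and re-save the weights, after which the checkpoint should load without --multigpu. This is a sketch under the assumption that the checkpoint layout matches the traceback above (a dict with a "model_state_dict" entry); the file paths are placeholders:

```python
import torch

src = "abinit_het_output/weights.pkl"         # checkpoint saved during --multigpu training (path assumed)
dst = "abinit_het_output/weights_single.pkl"  # prefix-free copy for single-GPU loading

ckpt = torch.load(src, map_location="cpu")
ckpt["model_state_dict"] = {
    key.removeprefix("module."): value        # str.removeprefix requires Python 3.9+
    for key, value in ckpt["model_state_dict"].items()
}
torch.save(ckpt, dst)
```

Swapping the re-saved file in place of the original (after backing it up) should then let analyze and the Python API load it on a single GPU.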
@xiazeqing - thanks for the details. I can confirm that I'm able to reproduce the error (on the last commit in the master branch). The scenario to reproduce this seems to be:
- Run abinit_het --multigpu ... in a situation where you do have multiple GPUs.
- Try to run analyze on the output.
We'll investigate this deeper and have a fix in a couple of days.
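In the meantime, the fix will likely amount to tolerating the prefix at load time. A hedged sketch of that idea (not the actual patch), which could be applied around the model.load_state_dict(ckpt["model_state_dict"]) call shown in the traceback:

```python
def load_state_dict_flexible(model, state_dict):
    """Load a state_dict saved either with or without DataParallel's "module." prefix."""
    if any(key.startswith("module.") for key in state_dict):
        state_dict = {key.removeprefix("module."): value for key, value in state_dict.items()}
    model.load_state_dict(state_dict)
```

Recent PyTorch versions also provide torch.nn.modules.utils.consume_prefix_in_state_dict_if_present, which removes such a prefix in place.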