
RuntimeError: KLD is nan


Hello,

I am trying to run cryodrgn on my own dataset of 532,539 particles, which have been cleaned and globally aligned in CryoSPARC. The original box size was 256, and I downsampled to 128 for training. Training fails with 'RuntimeError: KLD is nan', as shown at the end of the log below. I confirmed that the conversion was done correctly by checking a backprojected map. I also see several warning messages, but I am not sure whether they are directly related to the error.

The command I used for training is:

cryodrgn train_vae particles_128.mrcs --ctf ctf.pkl --poses pose.pkl --zdim 8 -n 25 --enc-dim 256 --enc-layers 3 --dec-dim 256 --dec-layers 3 --multigpu -o training03_128_vae

Does anyone have experience with this issue or know how to solve it?

(INFO) (dataset.py) (05-Sep-23 17:56:58) Loaded 532539 128x128 images
(INFO) (dataset.py) (05-Sep-23 17:56:58) Windowing images with radius 0.85
(INFO) (dataset.py) (05-Sep-23 17:57:00) Computing FFT
(INFO) (dataset.py) (05-Sep-23 17:57:00) Spawning 16 processes
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/dataset.py:191: RuntimeWarning: overflow encountered in cast
  particles = pp.asarray(
(INFO) (dataset.py) (05-Sep-23 17:59:55) Symmetrizing image data
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:181: RuntimeWarning: overflow encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:181: RuntimeWarning: invalid value encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:215: RuntimeWarning: overflow encountered in reduce
  arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:215: RuntimeWarning: invalid value encountered in reduce
  arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
(INFO) (dataset.py) (05-Sep-23 18:00:44) Normalized HT by 0 +/- nan
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Loading ctf params from /goliath/processing/kimjunh/cryodrgn/ctf.pkl
(INFO) (ctf.py) (05-Sep-23 18:00:46) Image size (pix) : 128
(INFO) (ctf.py) (05-Sep-23 18:00:46) A/pix : 3.319999933242798
(INFO) (ctf.py) (05-Sep-23 18:00:46) DefocusU (A) : 10656.8232421875
(INFO) (ctf.py) (05-Sep-23 18:00:46) DefocusV (A) : 9715.7421875
(INFO) (ctf.py) (05-Sep-23 18:00:46) Dfang (deg) : 27.725799560546875
(INFO) (ctf.py) (05-Sep-23 18:00:46) voltage (kV) : 300.0
(INFO) (ctf.py) (05-Sep-23 18:00:46) cs (mm) : 2.700000047683716
(INFO) (ctf.py) (05-Sep-23 18:00:46) w : 0.10000000149011612
(INFO) (ctf.py) (05-Sep-23 18:00:46) Phase shift (deg) : 0.0
(INFO) (train_vae.py) (05-Sep-23 18:00:46) HetOnlyVAE(
  (encoder): ResidLinearMLP(
    (main): Sequential(
      (0): MyLinear(in_features=12852, out_features=256, bias=True)
      (1): ReLU()
      (2): ResidLinear(
        (linear): MyLinear(in_features=256, out_features=256, bias=True)
      )
      (3): ReLU()
      (4): ResidLinear(
        (linear): MyLinear(in_features=256, out_features=256, bias=True)
      )
      (5): ReLU()
      (6): ResidLinear(
        (linear): MyLinear(in_features=256, out_features=256, bias=True)
      )
      (7): ReLU()
      (8): MyLinear(in_features=256, out_features=16, bias=True)
    )
  )
  (decoder): FTPositionalDecoder(
    (decoder): ResidLinearMLP(
      (main): Sequential(
        (0): MyLinear(in_features=392, out_features=256, bias=True)
        (1): ReLU()
        (2): ResidLinear(
          (linear): MyLinear(in_features=256, out_features=256, bias=True)
        )
        (3): ReLU()
        (4): ResidLinear(
          (linear): MyLinear(in_features=256, out_features=256, bias=True)
        )
        (5): ReLU()
        (6): ResidLinear(
          (linear): MyLinear(in_features=256, out_features=256, bias=True)
        )
        (7): ReLU()
        (8): MyLinear(in_features=256, out_features=2, bias=True)
      )
    )
  )
)
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 3790354 parameters in model
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 3491856 parameters in encoder
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 298498 parameters in decoder
(WARNING) (train_vae.py) (05-Sep-23 18:00:46) Warning: Masked input image dimension is not a mutiple of 8 -- AMP training speedup is not optimized
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Using 4 GPUs!
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Increasing batch size to 32
(INFO) (train_vae.py) (05-Sep-23 18:00:56) tensor([nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16, grad_fn=<SelectBackward0>)
(INFO) (train_vae.py) (05-Sep-23 18:00:56) tensor([nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16, grad_fn=<SelectBackward0>)
Traceback (most recent call last):
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/main.py", line 72, in main
    args.func(args)
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 836, in main
    loss, gen_loss, kld = train_batch(
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 331, in train_batch
    loss, gen_loss, kld = loss_function(
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 425, in loss_function
    raise RuntimeError("KLD is nan")
RuntimeError: KLD is nan

JunhoeK2 avatar Sep 06 '23 01:09 JunhoeK2

Hello! Have you solved this issue? I ran into the same runtime error, "KLD is nan", while trying heterogeneous reconstruction on the tutorial dataset EMPIAR-10049.

MatthewFu2001 avatar Sep 20 '24 07:09 MatthewFu2001

Hi all, we are still trying to track down the root cause of this issue; see (potentially) related threads such as #136, #18, and #346. In the meantime, can you try running without the --multigpu flag, and/or try a smaller training batch size (-b 1 or -b 2), to make sure this is not an issue with running out of memory?
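As a rough illustration (not a verified fix), a single-GPU rerun of the command from the original post with a small batch size might look like the following; everything is carried over from that command except the added -b value, and the output directory name is just a placeholder:

cryodrgn train_vae particles_128.mrcs --ctf ctf.pkl --poses pose.pkl --zdim 8 -n 25 --enc-dim 256 --enc-layers 3 --dec-dim 256 --dec-layers 3 -b 2 -o training03_128_vae_singlegpu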

Also, can you try changing the number of latent dimensions (--zdim) to see whether the problem is specific to this particular model or points to an issue with the input data?
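For example, a sketch of such a run, keeping the rest of the original command and only changing --zdim (the value of 1 and the output directory name are just illustrative):

cryodrgn train_vae particles_128.mrcs --ctf ctf.pkl --poses pose.pkl --zdim 1 -n 25 --enc-dim 256 --enc-layers 3 --dec-dim 256 --dec-layers 3 -b 2 -o training03_128_vae_z1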

michal-g avatar Sep 23 '24 15:09 michal-g