RuntimeError: KLD is nan
Hello,
I am trying to run cryodrgn on my own dataset of 532,539 particles, which were largely cleaned and globally aligned in CryoSPARC. The original box size was 256, and the particles were downsampled to 128 for training. Training fails with 'RuntimeError: KLD is nan', as shown at the end of the log below. I confirmed that the conversion was done correctly by checking the backprojected map. There are also several warning messages, but I am not sure whether they are directly related to the error.
The command I ran is below:
cryodrgn train_vae particles_128.mrcs --ctf ctf.pkl --poses pose.pkl --zdim 8 -n 25 --enc-dim 256 --enc-layers 3 --dec-dim 256 --dec-layers 3 --multigpu -o training03_128_vae
Does anyone have experience with this issue or know how to solve it?
(INFO) (dataset.py) (05-Sep-23 17:56:58) Loaded 532539 128x128 images
(INFO) (dataset.py) (05-Sep-23 17:56:58) Windowing images with radius 0.85
(INFO) (dataset.py) (05-Sep-23 17:57:00) Computing FFT
(INFO) (dataset.py) (05-Sep-23 17:57:00) Spawning 16 processes
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/dataset.py:191: RuntimeWarning: overflow encountered in cast
particles = pp.asarray(
(INFO) (dataset.py) (05-Sep-23 17:59:55) Symmetrizing image data
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:181: RuntimeWarning: overflow encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:181: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:215: RuntimeWarning: overflow encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:215: RuntimeWarning: invalid value encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
(INFO) (dataset.py) (05-Sep-23 18:00:44) Normalized HT by 0 +/- nan
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Loading ctf params from /goliath/processing/kimjunh/cryodrgn/ctf.pkl
(INFO) (ctf.py) (05-Sep-23 18:00:46) Image size (pix) : 128
(INFO) (ctf.py) (05-Sep-23 18:00:46) A/pix : 3.319999933242798
(INFO) (ctf.py) (05-Sep-23 18:00:46) DefocusU (A) : 10656.8232421875
(INFO) (ctf.py) (05-Sep-23 18:00:46) DefocusV (A) : 9715.7421875
(INFO) (ctf.py) (05-Sep-23 18:00:46) Dfang (deg) : 27.725799560546875
(INFO) (ctf.py) (05-Sep-23 18:00:46) voltage (kV) : 300.0
(INFO) (ctf.py) (05-Sep-23 18:00:46) cs (mm) : 2.700000047683716
(INFO) (ctf.py) (05-Sep-23 18:00:46) w : 0.10000000149011612
(INFO) (ctf.py) (05-Sep-23 18:00:46) Phase shift (deg) : 0.0
(INFO) (train_vae.py) (05-Sep-23 18:00:46) HetOnlyVAE(
(encoder): ResidLinearMLP(
(main): Sequential(
(0): MyLinear(in_features=12852, out_features=256, bias=True)
(1): ReLU()
(2): ResidLinear(
(linear): MyLinear(in_features=256, out_features=256, bias=True)
)
(3): ReLU()
(4): ResidLinear(
(linear): MyLinear(in_features=256, out_features=256, bias=True)
)
(5): ReLU()
(6): ResidLinear(
(linear): MyLinear(in_features=256, out_features=256, bias=True)
)
(7): ReLU()
(8): MyLinear(in_features=256, out_features=16, bias=True)
)
)
(decoder): FTPositionalDecoder(
(decoder): ResidLinearMLP(
(main): Sequential(
(0): MyLinear(in_features=392, out_features=256, bias=True)
(1): ReLU()
(2): ResidLinear(
(linear): MyLinear(in_features=256, out_features=256, bias=True)
)
(3): ReLU()
(4): ResidLinear(
(linear): MyLinear(in_features=256, out_features=256, bias=True)
)
(5): ReLU()
(6): ResidLinear(
(linear): MyLinear(in_features=256, out_features=256, bias=True)
)
(7): ReLU()
(8): MyLinear(in_features=256, out_features=2, bias=True)
)
)
)
)
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 3790354 parameters in model
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 3491856 parameters in encoder
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 298498 parameters in decoder
(WARNING) (train_vae.py) (05-Sep-23 18:00:46) Warning: Masked input image dimension is not a mutiple of 8 -- AMP training speedup is not optimized
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Using 4 GPUs!
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Increasing batch size to 32
(INFO) (train_vae.py) (05-Sep-23 18:00:56) tensor([nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0',
dtype=torch.float16, grad_fn=<SelectBackward0>)
(INFO) (train_vae.py) (05-Sep-23 18:00:56) tensor([nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0',
dtype=torch.float16, grad_fn=<SelectBackward0>)
Traceback (most recent call last):
File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/bin/cryodrgn", line 8, in
Hello! Have you solved this issue? I hit the same runtime error, "KLD is nan", when running heterogeneous reconstruction on the tutorial dataset EMPIAR-10049.
Hi all, we are still trying to track down the root cause of this issue; see the (potentially) related threads #136, #18, and #346. In the meantime, can you try running without the --multigpu flag and/or with a smaller training batch size (-b 1 or -b 2), to make sure this is not caused by running out of memory?
Also, can you try changing the number of latent dimensions (--zdim) to see whether this is specific to this particular model or potentially an issue with the input data?
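One more thing that may be worth checking on the input-data side: the log in the original post shows "overflow encountered in cast" followed by "Normalized HT by 0 +/- nan", which suggests (though does not prove) that some values became non-finite before training started. The sketch below is not cryodrgn code. It is a standard-library illustration, with hypothetical helper names (`count_nonfinite`, `mean_and_std`) and made-up pixel values, of how a single infinite entry is enough to turn a naively computed standard deviation into nan.

```python
import math


def count_nonfinite(values):
    """Return how many entries are NaN or +/-inf."""
    return sum(1 for v in values if not math.isfinite(v))


def mean_and_std(values):
    """Population mean and standard deviation, computed the naive way."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


# Hypothetical pixel values standing in for a flattened particle stack.
clean = [0.5, -1.25, 2.0, 0.75]
# The same values plus one entry that overflowed to infinity,
# as an overflowing cast might produce (an assumption, not a diagnosis).
corrupted = clean + [float("inf")]

# Clean data: no non-finite entries, finite mean and std.
print(count_nonfinite(clean), mean_and_std(clean))
# Corrupted data: the mean becomes inf, and inf - inf gives nan in the variance.
print(count_nonfinite(corrupted), mean_and_std(corrupted))
```

Running the same kind of count on the real particle stack would require loading the images into an array first, which is not shown here. A nonzero count of non-finite values, or a few extremely large ones, would point toward the input data rather than the model settings.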