Encountered NaN under the ConditionalDecoderVISEM setting
Hello,
Thanks for open-sourcing this beautiful project. I am currently trying to replicate the results in Table 1 of the paper. However, while training the Conditional model, I got an error message stating that a NaN loss value was encountered after 222 epochs of training. Just to mention, I performed a clean installation of the necessary libraries and used the Morpho-MNIST data creation script you provided. Could there be something wrong with the calculation of the ELBO term for p(intensity), since that is the only metric that has gone to NaN?
Steps to reproduce the behavior:
python -m deepscm.experiments.morphomnist.trainer -e SVIExperiment -m ConditionalDecoderVISEM --data_dir data/morphomnist/ --default_root_dir checkpoints/ --decoder_type fixed_var --gpus 0
Full error message:
Epoch 222: 6%|▋ | 15/251 [00:01<00:28, 8.31it/s, loss=952243.375, v_num=2]/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pyro/infer/tracegraph_elbo.py:261: UserWarning: Encountered NaN: loss
warn_if_nan(loss, "loss")
Traceback (most recent call last):
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ilkay/Documents/caner/deepscm/deepscm/experiments/morphomnist/trainer.py", line 62, in
Environment
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Quadro RTX 6000
Nvidia driver version: 440.100
cuDNN version: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-lightning==0.7.6
[pip3] torch==1.7.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.1.0 py37h23d657b_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] numpy 1.19.4 pypi_0 pypi
[conda] pytorch-lightning 0.7.6 pypi_0 pypi
[conda] torch 1.7.1 pypi_0 pypi
[conda] torchvision 0.6.1 py37_cu102 pytorch
Thanks for your interest in our paper and the code. I've just gotten around to updating the dependencies, as it seems you're using a newer PyTorch version than the one we were using.
I'll get back to you once I've updated everything and have been able to run the experiments with the new versions.
In the meantime, are you using the fixed pyro version? You can also experiment with different (lower) learning rates via --pgm_lr 1e-3 and --lr 5e-5, or get more detailed error messages by using the --validate flag.
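For example, the reproduce command from above could be rerun with lowered learning rates (the values here are only a suggestion to try):

python -m deepscm.experiments.morphomnist.trainer -e SVIExperiment -m ConditionalDecoderVISEM --data_dir data/morphomnist/ --default_root_dir checkpoints/ --decoder_type fixed_var --gpus 0 --pgm_lr 1e-3 --lr 5e-5 --validate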
However, it seems that only 'log p(intensity)': tensor(nan, device='cuda:0', grad_fn=) went to NaN. It's something we observed sometimes as well, and lowering the pgm_lr usually helped :)
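If it helps to see which term trips first in future runs, a minimal, hypothetical check (not part of deepscm) over whatever per-term metrics you log could look like this:

import torch

def warn_on_nan_terms(metrics):
    # Print a warning for every logged value that already contains NaNs,
    # so the offending ELBO term (e.g. 'log p(intensity)') can be spotted
    # before the total loss itself turns into NaN.
    for name, value in metrics.items():
        if torch.isnan(torch.as_tensor(value)).any():
            print(f"NaN encountered in term: {name}")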
Hello,
Thanks for your suggestion; the model trained successfully after I reduced the PGM learning rate. Just to understand it clearly: the pgm_lr hyperparameter affects only the spline layer of the thickness flow and the affine transformation layer of the intensity flow, right? I will take a look at that paper as soon as possible.
Meanwhile, I have tried to train the normalizing flow models (all three settings), but I noticed that they still have a tendency to go to NaN after only a couple of epochs with a learning rate of 10^-4. I am now trying a learning rate of 10^-5, but I don't know whether that will solve the problem.
BTW, I was using the pyro version you suggested (1.3.1+4b2752f8), but I will update the repository right after completing that training attempt.
Edit: the normalizing flow experiments are still failing.
Oh wait - so you're training a flow-only model?
It's only included in the code here for completeness; we also ran into NaNs when training with flows only, which is why we settled on the VI solution.
As for pgm_lr: yes, it only acts on the flows for the covariates and not on the image components.
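For illustration only (a sketch under the assumption of a standard torch.optim setup; the module names are hypothetical and the actual deepscm wiring may differ), two learning rates like this are typically realized with separate optimizer parameter groups:

import torch
import torch.nn as nn

# Hypothetical stand-ins for the covariate flows and the image model.
thickness_flow = nn.Linear(1, 1)
intensity_flow = nn.Linear(1, 1)
image_vae = nn.Sequential(nn.Linear(32, 64), nn.Linear(64, 32))

optimizer = torch.optim.Adam([
    # Covariate flows (thickness spline + intensity affine), controlled by --pgm_lr.
    {"params": list(thickness_flow.parameters()) + list(intensity_flow.parameters()), "lr": 1e-3},
    # Image encoder/decoder components, controlled by --lr.
    {"params": image_vae.parameters(), "lr": 5e-5},
])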