
Encountered NaN under the ConditionalDecoderVISEM setting

Open · canerozer opened this issue 4 years ago · 3 comments

Hello,

Thanks for open-sourcing this beautiful project. I am currently trying to replicate the results in Table 1 of the paper. However, while training the Conditional model, I got the error message below stating that a NaN loss value was encountered after training the model for 222 epochs. Just to mention, I performed a clean installation of the necessary libraries and used the Morpho-MNIST data creation script you provided. Could there be something wrong with the calculation of the ELBO term for p(intensity), since that is the only metric that went to NaN?

Steps to reproduce the behavior:

python -m deepscm.experiments.morphomnist.trainer -e SVIExperiment -m ConditionalDecoderVISEM --data_dir data/morphomnist/ --default_root_dir checkpoints/ --decoder_type fixed_var --gpus 0

Full error message:

Epoch 222: 6%|▋ | 15/251 [00:01<00:28, 8.31it/s, loss=952243.375, v_num=2]
/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pyro/infer/tracegraph_elbo.py:261: UserWarning: Encountered NaN: loss
  warn_if_nan(loss, "loss")
Traceback (most recent call last):
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ilkay/Documents/caner/deepscm/deepscm/experiments/morphomnist/trainer.py", line 62, in <module>
    trainer.fit(experiment)
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in fit
    self.single_gpu_train(model)
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 503, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 419, in run_training_epoch
    _outputs = self.run_training_batch(batch, batch_idx)
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 597, in run_training_batch
    loss, batch_output = optimizer_closure()
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 561, in optimizer_closure
    output_dict = self.training_forward(split_batch, batch_idx, opt_idx, self.hiddens)
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 727, in training_forward
    output = self.model.training_step(*args)
  File "/home/ilkay/Documents/caner/deepscm/deepscm/experiments/morphomnist/sem_vi/base_sem_experiment.py", line 385, in training_step
    raise ValueError('loss went to nan with metrics:\n{}'.format(metrics))
ValueError: loss went to nan with metrics:
{'log p(x)': tensor(-3502.9570, device='cuda:0', grad_fn=<MeanBackward0>), 'log p(intensity)': tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>), 'log p(thickness)': tensor(-0.9457, device='cuda:0', grad_fn=<MeanBackward0>), 'p(z)': tensor(-22.2051, device='cuda:0', grad_fn=<MeanBackward0>), 'q(z)': tensor(54.3670, device='cuda:0', grad_fn=<MeanBackward0>), 'log p(z) - log q(z)': tensor(-76.5721, device='cuda:0', grad_fn=<SubBackward0>)}
Exception ignored in: <function tqdm.__del__ at 0x7ffb7e9d2320>
Traceback (most recent call last):
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1135, in __del__
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1282, in close
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1467, in display
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1138, in __repr__
  File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1425, in format_dict
TypeError: cannot unpack non-iterable NoneType object

Environment

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Quadro RTX 6000
Nvidia driver version: 440.100
cuDNN version: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-lightning==0.7.6
[pip3] torch==1.7.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.1.0 py37h23d657b_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] numpy 1.19.4 pypi_0 pypi
[conda] pytorch-lightning 0.7.6 pypi_0 pypi
[conda] torch 1.7.1 pypi_0 pypi
[conda] torchvision 0.6.1 py37_cu102 pytorch

canerozer avatar Jan 03 '21 21:01 canerozer

Thanks for your interest in our paper and the code. I've just gotten around to starting to update the dependencies, as it seems you're using at least a newer PyTorch version than what we were using.

I'll get back to you once I've updated everything and have been able to run the experiments with the new versions.

In the meantime, are you using the fixed pyro version? You can also experiment with different (lower) learning rates via --pgm_lr 1e-3 and --lr 5e-5, or get more detailed error messages by using the --validate flag; see the example invocation below.
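Putting those flags together with the command from the report, the full invocation might look like the following (the values are just suggested starting points to experiment with, not confirmed settings):

python -m deepscm.experiments.morphomnist.trainer -e SVIExperiment -m ConditionalDecoderVISEM --data_dir data/morphomnist/ --default_root_dir checkpoints/ --decoder_type fixed_var --gpus 0 --pgm_lr 1e-3 --lr 5e-5 --validate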

However, it seems that only 'log p(intensity)': tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>) went to NaN. It's something we observed occasionally as well, and lowering pgm_lr usually helped :)
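For context, the error in the traceback comes from a guard on the logged metrics in training_step. A minimal sketch of that kind of guard (not the exact deepscm code; the torch.isnan check and the metrics argument are assumptions) would be:

import torch

def check_metrics_for_nan(metrics: dict):
    """Raise if any logged metric (e.g. 'log p(intensity)') has gone to NaN."""
    if any(torch.isnan(v).any() for v in metrics.values()):
        raise ValueError('loss went to nan with metrics:\n{}'.format(metrics))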

pawni avatar Jan 04 '21 18:01 pawni

Hello,

Thanks for your suggestion; the model trained successfully after reducing the PGM learning rate. Just to understand it clearly: the pgm_lr hyperparameter affects only the spline layer of the thickness flow and the affine transformation layer of the intensity flow, right? I will take a look at that paper as soon as possible.

Meanwhile, I have tried to train the normalizing-flow models (all 3 settings), but I noticed that they still tend to go to NaN after only a couple of epochs with a learning rate of 10^-4. I am now trying a learning rate of 10^-5, but I don't know whether that will solve the problem.

BTW, I was using the pyro version you suggested (1.3.1+4b2752f8), but I will update the repository right after completing that training attempt.

Edit: Still failing for the normalizing-flow experiments.

canerozer avatar Jan 10 '21 22:01 canerozer

Oh wait - so you're training a flow-only model?

It's only included in the code here for completeness; we also ran into NaNs when training with flows only, which is why we settled on the VI solution.

As for pgm_lr - yes, it only acts on the flows for the covariates, not on the image components.
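To illustrate how such a split can be wired up, here is a minimal sketch using two parameter groups with different learning rates; the module names covariate_flows and image_vae are hypothetical stand-ins, not the actual deepscm attributes:

import torch
from torch import nn

# Hypothetical stand-ins: in deepscm the covariate (thickness/intensity) flows
# and the image VAE are separate submodules; the real names may differ.
covariate_flows = nn.ModuleList([nn.Linear(1, 2), nn.Linear(1, 2)])
image_vae = nn.Sequential(nn.Conv2d(1, 16, 3), nn.ReLU())

# Two parameter groups so the covariate flows and the image components
# can be trained with different learning rates (cf. --pgm_lr vs --lr).
optimizer = torch.optim.Adam([
    {'params': covariate_flows.parameters(), 'lr': 1e-3},  # --pgm_lr
    {'params': image_vae.parameters(), 'lr': 5e-5},        # --lr
])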

pawni avatar Jan 11 '21 19:01 pawni