Probabilistic-Unet-Pytorch

Getting NaN tensor from encoder

zabboud opened this issue on May 08 '22

Hello - I've been hitting this issue consistently while running the code as is, with the LIDC-IDRI data (downloaded from the provided link). The error is raised at the line `dist = Independent(Normal(loc=mu, scale=torch.exp(log_sigma)), 1)` inside the `AxisAlignedConvGaussian` class, because mu = tensor([[nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan]], device='cuda:0', grad_fn=<SliceBackward0>).

When I trace back where the NaNs come from, they originate in the encoder (the output of `encoding = self.encoder(input)`), all the way back to the output of the `forward` method of the `Encoder` class.

The issue persists regardless of batch size (I've run it with batch sizes 5 and 10), and I still get the error within the first epoch, seemingly at random after a few runs.

I've verified the input and it looks fine: the images are what is expected (inspected visually), and some masks are all zeros while others contain labeled regions. Nothing out of the ordinary.

I haven't yet been able to track down why this occurs. Others seem to have experienced a similar issue, but on the loss side; the issue I'm seeing is in the forward pass, so it is independent of the loss.

Any insight would be appreciated! The full error:

ValueError: Expected parameter loc (Tensor of shape (10, 2)) of distribution Normal(loc: torch.Size([10, 2]), scale: torch.Size([10, 2])) to satisfy the constraint Real(), but found invalid values: tensor([[nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan]], device='cuda:0', grad_fn=<SliceBackward0>)
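For reference, this is a minimal sketch of how one could localize the first NaN with forward hooks (`add_nan_hooks` and `net` are illustrative names, not part of this repo); `torch.autograd.set_detect_anomaly(True)` does the analogous job for the backward pass:

```python
import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module) -> None:
    """Raise as soon as any submodule produces a NaN in its forward pass."""
    def check(module, inputs, output):
        outputs = output if isinstance(output, (tuple, list)) else (output,)
        for out in outputs:
            if torch.is_tensor(out) and torch.isnan(out).any():
                raise RuntimeError(f"NaN in output of {type(module).__name__}")
    for submodule in model.modules():
        submodule.register_forward_hook(check)

# add_nan_hooks(net)  # run a forward pass afterwards to find the first NaN
```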

- zabboud, May 08 '22 02:05

Hi Zabboud,

I'm working on my own implementation, so I can't comment on this exact codebase. But here's what I found: changing the initialisation (in this case from `nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')` to `nn.init.normal_(m.weight, std=0.001)`) solves the problem for me.
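For concreteness, a minimal sketch of that swap, assuming weights are initialized via `model.apply` as in many PyTorch codebases (`init_weights` is an illustrative name, not this repo's function):

```python
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    """Replace Kaiming init with a small-std normal init on conv layers."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        # was: nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
        nn.init.normal_(m.weight, std=0.001)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# model.apply(init_weights)  # applies recursively to every submodule
```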

Curious if you found some more information on this in the meantime!

- JasperLinmans, Jun 17 '22 11:06

Actually, decreasing the learning rate fixed the problem for me. I'm still unsure why it happens; do you have an idea? I'd be interested to test out the different initialization!

- zabboud, Jul 06 '22 15:07

The model is quite sensitive; a learning rate that is too high, as well as certain initialization methods, can cause the loss to go to NaN.
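For example, a minimal sketch of two common stabilizers: a smaller Adam learning rate, plus gradient-norm clipping (the clipping is a standard addition, not something this repo does; `net`, `loss`, and the value 1e-5 are illustrative):

```python
import torch

# Assumed: `net` is the model and `loss` its training loss for one batch.
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5)  # lower than Adam's 1e-3 default

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)  # cap gradient norm
optimizer.step()
```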

- stefanknegt, Sep 14 '22 07:09

Thank you - I figured that part out. I was wondering whether you have some insight into why, on other datasets, the loss does not seem to decrease even though I can see through visual feedback that the predictions are improving?

- zabboud, Sep 14 '22 17:09

Hmm, I think you should look at the two components of the loss function (the reconstruction term and the KL term) and how they evolve over time. Maybe that can give you some insight into why the loss is not decreasing while the predictions seem to improve.
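For example, assuming the loss is assembled as `elbo = recon_loss + beta * kl` as in the Probabilistic U-Net paper (the variable names below are illustrative, not this repo's exact API), logging both terms each step could look like:

```python
import torch

# Assumed: `recon_loss` is the pixel-wise cross-entropy, `prior` and
# `posterior` are the two AxisAlignedConvGaussian distributions, and
# `beta` weights the KL term.
kl = torch.distributions.kl_divergence(posterior, prior).mean()
elbo = recon_loss + beta * kl
print(f"recon={recon_loss.item():.4f}  kl={kl.item():.4f}  elbo={elbo.item():.4f}")
```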

- stefanknegt, Sep 15 '22 07:09

Both the total ELBO loss and the KL loss are just stagnant - there's little to no change. Do you have any suggestions on which parameters to tune (latent dimension, gamma, beta, `num_convs_fcomb`)? I've been playing around with the preprocessing of the data (liver dataset), but with no luck getting the model to learn to predict lesion locations.

I've tested the model on the lung dataset and it works: I get some diversity in the predictions, and the loss progresses. Unfortunately there's no progress on the liver dataset, whether predicting the liver or the lesions.

- zabboud, Sep 15 '22 15:09

I'm not sure why that happens, and I suspect that changing things like the latent dimension and `num_convs_fcomb` is not going to help. I've only tested it on LIDC, and although I sometimes had issues with the loss, it never remained stagnant. Good luck!

- stefanknegt, Sep 16 '22 11:09

Thank you - I realized that the KL divergence term often goes to 0. What would cause that? It's probably an indicator of why the model is not training properly.
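From what I understand, a KL term that goes to 0 usually indicates posterior collapse: the posterior matches the prior, so the latent code carries no information about the segmentation. One common mitigation, sketched below with illustrative names (not this repo's API), is to anneal the KL weight `beta` up from 0:

```python
def kl_weight(step: int, warmup_steps: int = 10_000, beta_max: float = 1.0) -> float:
    """Linearly anneal the KL weight from 0 to beta_max over warmup_steps."""
    return beta_max * min(1.0, step / warmup_steps)

# elbo = recon_loss + kl_weight(global_step) * kl
```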

- zabboud, Sep 19 '22 15:09