
UNET model returns NaN values

Open mlaszko opened this issue 3 years ago • 4 comments

I trained the UNET model and got a dice score of 0; all predictions are just black pixels. I debugged it: after the first "down" block in training, all values of the tensor x are NaN.

for down in self.downs:
    x = down(x)  # after this, all values of x are NaN
    skip_connections.append(x)
    x = self.pool(x)
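
A quick way to confirm which block blows up is to check every intermediate tensor for non-finite values. A minimal sketch (assert_finite is a helper I made up; self.downs / self.pool are the names from the forward pass above):

import torch

def assert_finite(x: torch.Tensor, where: str) -> None:
    """Raise immediately if a tensor contains NaN or Inf, naming the offending block."""
    if not torch.isfinite(x).all():
        raise RuntimeError(f"non-finite values after {where}")

# usage inside UNET.forward (sketch):
# for i, down in enumerate(self.downs):
#     x = down(x)
#     assert_finite(x, f"down block {i}")
#     skip_connections.append(x)
#     x = self.pool(x)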

My setup: Windows 10, Python 3.10, PyTorch 1.11.0, torchvision 0.12.0

mlaszko avatar Jun 03 '22 08:06 mlaszko

I'm having the same issue and determined that the output of the second convolution in DoubleConv goes to NaN during the first iteration of the down for loop. I haven't found a solution yet.

class DoubleConv(nn.Module): # a double convolution is performed at each step in UNET, so creating the class simplifies things
    def __init__(self, in_channels, out_channels): # Initialize the class
        super(DoubleConv, self).__init__() # Inherit class properties from nn.Module
        self.conv = nn.Sequential( # Build sequence of operations
            nn.Conv2d(in_channels, out_channels, 3, 1, 1, bias=False), 
            nn.BatchNorm2d(out_channels), 
            nn.ReLU(inplace=True), 
            nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False), # NaN appears after the first call here
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)
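
To narrow it down further you can register forward hooks on every submodule and print the first ones whose output goes non-finite. This is just a debugging sketch, not code from the repo (add_nan_hooks is a made-up helper; UNET is the model class from the tutorial):

import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module) -> None:
    """Print every module whose output contains NaN/Inf during a forward pass."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"non-finite output in {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        if name:  # skip the top-level module itself
            module.register_forward_hook(make_hook(name))

# usage (assumed setup):
# model = UNET(in_channels=3, out_channels=1).to(device)
# add_nan_hooks(model)
# model(batch)  # prints e.g. the second Conv2d inside DoubleConv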

jtwegner23 avatar Sep 01 '22 00:09 jtwegner23

In my case it seems to be related to CUDA. When I run UNET on a dataset on the CPU, it runs fine; when I run it on CUDA, it returns NaN. However, I don't know how to edit the code to run on the CPU, because GradScaler is CUDA-only.
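
One thing that might work: GradScaler accepts enabled=False, in which case scale() returns its input unchanged, step() just calls optimizer.step(), and update() is a no-op, so the same train_fn runs on CPU without edits. A sketch of that idea (variable names follow the tutorial's train_fn; the rest is assumption):

import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# With enabled=False every scaler call is a pass-through, so the training loop is unchanged on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

# inside the training loop:
# with torch.cuda.amp.autocast(enabled=use_cuda):
#     predictions = model(data)
#     loss = loss_fn(predictions, targets)
# optimizer.zero_grad()
# scaler.scale(loss).backward()
# scaler.step(optimizer)
# scaler.update()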

jtwegner23 avatar Sep 02 '22 01:09 jtwegner23

I suppose I have the same problem as you guys. I haven't checked the tensor values, but my loss is nan and all my predicted images stay black. Changing from "cuda" to "cpu" seems to avoid the issue, but of course that can't be a real solution. If you still want to try it on your CPU, you can remove the GradScaler line, remove the scaler parameter from train_fn, and replace the backward code in train_fn with this:

optimizer.zero_grad()
predictions = model(data)              # plain forward pass, no autocast
loss = loss_fn(predictions, targets)
loss.backward()                        # no GradScaler needed on CPU
optimizer.step()
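
For completeness, the whole CPU-only train_fn then looks roughly like this (a sketch built around the snippet above; tqdm and the exact argument names are my assumptions, not the tutorial's code):

from tqdm import tqdm

def train_fn(loader, model, optimizer, loss_fn, device="cpu"):
    """One epoch of plain full-precision training; no GradScaler needed on CPU."""
    loop = tqdm(loader)
    for data, targets in loop:
        data = data.to(device)
        targets = targets.float().to(device)  # adjust dtype/shape to whatever loss_fn expects

        optimizer.zero_grad()
        predictions = model(data)
        loss = loss_fn(predictions, targets)
        loss.backward()
        optimizer.step()

        loop.set_postfix(loss=loss.item())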

lpetflo avatar May 09 '23 16:05 lpetflo

I dug a little deeper and seemingly found a solution for this issue. The only thing you have to do is disable autocast in train_fn:

# forward
with torch.cuda.amp.autocast(enabled=False):
    ...
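
In context, the forward/backward part of train_fn then looks roughly like this (a sketch using the variable names from the train_fn discussed above; the GradScaler calls still work with autocast off, everything just stays in float32):

# forward (mixed precision disabled, activations stay in float32)
with torch.cuda.amp.autocast(enabled=False):
    predictions = model(data)
    loss = loss_fn(predictions, targets)

# backward (the scaler calls remain valid; they now just scale float32 gradients)
optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()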

For debugging I added the line `torch.autograd.set_detect_anomaly(True)`, which resulted in `RuntimeError: Function 'BinaryCrossEntropyWithLogitsBackward0' returned nan values in its 0th output`, and while researching I found a similar issue in the official PyTorch repo.
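
For anyone who wants to reproduce that check, anomaly detection is a single call near the top of the training script:

import torch

# Makes autograd raise an error (with a traceback to the offending op) as soon as a backward pass produces NaN
torch.autograd.set_detect_anomaly(True)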

lpetflo avatar May 10 '23 09:05 lpetflo