Machine-Learning-Collection
UNET model returns NaN values
I trained the UNET model and got a dice score of 0, with all predictions being black pixels only. I debugged it: after the first "down" block in training, all values of the tensor x are NaN.
```python
for down in self.downs:
    x = down(x)  # after this all values are nan
    skip_connections.append(x)
    x = self.pool(x)
```
My setup: Windows 10, Python 3.10, PyTorch 1.11.0, torchvision 0.12.0.
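For anyone hitting the same thing, this is roughly the check I used to confirm where the NaNs first appear. The names (`self.downs`, `self.pool`, `skip_connections`) are the ones from the repo's UNET forward; the assert is purely a diagnostic you can delete afterwards:

```python
for down in self.downs:
    x = down(x)
    # diagnostic only: fail as soon as any NaN shows up in the activations
    assert not torch.isnan(x).any(), f"NaN after {down.__class__.__name__}"
    skip_connections.append(x)
    x = self.pool(x)
```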
I'm having the same issue and determined that the output of the second convolution in DoubleConv goes to NaN during the first iteration of the down for loop. I haven't found a solution yet.
```python
class DoubleConv(nn.Module):  # a double convolution is performed at each step in UNET, so creating the class simplifies things
    def __init__(self, in_channels, out_channels):  # initialize the class
        super(DoubleConv, self).__init__()  # inherit class properties from nn.Module
        self.conv = nn.Sequential(  # build the sequence of operations
            nn.Conv2d(in_channels, out_channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False),  # NaN after the first call here
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)
```
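To narrow it down without editing the model, something like the hook-based helper below can be dropped into a script (a rough sketch; `find_first_nan` is just a name I made up, and I haven't tested it on this exact repo). It registers a forward hook on every submodule and reports the first one whose output contains NaNs:

```python
import torch

def find_first_nan(model, sample_input):
    """Diagnostic helper: run one forward pass with hooks on every submodule
    and report the first module whose output contains NaN values."""
    found, handles = [], []

    def make_hook(name):
        def hook(module, inputs, output):
            if not found and isinstance(output, torch.Tensor) and torch.isnan(output).any():
                found.append(name)
                print(f"First NaN produced by: {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(sample_input)
    finally:
        for h in handles:
            h.remove()
    return found[0] if found else None
```

Running it with a batch from the loader should point at that second Conv2d if what I described above is right.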
In my case it seems to be related to CUDA. When I run UNET on a dataset on the CPU it runs fine; when I run it on CUDA it returns NaN. However, I don't know how to edit the code to run on the CPU, because GradScaler is CUDA-only.
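One workaround that might help here (I haven't verified it on this exact repo): both GradScaler and autocast accept an `enabled` flag, so the same training code can run on the CPU without removing the scaler at all. A minimal sketch, with made-up names for anything not in the tutorial:

```python
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
USE_AMP = DEVICE == "cuda"  # mixed precision only applies on CUDA

scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP)

def train_step(model, data, targets, optimizer, loss_fn):
    data, targets = data.to(DEVICE), targets.to(DEVICE)

    # forward (autocast is a no-op when enabled=False)
    with torch.cuda.amp.autocast(enabled=USE_AMP):
        predictions = model(data)
        loss = loss_fn(predictions, targets)

    # backward (a disabled scaler degrades to plain backward/step/update)
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```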
I suppose I have the same problem as you guys. I haven't checked the tensor values, but my loss is NaN and all my predicted images remain black. Changing from "cuda" to "cpu" seems to solve the issue, but of course that can't be a real solution. If you still want to try it on your CPU, remove the GradScaler line, remove the scaler parameter from train_fn, and replace the backward code in train_fn with this:
```python
optimizer.zero_grad()
predictions = model(data)
loss = loss_fn(predictions, targets)
loss.backward()
optimizer.step()
```
I dug a little deeper and seemingly found a solution for this issue. The only thing you have to do is disable autocast in train_fn:
```python
# forward
with torch.cuda.amp.autocast(enabled=False):
    ...
```
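For context, this is roughly how that block looks inside train_fn with the change applied (the surrounding predictions/loss_fn/scaler names are the ones I remember from the tutorial's train.py, so treat this as a sketch):

```python
# forward
with torch.cuda.amp.autocast(enabled=False):
    predictions = model(data)
    loss = loss_fn(predictions, targets)

# backward (unchanged; the scaler still works with autocast disabled)
optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```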
For debugging I added the line `torch.autograd.set_detect_anomaly(True)`, which resulted in `RuntimeError: Function 'BinaryCrossEntropyWithLogitsBackward0' returned nan values in its 0th output`. While researching this, I found a similar issue in the official PyTorch repo.