jetson-inference
Re-training SSD-Mobilenet: gt_locations contains NaN values, causing the regression loss to become NaN
Hi,
I'm training SSD-Mobilenet Model on Bosch Small Traffic Lights Dataset.
While training, the Avg Loss decreases slowly, but then it suddenly becomes NaN. I tried the following, but the issue persists:
- https://forums.developer.nvidia.com/t/error-training-with-jetson-inference/210095 I have verified the images' XML files and they look fine. Sometimes I don't get any NaN values during epoch 0.
- Tuning the learning rate (0.01, 0.001, 0.0001, etc.)
- Using the Adam optimizer
However, after enabling PyTorch's anomaly detection, i.e. torch.autograd.set_detect_anomaly(True), I was able to find where the NaN originates. With further debugging, I observed that one of the box locations in gt_locations contains NaN values (please refer to the following log):
image_id: 481834
predicted_locations:  tensor([[  1.4837,   1.2564,  -6.5235,  -2.5821],
[  0.6447,   0.8457, -16.9513, -11.4073],
[  2.0294,   0.9745, -15.5438, -14.0698],
[  1.8593,   1.0754, -15.8804, -14.4709],
[  2.0474,   1.3663, -15.7238, -14.4092]],
grad_fn=<ReshapeAliasBackward0>)
gt_locations:  tensor([[ 25.0286,  15.6667,      nan,      nan],
[  4.0797,   2.3779, -13.1398,  -8.8714],
[  4.1841,   2.5611, -14.6530, -13.4025],
[  2.0534,   0.6725, -13.3843, -12.9900],
[  3.5518,   0.3255, -14.6399, -13.4983]])
regression_loss: nan | classification_loss: 3.4250411987304688 | loss: nan
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: Error detected in SmoothL1LossBackward0. Traceback of forward call that caused the error:
File "train_ssd.py", line 409, in 
I think TrainAugmentation is causing this issue, but I'm not sure. To verify that, I want to disable image augmentation. @dusty-nv, could you please let me know how to do that?
Thank you in advance!
I think TrainAugmentation is causing this issue, but I'm not sure. To verify that, I want to disable image augmentation.
I would remove operators one at a time from https://github.com/dusty-nv/pytorch-ssd/blob/21383204c68846bfff95acbbd93d39914a77c707/vision/ssd/data_preprocessing.py#L13 to determine which one is causing the NaNs.
That's a nifty tip about torch.autograd.set_detect_anomaly(), I will have to remember that.
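For anyone who hasn't used it, here is a minimal sketch of how anomaly detection surfaces a NaN during backward. It assumes a toy smooth L1 loss with a hand-written NaN in the target; the tensors are made up for illustration and are not taken from the training run above.

```python
import torch
import torch.nn.functional as F

# Toy predictions and targets (hypothetical values); one target row contains NaN.
pred = torch.zeros(2, 4, requires_grad=True)
gt = torch.tensor([[25.0, 15.0, float("nan"), float("nan")],
                   [4.0, 2.0, -13.0, -8.0]])

# With anomaly detection on, backward() raises at the op whose gradient is NaN.
torch.autograd.set_detect_anomaly(True)
raised = False
try:
    loss = F.smooth_l1_loss(pred, gt, reduction="sum")
    loss.backward()
except RuntimeError as err:
    raised = True
    print("anomaly detected:", err)
finally:
    # Anomaly detection slows training, so switch it off after debugging.
    torch.autograd.set_detect_anomaly(False)
```

Anomaly detection adds overhead, so it is probably best enabled only while hunting for the NaN.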
Hi @dusty-nv, Thank you so much for your prompt response.
As you suggested, I tried removing each operator individually to find the source of the NaN, but all of them produce non-NaN values. However, when I ran an isnan check on the output of target_transform, I was able to locate the issue:
https://github.com/dusty-nv/pytorch-ssd/blob/21383204c68846bfff95acbbd93d39914a77c707/vision/utils/box_utils.py#L115
torch.log(center_form_boxes[..., 2:] / center_form_priors[..., 2:])
The log term above, from the convert_boxes_to_locations function, causes this issue. Please refer to the following log.
2022-09-14 17:11:18 - Epoch: 0, Step: 1383/2195, Avg Loss: 12.3409, Avg Regression Loss 7.9730, Avg Classification Loss: 4.3678
center_form_boxes[..., :2]:  tensor([[0.3156, 0.3982],
[0.7146, 0.3298],
[0.7146, 0.3298],
...,
[0.3233, 0.3999],
[0.3233, 0.3999],
[0.3233, 0.3999]])
center_form_priors[..., :2]:  tensor([[0.0267, 0.0267],
[0.0267, 0.0267],
[0.0267, 0.0267],
...,
[0.5000, 0.5000],
[0.5000, 0.5000],
[0.5000, 0.5000]])
torch.log term:  tensor([[    nan,     nan],
[-3.5644, -2.0298],
[-3.6312, -1.4035],
...,
[-3.8496, -2.7807],
[-4.2475, -2.1801],
[-3.6469, -2.7807]])
2022-09-14 17:11:23 - Epoch: 0, Step: 1384/2195, Avg Loss: 4.9618, Avg Regression Loss 1.6924, Avg Classification Loss: 3.2693
2022-09-14 17:11:28 - Epoch: 0, Step: 1385/2195, Avg Loss: 7.9004, Avg Regression Loss 3.6440, Avg Classification Loss: 4.2564
Traceback (most recent call last):
File "train_ssd.py", line 412, in 
Could you please suggest how to resolve this?
I'm attaching the saved tensor files (saved with torch.save()), which contain the center_form_boxes, center_form_priors, and log term tensors: tensors.zip
center_form_boxes contains negative values, and torch.log of a negative value results in NaN.
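A small sketch of that failure mode, using made-up center-form boxes (cx, cy, w, h) rather than the attached tensors. The first box is given a negative width and height on purpose, and the isnan check picks out the offending row.

```python
import torch

# Hypothetical center-form boxes (cx, cy, w, h); row 0 has a negative size.
center_form_boxes = torch.tensor([[0.3156, 0.3982, -0.0100, -0.0200],
                                  [0.7146, 0.3298, 0.0200, 0.0500]])
# Hypothetical priors with positive sizes.
center_form_priors = torch.tensor([[0.0267, 0.0267, 0.1000, 0.1000],
                                   [0.0267, 0.0267, 0.1000, 0.1000]])

# Same expression as in convert_boxes_to_locations: log of the size ratio.
log_term = torch.log(center_form_boxes[..., 2:] / center_form_priors[..., 2:])

# Flag every row that contains at least one NaN.
bad_rows = torch.isnan(log_term).any(dim=-1)
print(log_term)
print("rows with NaN:", bad_rows.nonzero().flatten().tolist())
```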
 

@dusty-nv please let me know what you think.
I'm not super familiar with all the details of the transforms, as I'm not the original author of the pytorch-ssd code. You could try filing an issue on the upstream GitHub repo for it. Or, if this condition only happens for a few items in your dataset, remove those from the dataset.
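One way to remove such items could look like the sketch below. It assumes corner-form boxes (xmin, ymin, xmax, ymax) in pixels; drop_degenerate_boxes and min_size are hypothetical names for illustration and are not part of pytorch-ssd.

```python
import numpy as np

def drop_degenerate_boxes(boxes, labels, min_size=1.0):
    """Keep only boxes whose width and height are at least min_size pixels.

    Hypothetical helper: boxes are assumed to be (xmin, ymin, xmax, ymax).
    """
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
    labels = np.asarray(labels)
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    keep = (widths >= min_size) & (heights >= min_size)
    return boxes[keep], labels[keep]

# Made-up annotations: the second box has zero width, the third has xmax < xmin.
boxes = [[10, 10, 50, 60], [30, 40, 30, 45], [80, 20, 70, 35]]
labels = [1, 2, 3]
kept_boxes, kept_labels = drop_degenerate_boxes(boxes, labels)
print(kept_boxes, kept_labels)
```

Running a filter like this once over the annotations, before training, would also show how many boxes are affected.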
Hello, I got the same error as you.
I solved it by using the torch.nan_to_num() function, which converts NaN to 0 and can also replace -inf with a custom value.
You can check the documentation here (https://pytorch.org/docs/stable/generated/torch.nan_to_num.html).
I can't promise that it won't affect your model's performance, because I am new to machine learning, but I hope it is helpful to you :D
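A minimal sketch of that workaround, assuming it is applied to the ground-truth locations before the loss is computed. The tensor values and the -20.0 replacement for -inf are arbitrary choices for illustration.

```python
import torch

# Made-up ground-truth locations containing NaN and -inf entries.
gt_locations = torch.tensor([[25.0286, 15.6667, float("nan"), float("nan")],
                             [4.0797, 2.3779, -13.1398, float("-inf")]])

# NaN becomes 0.0 and -inf becomes the chosen custom value; finite entries are unchanged.
cleaned = torch.nan_to_num(gt_locations, nan=0.0, neginf=-20.0)
print(cleaned)
```

Note that this only replaces the bad values. The degenerate box that produced them is still in the dataset, so the effect on accuracy is unknown.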
Same issue; please refer to my code and data, @dusty-nv.
import numpy as np
import torch
import torch.nn.functional as F

# Random stand-in data, kept for reference:
# d_gt = np.random.random([12, 12])
# d_pred = np.random.random([12, 12])
# Tensors saved from the failing training step
d_gt = np.load('C:/Users/liqiang.li/Downloads/20230216_082727__gt_locations.txt.npy')
d_pred = np.load('C:/Users/liqiang.li/Downloads/20230216_082727__predicted_locations.txt.npy')
# reduction='sum' replaces the deprecated size_average=False
smooth_l1_loss = F.smooth_l1_loss(torch.tensor(d_pred),
                                  torch.tensor(d_gt),
                                  reduction='sum')
# smooth_l1_loss  >> nan
print(smooth_l1_loss)
Please refer to my code below, which fixes the bug 👍
@dusty-nv @KhemSon thanks again for your suggestions for locating the bug.
import torch
import torch.nn.functional as F

def convert_boxes_to_locations(center_form_boxes, center_form_priors, center_variance, size_variance):
    # priors can have one dimension less
    if center_form_priors.dim() + 1 == center_form_boxes.dim():
        center_form_priors = center_form_priors.unsqueeze(0)

    # fix NaN bug: apply relu before log so a negative size ratio is clamped to 0,
    # then add 1e-7 so log never receives 0 (leef, 20230223)
    return torch.cat([
        (center_form_boxes[..., :2] - center_form_priors[..., :2]) / center_form_priors[..., 2:] / center_variance,
        torch.log(F.relu(center_form_boxes[..., 2:] / center_form_priors[..., 2:]) + 1e-7) / size_variance
    ], dim=center_form_boxes.dim() - 1)
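To see what the relu + 1e-7 guard does in isolation, here is a small sketch on a made-up vector of size ratios. It covers only the log term, not the full function.

```python
import math
import torch
import torch.nn.functional as F

# Made-up size ratios: negative, zero, and a normal positive value.
ratio = torch.tensor([-0.5, 0.0, 2.0])

# Unguarded log: NaN for the negative ratio, -inf for zero.
plain = torch.log(ratio)

# Guarded log: relu clamps negatives to 0, and 1e-7 keeps the argument positive.
guarded = torch.log(F.relu(ratio) + 1e-7)

print(plain)
print(guarded)
```

With the guard, a degenerate box no longer yields NaN, but it still gets a large negative regression target (about log(1e-7) ≈ -16.1 before dividing by size_variance). Filtering such boxes out of the dataset may be the cleaner option.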