pytorch-retinanet
Loss is coming out as nan
Both loc_loss and cls_loss are coming out as nan. Can you suggest a solution?
This may be caused by the anchor sizes: if the anchors' sizes don't match the objects you are trying to detect, the loss can diverge.
I also ran into this problem.
Both loc_loss and cls_loss are coming out as nan. Can you suggest a solution?
I suggest printing the number of positive examples and adjusting the anchor ratios (and scales) according to that number; see the sketch below.
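For example, a quick way to check whether the anchors fit your objects is to count the positive anchors per batch. This is a minimal diagnostic sketch, not code from the repo; it assumes the label encoding used later in this thread (label > 0 = foreground, 0 = background, -1 = ignored):

```python
import torch

def count_positive_anchors(cls_targets):
    """Count anchors assigned to a foreground class (label > 0).

    cls_targets: LongTensor of encoded labels, sized [batch_size, #anchors].
    If this number is frequently 0, the anchor scales/aspect ratios likely do
    not cover your objects, and a loss normalized by num_pos can become inf/nan.
    """
    return (cls_targets > 0).long().sum().item()

# Hypothetical example: 2 images, 5 anchors each.
cls_targets = torch.tensor([[0, 3, 0, -1, 0],
                            [0, 0, 0, 0, 0]])
print(count_positive_anchors(cls_targets))  # -> 1
```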
Did you solve this problem?
Has anyone else had this problem? Can someone recommend a solution? In my case it looks like this:
loc_loss: 0.086 | cls_loss: 673.821 | Train_loss: 673.90753 | avg_loss: 673.90753
loc_loss: 0.088 | cls_loss: 540.022 | Train_loss: 540.11029 | avg_loss: 607.00891
loc_loss: 0.081 | cls_loss: 589.325 | Train_loss: 589.40613 | avg_loss: 601.14132
loc_loss: 0.081 | cls_loss: 418.840 | Train_loss: 418.92139 | avg_loss: 555.58633
loc_loss: 0.083 | cls_loss: 268.827 | Train_loss: 268.90982 | avg_loss: 498.25103
loc_loss: 0.086 | cls_loss: 211.607 | Train_loss: 211.69376 | avg_loss: 450.49149
loc_loss: 0.106 | cls_loss: 71.394 | Train_loss: 71.49988 | avg_loss: 396.34983
loc_loss: 0.075 | cls_loss: 28.076 | Train_loss: 28.15103 | avg_loss: 350.32498
loc_loss: 0.088 | cls_loss: 19.801 | Train_loss: 19.88938 | avg_loss: 313.60991
loc_loss: 0.086 | cls_loss: 12.623 | Train_loss: 12.70911 | avg_loss: 283.51983
loc_loss: 0.092 | cls_loss: inf | Train_loss: inf | avg_loss: inf
loc_loss: nan | cls_loss: nan | Train_loss: nan | avg_loss: nan
loc_loss: nan | cls_loss: nan | Train_loss: nan | avg_loss: nan
The problem is solved. Here is the rewritten loss.py:
from __future__ import print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class FocalLoss(nn.Module):
    def __init__(self, num_classes):
        super(FocalLoss, self).__init__()
        self.num_classes = num_classes

    def _one_hot_embedding(self, labels):
        """Embed class labels into one-hot form.

        Args:
          labels (LongTensor): class labels, sized [N,].
        Returns:
          (tensor): encoded labels, sized [N, #classes + 1].
        """
        y = torch.eye(self.num_classes + 1)  # [D, D] identity, D = #classes + 1
        return y[labels]                     # [N, D]

    def focal_loss(self, x, y):
        """Focal loss.

        Args:
          x (tensor): predicted class confidences, sized [N, #classes].
          y (tensor): target labels, sized [N,].
        Returns:
          (tensor): focal loss.
        """
        alpha = 0.25
        gamma = 2
        t = self._one_hot_embedding(y.data.cpu())  # [N, #classes + 1]
        t = t[:, 1:]                               # exclude background column -> [N, #classes]
        t = Variable(t).cuda()
        logit = F.softmax(x, dim=1)
        logit = logit.clamp(1e-7, 1. - 1e-7)       # avoid log(0), which produces inf/nan
        conf_loss_tmp = -1 * t.float() * torch.log(logit)
        conf_loss_tmp = alpha * conf_loss_tmp * (1 - logit) ** gamma
        conf_loss = conf_loss_tmp.sum()
        return conf_loss

    def forward(self, loc_preds, loc_targets, cls_preds, cls_targets):
        """Compute loss between (loc_preds, loc_targets) and (cls_preds, cls_targets).

        Args:
          loc_preds (tensor): predicted locations, sized [batch_size, #anchors, 4].
          loc_targets (tensor): encoded target locations, sized [batch_size, #anchors, 4].
          cls_preds (tensor): predicted class confidences, sized [batch_size, #anchors, #classes].
          cls_targets (tensor): encoded target labels, sized [batch_size, #anchors].
        Returns:
          (tensor): loss = SmoothL1Loss(loc_preds, loc_targets) + FocalLoss(cls_preds, cls_targets).
        """
        pos = cls_targets > 0  # [N, #anchors]
        num_pos = pos.data.long().sum()

        # loc_loss = SmoothL1Loss(pos_loc_preds, pos_loc_targets)
        mask = pos.unsqueeze(2).expand_as(loc_preds)         # [N, #anchors, 4]
        masked_loc_preds = loc_preds[mask].view(-1, 4)       # [#pos, 4]
        masked_loc_targets = loc_targets[mask].view(-1, 4)   # [#pos, 4]
        loc_loss = F.smooth_l1_loss(masked_loc_preds, masked_loc_targets, size_average=False)

        # cls_loss = FocalLoss(cls_preds, cls_targets)
        pos_neg = cls_targets > -1  # exclude ignored anchors
        # num_pos_neg = pos_neg.data.long().sum()
        mask = pos_neg.unsqueeze(2).expand_as(cls_preds)
        masked_cls_preds = cls_preds[mask].view(-1, self.num_classes)
        cls_loss = self.focal_loss(masked_cls_preds, cls_targets[pos_neg])

        # Guard against num_pos == 0, which otherwise makes the normalized loss inf/nan.
        num_pos = max(1.0, num_pos.item())
        print('loc_loss: %.3f | cls_loss: %.3f'
              % (loc_loss.item() / num_pos, cls_loss.item() / num_pos), end=' | ')
        loss = loc_loss / num_pos + cls_loss / num_pos
        return loss
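For anyone who wants a quick sanity check, here is a minimal smoke test for the class above. The shapes and the 20-class (VOC-style) setting are illustrative assumptions, and it needs a CUDA device because focal_loss moves tensors to the GPU:

```python
# Hypothetical smoke test; assumes the FocalLoss class above is in scope.
import torch

num_classes, num_anchors, batch_size = 20, 100, 2
criterion = FocalLoss(num_classes).cuda()

loc_preds = torch.randn(batch_size, num_anchors, 4).cuda()
loc_targets = torch.randn(batch_size, num_anchors, 4).cuda()
cls_preds = torch.randn(batch_size, num_anchors, num_classes).cuda()
# Encoded labels in {-1 (ignored), 0 (background), 1..20 (object classes)}.
cls_targets = torch.randint(-1, num_classes + 1, (batch_size, num_anchors)).long().cuda()

loss = criterion(loc_preds, loc_targets, cls_preds, cls_targets)
print(loss.item())
```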
Thanks, it worked. But can you tell me which statement you changed? I tried to find it but couldn't. Thank you @miramind
Where did you get ckpt.pth and params.pth? Please help me, @heartInsert. Thank you.
I don't have a pretrained model; I trained it myself on the VOC dataset. @Imagery007
Thanks. I didn't understand before. Actually, net.pth can be trained without ckpt.pth and params.pth. @heartInsert Thanks again.
@Imagery007 Have you run prediction on a real image and drawn the bboxes on it? I think there is a bug in training, because I can't get correct bboxes.
@heartInsert Yes, I can't run test.py. I still don't know how to solve it.
RuntimeError: Error(s) in loading state_dict for RetinaNet: Missing key(s) in state_dict: "fpn.conv1.weight", "fpn.bn1.weight", "fpn.bn1.bias", "fpn.bn1.running_mean", "fpn.bn1.running_var", "fpn.layer1.0.conv1.weight", "fpn.layer1.0.bn1.weight", "fpn.layer1.0.bn1.bias", "fpn.layer1.0.bn1.running_mean", "fpn.layer1.0.bn1.running_var", "fpn.layer1.0.conv2.weight", "fpn.layer1.0.bn2.weight", "fpn.layer1.0.bn2.bias", "fpn.layer1.0.bn2.running_mean", "fpn.layer1.0.bn2.running_var", "fpn.layer1.0.conv3.weight", "fpn.layer1.0.bn3.weight", "fpn.layer1.0.bn3.bias", "fpn.layer1.0.bn3.running_mean", "fpn.layer1.0.bn3.running_var", "fpn.layer1.0.downsample.0.weight",............
@heartInsert I read the code carefully and successfully ran test.py. I found that test.py only draws the filtered anchors and cannot really make predictions.
@miramind Hello, I would like to know the meaning of t on line 43 of loss.py. Hope to get your reply. Thanks.
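In case it helps: t is simply the one-hot encoding of the target labels. A minimal illustration of what _one_hot_embedding returns (using 3 classes for brevity, not the repo's actual setting):

```python
import torch

num_classes = 3
labels = torch.tensor([0, 2, 3])   # 0 = background, 1..3 = object classes
y = torch.eye(num_classes + 1)     # [D, D] identity matrix, D = num_classes + 1
t = y[labels]                      # [N, D]: one row of the identity per label
print(t)
# tensor([[1., 0., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.]])
# t[:, 1:] then drops the background column, leaving one-hot vectors over object classes.
```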
The effective fix is the line just before the print statement:
num_pos = max(1.0, num_pos.item())
It turns num_pos into a floating-point number, so loc_loss.item() / num_pos is a floating-point result as well, and more importantly it keeps the denominator from being zero.
In my experiments (custom data and VOC), the classification loss could become nan, and the reason is that num_pos may be 0.
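To make the failure mode concrete, here is a minimal illustration (not code from the repo) of what happens when a batch contains no positive anchors:

```python
import torch

# With no positive anchors, the summed loc_loss is 0 and num_pos is 0:
loc_loss, cls_loss, num_pos = torch.tensor(0.), torch.tensor(12.5), 0
print(loc_loss / num_pos, cls_loss / num_pos)   # tensor(nan) tensor(inf)

# The guard keeps the denominator at least 1.0:
num_pos = max(1.0, float(num_pos))
print(loc_loss / num_pos, cls_loss / num_pos)   # tensor(0.) tensor(12.5000)
```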
@heartInsert Yes, I can't run test.py. I still don't know how to solve it.
RuntimeError: Error(s) in loading state_dict for RetinaNet: Missing key(s) in state_dict: "fpn.conv1.weight", "fpn.bn1.weight", "fpn.bn1.bias", "fpn.bn1.running_mean", "fpn.bn1.running_var", "fpn.layer1.0.conv1.weight", "fpn.layer1.0.bn1.weight", "fpn.layer1.0.bn1.bias", "fpn.layer1.0.bn1.running_mean", "fpn.layer1.0.bn1.running_var", "fpn.layer1.0.conv2.weight", "fpn.layer1.0.bn2.weight", "fpn.layer1.0.bn2.bias", "fpn.layer1.0.bn2.running_mean", "fpn.layer1.0.bn2.running_var", "fpn.layer1.0.conv3.weight", "fpn.layer1.0.bn3.weight", "fpn.layer1.0.bn3.bias", "fpn.layer1.0.bn3.running_mean", "fpn.layer1.0.bn3.running_var", "fpn.layer1.0.downsample.0.weight",............
I have the same problem as you. Can you tell me how to solve it?
Hi there, any luck with this? I'm having the same trouble and would love to know how to solve it.
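For reference, two common causes of a "Missing key(s) in state_dict" error like the one above are (1) loading a wrapper checkpoint dict instead of the state_dict nested inside it, and (2) weights saved from an nn.DataParallel model carrying a 'module.' prefix. The sketch below is a hedged guess only; the checkpoint layout, module path, and constructor arguments are assumptions, not confirmed details of this repo:

```python
# Hedged sketch: adapt names/paths to your own setup.
import torch
from retinanet import RetinaNet  # assumed import; constructor args may differ

net = RetinaNet()
checkpoint = torch.load('ckpt.pth', map_location='cpu')

# 1) The file may store a wrapper dict (e.g. {'net': state_dict, ...}) rather
#    than a raw state_dict.
state_dict = checkpoint['net'] if 'net' in checkpoint else checkpoint

# 2) Strip a possible 'module.' prefix left over from nn.DataParallel.
state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
              for k, v in state_dict.items()}

net.load_state_dict(state_dict)
```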