pytorch-ssd

Average regression loss and classification loss are nan, or average regression loss is inf, when training the vgg16 model

Open owenvt1 opened this issue 5 years ago • 4 comments

This happens when running the examples for training on VOC, as well as Open Images. My goal is to train on Open Images.

owenvt1 • Dec 18 '19 01:12

Solved by reducing the learning rate to 0.00001
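As a toy illustration of why a smaller learning rate can make inf/nan losses disappear, the sketch below runs plain gradient descent on a one-dimensional quadratic. It is not the SSD loss or this repository's training loop; the function name `run_gradient_descent` and the `curvature` value are made up for the example.

```python
import math


def run_gradient_descent(lr, steps=400, x0=1.0, curvature=50.0):
    """Minimize loss(x) = curvature * x^2 and return the loss seen at each step.

    Hypothetical toy problem: it only shows how an oversized step makes the
    iterate overshoot, grow without bound, and finally overflow.
    """
    x = x0
    losses = []
    for _ in range(steps):
        loss = curvature * x * x      # overflows to inf once |x| is huge
        losses.append(loss)
        grad = 2.0 * curvature * x
        x = x - lr * grad             # inf - inf evaluates to nan
    return losses


large_step = run_gradient_descent(lr=0.1)    # each step multiplies x by about -9
small_step = run_gradient_descent(lr=1e-5)   # each step multiplies x by 0.999

print("lr=0.1  first non-finite step:",
      next(i for i, l in enumerate(large_step) if not math.isfinite(l)))
print("lr=1e-5 all losses finite:", all(math.isfinite(l) for l in small_step))
```

With the large step the loss first becomes inf and later nan, which is the same pattern reported in this issue. A lower learning rate can hide the symptom even when the root cause lies elsewhere, for example in the data.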

owenvt1 • Dec 18 '19 01:12

Late to the party, but this is a problem with the new version of torchvision (0.5). It is hard to predict, and it sometimes shows up in places you would not think to look, but it produces NaNs during training (or even inference). In this case it gives nan/inf, as you described, even when training only on VOC.

The solution is to downgrade torchvision (I am now using 0.2 without problems).
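Before and after a downgrade it helps to confirm which versions are actually installed in the active environment. A minimal sketch using only the standard library (Python 3.8+); the helper name `installed_versions` is hypothetical:

```python
from importlib import metadata


def installed_versions(packages=("torch", "torchvision")):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Each torchvision release is built against a specific torch release, so the two usually have to be downgraded together.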

TheRevanchist • Feb 19 '20 00:02

@TheRevanchist I got the same issue. I have reduced the learning rate and downgraded torchvision. Which torch version do you use?

ijalalfrz • Jun 22 '20 07:06

I found this was an issue with the annotation of one of the data samples:

<bndbox>
<xmin>0</xmin>
<ymin>370</ymin>
<xmax>0</xmax>
<ymax>407</ymax>
</bndbox>

xmin and xmax have the same value, which doesn't make sense for a bounding box because it has zero width. This was a fault of the data augmentation tool. I found the problem by first training on 1 sample, then 10 samples, then 100, and so on, until I saw the failure at 10,000, and then backtracked to find the exact offending sample.
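A zero-width box is a plausible source of the inf/nan values, since the usual SSD box encoding takes the logarithm of the box width relative to a prior box, and log(0) is -inf. Annotations can be checked for such boxes directly. The sketch below assumes Pascal VOC style XML in which every `bndbox` element has `xmin`, `ymin`, `xmax`, and `ymax` children; the function name `find_degenerate_boxes` and the sample annotation are made up for illustration.

```python
import xml.etree.ElementTree as ET


def find_degenerate_boxes(xml_text):
    """Return (xmin, ymin, xmax, ymax) for every box with zero or negative width or height."""
    root = ET.fromstring(xml_text)
    bad = []
    for box in root.iter("bndbox"):
        # Assumes all four coordinate tags are present and numeric.
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        if xmax <= xmin or ymax <= ymin:
            bad.append((xmin, ymin, xmax, ymax))
    return bad


# Hypothetical annotation: the first box is the one from this thread, the second is valid.
SAMPLE_ANNOTATION = """
<annotation>
  <object>
    <name>example</name>
    <bndbox><xmin>0</xmin><ymin>370</ymin><xmax>0</xmax><ymax>407</ymax></bndbox>
  </object>
  <object>
    <name>example</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>50</xmax><ymax>80</ymax></bndbox>
  </object>
</annotation>
"""

print(find_degenerate_boxes(SAMPLE_ANNOTATION))  # [(0.0, 370.0, 0.0, 407.0)]
```

Running a check like this over every annotation file before training finds bad samples without having to grow the training subset until the failure appears.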

Abdob • Mar 12 '21 18:03