faster-rcnn.pytorch
Training Loss: NaN
My training loss always becomes NaN after several hundred iterations. All parameters are at their defaults. My training dataset works with py-faster-rcnn, and I copied it into the faster-rcnn.pytorch directory. My training command is: `python trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --lr 0.001 --cuda`
Are there any suggestions for this? Thanks!
This is my training print-out:

```
[session 1][epoch 1][iter 0] loss: 6.3588, lr: 1.00e-03
        fg/bg=(20/236), time cost: 1.659386
        rpn_cls: 0.8163, rpn_box: 4.6926, rcnn_cls: 0.8488, rcnn_box 0.0010
[session 1][epoch 1][iter 100] loss: 1.0697, lr: 1.00e-03
        fg/bg=(24/232), time cost: 33.015444
        rpn_cls: 0.1404, rpn_box: 0.6121, rcnn_cls: 0.2425, rcnn_box 0.1688
[session 1][epoch 1][iter 200] loss: 0.7961, lr: 1.00e-03
        fg/bg=(43/213), time cost: 33.076333
        rpn_cls: 0.1488, rpn_box: 1.1833, rcnn_cls: 0.3630, rcnn_box 0.2185
[session 1][epoch 1][iter 300] loss: nan, lr: 1.00e-03
        fg/bg=(256/0), time cost: 33.628527
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 400] loss: nan, lr: 1.00e-03
        fg/bg=(256/0), time cost: 32.910808
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 500] loss: nan, lr: 1.00e-03
        fg/bg=(256/0), time cost: 32.843017
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
        fg/bg=(256/0), time cost: 32.721040
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 700] loss: nan, lr: 1.00e-03
        fg/bg=(256/0), time cost: 33.876777
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 800] loss: nan, lr: 1.00e-03
        fg/bg=(256/0), time cost: 33.819963
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
```
Hi @xiaoxingzeng, I have the same problem but sadly I still have no answer. Are you using a custom dataset? Maybe some of the answers in this issue can help you. Good luck!
@JavHaro @xiaoxingzeng I am having the same issue of NaN values with a custom dataset. Did you find a solution? Thanks
I have the same problem with my own dataset. Did you find a solution? Thanks a lot!
```
[session 1][epoch 1][iter 5400/33021] loss: 0.3792, lr: 1.00e-03
        fg/bg=(113/399), time cost: 129.585304
        rpn_cls: 0.0753, rpn_box: 0.2207, rcnn_cls: 0.1785, rcnn_box 0.2806
[session 1][epoch 1][iter 5500/33021] loss: 0.3538, lr: 1.00e-03
        fg/bg=(42/470), time cost: 129.476077
        rpn_cls: 0.0916, rpn_box: 0.1274, rcnn_cls: 0.0800, rcnn_box 0.1175
[session 1][epoch 1][iter 5600/33021] loss: nan, lr: 1.00e-03
        fg/bg=(512/0), time cost: 122.264345
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 5700/33021] loss: nan, lr: 1.00e-03
        fg/bg=(512/0), time cost: 119.232911
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
```
Hi, all, If possible, it would be good to share your data loaders with us, so that we can check why that happened.
@jwyang I use the Caltech dataset for pedestrian detection. During training, I print num_boxes and gt_boxes right after the "print loss" block, like this:
```python
print("[session %d][epoch %2d][iter %4d] loss: %.4f, lr: %.2e"
      % (args.session, epoch, step, loss_temp, lr))
print("\t\t\tfg/bg=(%d/%d), time cost: %f" % (fg_cnt, bg_cnt, end - start))
print("\t\t\trpn_cls: %.4f, rpn_box: %.4f, rcnn_cls: %.4f, rcnn_box %.4f"
      % (loss_rpn_cls, loss_rpn_box, loss_rcnn_cls, loss_rcnn_box))

gt_boxes_cpu = gt_boxes.cpu().data.numpy()
im_data_cpu = im_data.cpu().data
for i in range(args.batch_size):
    num_gt = num_boxes.data[i]
    print("the %dth image have num_boxes: %d and the gt_boxes are:" % (i, num_gt))
    print(gt_boxes_cpu[i][:num_gt])
    img = im_data_cpu[i].permute(1, 2, 0).numpy()
    # _vis_minibatch(img, gt_boxes_cpu[i][:num_gt])
```

I even visualize the image and gt_boxes like this:

```python
def _vis_minibatch(im_blob, rois_blob):
    """Visualize a mini-batch for debugging."""
    import matplotlib.pyplot as plt
    for i in xrange(rois_blob.shape[0]):
        rois = rois_blob[i, :]
        # im_ind = rois[0]
        roi = rois[:4]
        im = im_blob[:, :, :].copy()
        im += cfg.PIXEL_MEANS
        im = im[:, :, (2, 1, 0)]
        im = im.astype(np.uint8)
        plt.imshow(im)
        plt.gca().add_patch(
            plt.Rectangle((roi[0], roi[1]),
                          roi[2] - roi[0],
                          roi[3] - roi[1],
                          fill=False, edgecolor='r', linewidth=3))
        plt.show()
```
My training command is: `python trainval_net.py --dataset caltech --net vgg16 --bs 2 --gpu 0 --cuda`

This is my training print-out:
```
[session 1][epoch 1][iter 100] loss: 0.6892, lr: 1.00e-03
        fg/bg=(19/493), time cost: 125.878273
        rpn_cls: 0.2557, rpn_box: 0.4797, rcnn_cls: 0.1704, rcnn_box 0.0634
the 0th image have num_boxes: 3 and the gt_boxes are:
[[ 465.   215.   496.25 271.25   1.  ]
 [ 536.25 206.25 561.25 255.     1.  ]
 [ 168.75 216.25 218.75 303.75   1.  ]]
the 1th image have num_boxes: 3 and the gt_boxes are:
[[ 548.75 210.   550.   238.75   1.  ]
 [ 175.   216.25 227.5  303.75   1.  ]
 [ 465.   215.   496.25 271.25   1.  ]]
[session 1][epoch 1][iter 200] loss: 0.5355, lr: 1.00e-03
        fg/bg=(20/492), time cost: 124.361140
        rpn_cls: 0.2848, rpn_box: 0.4464, rcnn_cls: 0.1204, rcnn_box 0.0524
the 0th image have num_boxes: 6 and the gt_boxes are:
[[ 388.75 207.5  402.5  253.75   1.  ]
 [ 511.25 193.75 523.75 232.5    1.  ]
 [ 327.5  195.   345.   242.5    1.  ]
 [ 493.75 193.75 512.5  238.75   1.  ]
 [ 407.5  195.   425.   252.5    1.  ]
 [ 210.   190.   236.25 253.75   1.  ]]
the 1th image have num_boxes: 5 and the gt_boxes are:
[[ 450.   200.   451.25 251.25   1.  ]
 [ 506.25 206.25 523.75 250.     1.  ]
 [ 258.75 202.5  273.75 250.     1.  ]
 [ 497.5  201.25 512.5  246.25   1.  ]
 [ 422.5  211.25 435.   252.5    1.  ]]
[session 1][epoch 1][iter 300] loss: 0.4500, lr: 1.00e-03
        fg/bg=(4/508), time cost: 124.083962
        rpn_cls: 0.1161, rpn_box: 0.0771, rcnn_cls: 0.0267, rcnn_box 0.0117
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 720.   243.75 743.75 303.75   1.  ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 720.   243.75 743.75 303.75   1.  ]]
[session 1][epoch 1][iter 400] loss: 0.4334, lr: 1.00e-03
        fg/bg=(3/509), time cost: 123.896316
        rpn_cls: 0.2213, rpn_box: 0.0603, rcnn_cls: 0.0627, rcnn_box 0.0038
the 0th image have num_boxes: 1 and the gt_boxes are:
[[  22.5  225.    45.   261.25   1.  ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[  28.75 230.    37.5  257.5    1.  ]]
[session 1][epoch 1][iter 500] loss: 0.4303, lr: 1.00e-03
        fg/bg=(19/493), time cost: 124.256176
        rpn_cls: 0.1117, rpn_box: 0.2408, rcnn_cls: 0.0595, rcnn_box 0.0675
the 0th image have num_boxes: 3 and the gt_boxes are:
[[ 178.75 205.   208.75 277.5    1.  ]
 [ 541.25 217.5  558.75 257.5    1.  ]
 [ 595.   210.   618.75 268.75   1.  ]]
the 1th image have num_boxes: 3 and the gt_boxes are:
[[ 183.75 206.25 213.75 277.5    1.  ]
 [ 592.5  208.75 616.25 267.5    1.  ]
 [ 541.25 217.5  558.75 257.5    1.  ]]
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
        fg/bg=(512/0), time cost: 124.420586
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 408.75 225.   427.5  247.5    1.  ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 380.   233.75 398.75 256.25   1.  ]]
[session 1][epoch 1][iter 700] loss: nan, lr: 1.00e-03
        fg/bg=(512/0), time cost: 124.281587
        rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 610.   176.25 638.75 248.75   1.  ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 606.25 176.25 635.   248.75   1.  ]]
```
I think im_data and gt_boxes are correct before being fed into the network, but during training the "fg/bg" counts are abnormal: the number of "fg" samples stays very small until "fg/bg" suddenly becomes "512/0". I have been stuck on this problem for a week and I hope I can get some useful advice soon. Thanks a lot!
Hi, @sdlyaoyuan , I will take a look at this problem, and try to solve it as soon as possible.
Hi @jwyang, I had the same issue these days when I tried to train on my customized dataset. The same thing (NaN) happened when "fg/bg" reached 1024/0. It seems that NaN occurs when there are no bg samples during ROI sampling. Maybe the code never stepped into this case when using a dataset like COCO?
Thanks a lot!:)
This is abnormal, I guess... once bg=0 occurs, it keeps happening for all the following batches...
Hi @jwyang, @ChenXuelinCXL, and @sdlyaoyuan, I just posted in another thread that I have located the problem. In my case, it is in the annotation loading. I don't know why, but when the minimum values of an annotation (xmin and ymin) are close to 0, they get loaded as 65534 (just below the uint16 maximum), so when you work with the areas and compute xmax - xmin the value is negative. I solved it by checking the minimum values right after loading. I hope this helps you.
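The post-load check described above could look roughly like this (a minimal sketch, not code from the repo; the helper name and the 60000 threshold are my own illustrative choices):

```python
import numpy as np

def sanitize_boxes(boxes, width, height):
    """Clamp ground-truth boxes whose coordinates underflowed
    (e.g. a uint16 wrap-around turning -1 into 65535) back into the image.
    The function name, signature, and 60000 threshold are illustrative."""
    boxes = boxes.astype(np.int64)      # leave uint16 before doing arithmetic
    boxes[boxes > 60000] = 0            # wrapped negatives show up as huge values
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, width - 1)   # x1, x2
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, height - 1)  # y1, y2
    assert (boxes[:, 2] > boxes[:, 0]).all() and (boxes[:, 3] > boxes[:, 1]).all(), \
        "degenerate box even after sanitizing -- inspect the annotation"
    return boxes
```

Running such a check once over the whole roidb should flag (or repair) the wrapped coordinates before they ever reach the RPN.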
Hi @JavHaro, thanks for the help, I will check whether that is my case. :)
BTW, I am wondering whether anyone else's code entered these 'elif' branches during training?
@ChenXuelinCXL If you format your customized dataset like Pascal VOC, you can try making this change in pascal_voc.py (i.e. comment out the `- 1` offsets, which underflow when a coordinate is 0):

```python
for ix, obj in enumerate(objs):
    bbox = obj.find('bndbox')
    # Make pixel indexes 0-based
    x1 = float(bbox.find('xmin').text)  # - 1
    y1 = float(bbox.find('ymin').text)  # - 1
    x2 = float(bbox.find('xmax').text)  # - 1
    y2 = float(bbox.find('ymax').text)  # - 1
```
OK... after trying several times, I double-checked my code to make sure the bounding box coordinates are all correct. It is still producing NaN, and bg=0 still happens.
Hi @ChenXuelinCXL Where did you check the bounding box coordinates? I ask because, depending on the part of the code where you check them, the values will be different. Either way, you can check that xmin < xmax and ymin < ymax at any point, just to locate the problem. KR
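One cheap place to do that check is directly on the annotation files, before any loading code touches them. A minimal sketch (the helper name is illustrative, and it assumes Pascal-VOC-style XML):

```python
import xml.etree.ElementTree as ET

def check_voc_annotation(xml_path):
    """Return the invalid (x1, y1, x2, y2) boxes found in one
    Pascal-VOC-style annotation file."""
    bad = []
    root = ET.parse(xml_path).getroot()
    for obj in root.iter("object"):
        b = obj.find("bndbox")
        x1, y1 = float(b.find("xmin").text), float(b.find("ymin").text)
        x2, y2 = float(b.find("xmax").text), float(b.find("ymax").text)
        # a valid box needs non-negative corners and positive width/height
        if not (0 <= x1 < x2 and 0 <= y1 < y2):
            bad.append((x1, y1, x2, y2))
    return bad
```

Looping this over the annotation directory and printing the file names with a non-empty result quickly narrows down whether the data or the loader is at fault.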
@JavHaro I am still locating the bug. I am very sure that the bounding boxes from my data are correct. Now I got this:
I printed the rois output from the RPN, and this is what causes the number of bg samples to be 0: all the proposals from the RPN are almost zero-sized boxes, except for the first few weird ones.
I am trying to figure out what causes this. Note that I added a filter in the RPN that assigns a zero score to very small proposal boxes, but the RPN still outputs them, which means nearly all of its proposals are zero boxes!?
I am getting similar issues with the foreground bboxes. It is related to the RPN layer.
In my case, I got this error when a ground-truth bbox was wrong, like [-1, -1, 100, 200], where x1 and y1 are -1. The bbox representation in this code is uint16, so all the -1 values overflow (wrap around).
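The overflow is easy to reproduce in isolation; casting -1 to uint16 wraps around to 65535, so the box width computed from it becomes hugely negative:

```python
import numpy as np

# Minimal demonstration of the overflow described above: a "box" like
# [-1, -1, 100, 200] becomes [65535, 65535, 100, 200] under uint16.
boxes = np.array([[-1, -1, 100, 200]]).astype(np.uint16)
width = int(boxes[0, 2]) - int(boxes[0, 0])  # computed on signed ints
```

This is exactly the pattern behind the 65534/65535 coordinates reported earlier in the thread.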
@ChenXuelinCXL Hi, I met the same problem as you. Have you solved it?
@lolongcovas In my case you are right!
Have you tried reducing the learning rate? I sometimes have to set it as low as 0.00001.
Hi, I encountered the NaN issue with my faster_rcnn_resnet50 model. It turned out I was using the Adam optimizer, which drove the values to NaN. I switched back to SGD with momentum and weight decay (with which the original architecture was trained) and lowered my learning rate from 0.01 to 0.0001, and the results were better. Hope this helps.
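For reference, the optimizer swap described above looks roughly like this in PyTorch (a minimal sketch; the model is a stand-in and the exact hyperparameter values are illustrative, not the repo's settings):

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder for the detection network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-4,          # lowered from 1e-2
                            momentum=0.9,
                            weight_decay=5e-4)
```

SGD with momentum is what most Faster R-CNN training schedules were tuned for, so it is usually the safer default when loss stability is the problem.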