
Training Loss: NaN

Open xiaoxingzeng opened this issue 6 years ago • 21 comments

My training loss always becomes NaN after a few hundred iterations. All parameters are the defaults. My training dataset works with py-faster-rcnn, and I copied it into the faster-rcnn.pytorch directory. My training command is: python trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --lr 0.001 --cuda

Is there any advice for this? Thanks.

xiaoxingzeng avatar Feb 23 '18 14:02 xiaoxingzeng

This is my training print-out:

[session 1][epoch 1][iter 0] loss: 6.3588, lr: 1.00e-03
    fg/bg=(20/236), time cost: 1.659386
    rpn_cls: 0.8163, rpn_box: 4.6926, rcnn_cls: 0.8488, rcnn_box 0.0010
[session 1][epoch 1][iter 100] loss: 1.0697, lr: 1.00e-03
    fg/bg=(24/232), time cost: 33.015444
    rpn_cls: 0.1404, rpn_box: 0.6121, rcnn_cls: 0.2425, rcnn_box 0.1688
[session 1][epoch 1][iter 200] loss: 0.7961, lr: 1.00e-03
    fg/bg=(43/213), time cost: 33.076333
    rpn_cls: 0.1488, rpn_box: 1.1833, rcnn_cls: 0.3630, rcnn_box 0.2185
[session 1][epoch 1][iter 300] loss: nan, lr: 1.00e-03
    fg/bg=(256/0), time cost: 33.628527
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 400] loss: nan, lr: 1.00e-03
    fg/bg=(256/0), time cost: 32.910808
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 500] loss: nan, lr: 1.00e-03
    fg/bg=(256/0), time cost: 32.843017
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
    fg/bg=(256/0), time cost: 32.721040
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 700] loss: nan, lr: 1.00e-03
    fg/bg=(256/0), time cost: 33.876777
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 800] loss: nan, lr: 1.00e-03
    fg/bg=(256/0), time cost: 33.819963
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan

xiaoxingzeng avatar Feb 23 '18 14:02 xiaoxingzeng

Hi @xiaoxingzeng, I have the same problem, but sadly I still have no answer. Are you using a custom dataset? Maybe some answers in this issue can help you. Good luck!

JavHaro avatar Mar 01 '18 12:03 JavHaro

@JavHaro @xiaoxingzeng I am having the same issue of nan values with custom dataset. Did you find the solution? Thanks

zeehasham avatar Mar 09 '18 23:03 zeehasham

I have the same problem with my own dataset. Did you find the solution? Thanks a lot!

[session 1][epoch 1][iter 5400/33021] loss: 0.3792, lr: 1.00e-03
    fg/bg=(113/399), time cost: 129.585304
    rpn_cls: 0.0753, rpn_box: 0.2207, rcnn_cls: 0.1785, rcnn_box 0.2806
[session 1][epoch 1][iter 5500/33021] loss: 0.3538, lr: 1.00e-03
    fg/bg=(42/470), time cost: 129.476077
    rpn_cls: 0.0916, rpn_box: 0.1274, rcnn_cls: 0.0800, rcnn_box 0.1175
[session 1][epoch 1][iter 5600/33021] loss: nan, lr: 1.00e-03
    fg/bg=(512/0), time cost: 122.264345
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 5700/33021] loss: nan, lr: 1.00e-03
    fg/bg=(512/0), time cost: 119.232911
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan

yuanyao366 avatar Mar 14 '18 00:03 yuanyao366

Hi all, if possible it would be good to share your data loaders with us, so that we can check why this happened.

jwyang avatar Mar 14 '18 18:03 jwyang

@jwyang I use the Caltech dataset for pedestrian detection. During training, I print num_boxes and gt_boxes right after the loss print-out, like this:

    print("[session %d][epoch %2d][iter %4d] loss: %.4f, lr: %.2e"
          % (args.session, epoch, step, loss_temp, lr))
    print("\t\t\tfg/bg=(%d/%d), time cost: %f" % (fg_cnt, bg_cnt, end - start))
    print("\t\t\trpn_cls: %.4f, rpn_box: %.4f, rcnn_cls: %.4f, rcnn_box %.4f"
          % (loss_rpn_cls, loss_rpn_box, loss_rcnn_cls, loss_rcnn_box))

    gt_boxes_cpu = gt_boxes.cpu().data.numpy()
    im_data_cpu = im_data.cpu().data
    for i in range(args.batch_size):
        num_gt = num_boxes.data[i]
        print("the %dth image have num_boxes: %d and the gt_boxes are:" % (i, num_gt))
        print(gt_boxes_cpu[i][:num_gt])
        img = im_data_cpu[i].permute(1, 2, 0).numpy()
        # _vis_minibatch(img, gt_boxes_cpu[i][:num_gt])

I even visualize the image and the gt_boxes like this:

    def _vis_minibatch(im_blob, rois_blob):
        """Visualize a mini-batch for debugging."""
        import matplotlib.pyplot as plt
        for i in xrange(rois_blob.shape[0]):
            rois = rois_blob[i, :]
            # im_ind = rois[0]
            roi = rois[:4]
            im = im_blob[:, :, :].copy()
            im += cfg.PIXEL_MEANS
            im = im[:, :, (2, 1, 0)]
            im = im.astype(np.uint8)
            plt.imshow(im)
            plt.gca().add_patch(
                plt.Rectangle((roi[0], roi[1]), roi[2] - roi[0],
                              roi[3] - roi[1], fill=False,
                              edgecolor='r', linewidth=3))
            plt.show()

yuanyao366 avatar Mar 15 '18 07:03 yuanyao366

My training command is: python trainval_net.py --dataset caltech --net vgg16 --bs 2 --gpu 0 --cuda

This is my training print-out:

[session 1][epoch 1][iter 100] loss: 0.6892, lr: 1.00e-03
    fg/bg=(19/493), time cost: 125.878273
    rpn_cls: 0.2557, rpn_box: 0.4797, rcnn_cls: 0.1704, rcnn_box 0.0634
the 0th image have num_boxes: 3 and the gt_boxes are:
[[ 465. 215. 496.25 271.25 1. ]
 [ 536.25 206.25 561.25 255. 1. ]
 [ 168.75 216.25 218.75 303.75 1. ]]
the 1th image have num_boxes: 3 and the gt_boxes are:
[[ 548.75 210. 550. 238.75 1. ]
 [ 175. 216.25 227.5 303.75 1. ]
 [ 465. 215. 496.25 271.25 1. ]]
[session 1][epoch 1][iter 200] loss: 0.5355, lr: 1.00e-03
    fg/bg=(20/492), time cost: 124.361140
    rpn_cls: 0.2848, rpn_box: 0.4464, rcnn_cls: 0.1204, rcnn_box 0.0524
the 0th image have num_boxes: 6 and the gt_boxes are:
[[ 388.75 207.5 402.5 253.75 1. ]
 [ 511.25 193.75 523.75 232.5 1. ]
 [ 327.5 195. 345. 242.5 1. ]
 [ 493.75 193.75 512.5 238.75 1. ]
 [ 407.5 195. 425. 252.5 1. ]
 [ 210. 190. 236.25 253.75 1. ]]
the 1th image have num_boxes: 5 and the gt_boxes are:
[[ 450. 200. 451.25 251.25 1. ]
 [ 506.25 206.25 523.75 250. 1. ]
 [ 258.75 202.5 273.75 250. 1. ]
 [ 497.5 201.25 512.5 246.25 1. ]
 [ 422.5 211.25 435. 252.5 1. ]]
[session 1][epoch 1][iter 300] loss: 0.4500, lr: 1.00e-03
    fg/bg=(4/508), time cost: 124.083962
    rpn_cls: 0.1161, rpn_box: 0.0771, rcnn_cls: 0.0267, rcnn_box 0.0117
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 720. 243.75 743.75 303.75 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 720. 243.75 743.75 303.75 1. ]]
[session 1][epoch 1][iter 400] loss: 0.4334, lr: 1.00e-03
    fg/bg=(3/509), time cost: 123.896316
    rpn_cls: 0.2213, rpn_box: 0.0603, rcnn_cls: 0.0627, rcnn_box 0.0038
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 22.5 225. 45. 261.25 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 28.75 230. 37.5 257.5 1. ]]
[session 1][epoch 1][iter 500] loss: 0.4303, lr: 1.00e-03
    fg/bg=(19/493), time cost: 124.256176
    rpn_cls: 0.1117, rpn_box: 0.2408, rcnn_cls: 0.0595, rcnn_box 0.0675
the 0th image have num_boxes: 3 and the gt_boxes are:
[[ 178.75 205. 208.75 277.5 1. ]
 [ 541.25 217.5 558.75 257.5 1. ]
 [ 595. 210. 618.75 268.75 1. ]]
the 1th image have num_boxes: 3 and the gt_boxes are:
[[ 183.75 206.25 213.75 277.5 1. ]
 [ 592.5 208.75 616.25 267.5 1. ]
 [ 541.25 217.5 558.75 257.5 1. ]]
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
    fg/bg=(512/0), time cost: 124.420586
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 408.75 225. 427.5 247.5 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 380. 233.75 398.75 256.25 1. ]]
[session 1][epoch 1][iter 700] loss: nan, lr: 1.00e-03
    fg/bg=(512/0), time cost: 124.281587
    rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 610. 176.25 638.75 248.75 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 606.25 176.25 635. 248.75 1. ]]

I think the im_data and gt_boxes are correct before they are sent into the net, but during training the fg/bg ratio is abnormal: the number of fg samples is too small, until fg/bg becomes 512/0. I have been confused by this problem for a week and I hope I can get some useful advice as soon as possible. Thanks a lot!
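
One way to catch the exact batch where this starts is an early-stop guard in the training loop. This is only a debugging sketch, assuming the loop variables loss, step, num_boxes and gt_boxes from the snippets above:

    import math

    loss_value = float(loss.data)  # use loss.item() on newer PyTorch versions
    if math.isnan(loss_value) or math.isinf(loss_value):
        print("loss became non-finite at step %d" % step)
        print("num_boxes:", num_boxes.cpu().data.numpy())
        print("gt_boxes:", gt_boxes.cpu().data.numpy())
        raise RuntimeError("training diverged; inspect the batch printed above")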

yuanyao366 avatar Mar 15 '18 08:03 yuanyao366

Hi @sdlyaoyuan, I will take a look at this problem and try to solve it as soon as possible.

jwyang avatar Mar 15 '18 15:03 jwyang

Hi @jwyang, I had the same issue these days too when I tried to train on my customized dataset. The same thing (NaN) happened when fg/bg became 1024/0. It seems like the NaN occurs when there are no bg samples during ROI sampling. Maybe the code never stepped into this case when using datasets like COCO?

Thanks a lot!:)
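
A cheap way to confirm that hypothesis (just a sketch, reusing the fg_cnt/bg_cnt values the training loop already computes for its print-out) is to log the batch as soon as the sampler returns no background ROIs:

    if fg_cnt == 0 or bg_cnt == 0:
        print("[warn] degenerate sampling at step %d: fg/bg=(%d/%d)" % (step, fg_cnt, bg_cnt))
        print("gt_boxes:", gt_boxes.cpu().data.numpy())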

xuelin-chen avatar Mar 19 '18 03:03 xuelin-chen

[attached image] This is abnormal, I guess... once bg=0 occurs, it keeps happening for all the following batches...

xuelin-chen avatar Mar 19 '18 05:03 xuelin-chen

Hi @jwyang, @ChenXuelinCXL, and @sdlyaoyuan, I just posted in another thread that I have located the problem. In my case, the problem is in the annotation loading. I don't know why, but when you load the minimum values of the annotation (xmin & ymin), if they are close to 0 they are loaded as 65534 (the maximum value minus 2), so when you work with the areas and calculate xmax - xmin the value is negative. I solved it by checking the minimum values after loading. I hope this helps you.
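
A minimal check along those lines (a hypothetical helper, not code from this repo) can be run right after parsing each annotation, so an underflowed xmin/ymin is caught before it ever reaches the network:

    import numpy as np

    def check_boxes(boxes, im_width, im_height):
        """Assert that every [x1, y1, x2, y2] box is well-formed and inside the image."""
        boxes = np.asarray(boxes, dtype=np.float32)
        assert (boxes[:, 0] < boxes[:, 2]).all(), "found xmin >= xmax"
        assert (boxes[:, 1] < boxes[:, 3]).all(), "found ymin >= ymax"
        assert (boxes[:, :4] >= 0).all(), "found a negative coordinate"
        assert (boxes[:, 2] <= im_width).all() and (boxes[:, 3] <= im_height).all(), \
            "found a box outside the image"
        return boxes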

JavHaro avatar Mar 19 '18 08:03 JavHaro

Hi @JavHaro , thanks for the help, I will see if that is my case.:)

BTW, I am wondering whether anyone else's code entered these 'elif' branches during training? [attached image]

xuelin-chen avatar Mar 19 '18 08:03 xuelin-chen

@ChenXuelinCXL If you build your customized dataset in the Pascal VOC format, you can try making this change in pascal_voc.py (dropping the "- 1" avoids producing a negative coordinate when xmin or ymin is already 0):

    for ix, obj in enumerate(objs):
        bbox = obj.find('bndbox')
        # Make pixel indexes 0-based
        x1 = float(bbox.find('xmin').text)  # - 1
        y1 = float(bbox.find('ymin').text)  # - 1
        x2 = float(bbox.find('xmax').text)  # - 1
        y2 = float(bbox.find('ymax').text)  # - 1
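
An alternative that keeps the 0-based pixel convention is to clamp instead of dropping the offset entirely (a hypothetical variant, not code from the repo):

    # Clamp the 0-based minimum coordinates at 0 so xmin = 0 never becomes -1.
    x1 = max(float(bbox.find('xmin').text) - 1, 0)
    y1 = max(float(bbox.find('ymin').text) - 1, 0)
    x2 = float(bbox.find('xmax').text) - 1
    y2 = float(bbox.find('ymax').text) - 1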

yuanyao366 avatar Mar 19 '18 09:03 yuanyao366

OK... after trying several times, I double-checked my code to make sure the bounding box coordinates are all correct. It is still producing NaN, and bg=0 still happens.

xuelin-chen avatar Mar 19 '18 15:03 xuelin-chen

Hi @ChenXuelinCXL, where did you check the bounding box coordinates? I ask because, depending on where in the code you check them, the value to look at will be different. Anyway, you can check that xmin < xmax and ymin < ymax at any point, just to locate the problem. KR

JavHaro avatar Mar 20 '18 11:03 JavHaro

@JavHaro I am still locating the bug. I am very sure that the bounding boxes from my data are correct. Now I got this: [attached image] I printed the rois output from the RPN; this is what causes the number of bg to be 0: all proposals from the RPN are almost zero boxes, except for the first few weird boxes.

I am trying to figure out what causes this. Note that I added a filter to assign zero scores to those very small proposal boxes in the RPN, but the RPN still outputs them, which means all proposals from the RPN are almost zero boxes!?

xuelin-chen avatar Mar 20 '18 13:03 xuelin-chen

I am getting similar issues regarding the foreground bboxes. It is related to the RPN layer.

In my case, I got this error when a GT bbox was wrong, e.g. [-1, -1, 100, 200], where x and y are -1. The bbox representation in this code is uint16, so all the -1 values overflow.
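
A quick standalone snippet (not code from the repo) shows the wraparound:

    import numpy as np

    # A box with negative x/y, as it might come from a bad annotation.
    coords = np.array([-1, -1, 100, 200], dtype=np.int32)
    # Casting to the uint16 dtype used for the box arrays wraps the -1 values.
    print(coords.astype(np.uint16))  # -> [65535 65535   100   200]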

lolongcovas avatar Mar 24 '18 17:03 lolongcovas

Hi @ChenXuelinCXL, I met the same problem as you. Have you solved it?

Tristacheng avatar Apr 21 '18 05:04 Tristacheng

@lolongcovas In my case, you are right!

underfitting avatar May 04 '18 09:05 underfitting

Have you tried reducing the learning rate? I sometimes have to set it as low as 0.00001.
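
For example, with the command-line flags used earlier in this thread, that would be something like:

    python trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --lr 0.00001 --cuda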

ahmed-shariff avatar May 04 '18 10:05 ahmed-shariff

Hi, I encountered the NaN issue with my faster_rcnn_resnet50 model. It turns out I was using the Adam optimizer, which led to the values going to NaN. I changed it back to SGD with momentum and weight decay (with which the original architecture was trained) and lowered my learning rate from 0.01 to 0.0001, and the results were better. Hope this helps.
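
For reference, a minimal sketch of that optimizer setup, assuming the model object is called fasterRCNN as in trainval_net.py; the momentum and weight-decay values here are typical defaults, not necessarily the exact ones used above:

    import torch.optim as optim

    optimizer = optim.SGD(fasterRCNN.parameters(),
                          lr=0.0001,          # lowered from 0.01, as described above
                          momentum=0.9,       # typical value; an assumption here
                          weight_decay=5e-4)  # typical value; an assumption here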

AkshayLaddha943 avatar Mar 20 '24 11:03 AkshayLaddha943