faster_rcnn_pytorch

Train new dataset: zeros after conv3 in vgg16

Open kduy opened this issue 7 years ago • 19 comments

I am trying to train the model on my own dataset. Sometimes I get this error:

  File "train.py", line 127, in <module>
    net(im_data, im_info, gt_boxes, gt_ishard, dontcare_areas)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code/faster_rcnn_pytorch/faster_rcnn/faster_rcnn.py", line 219, in forward
    roi_data = self.proposal_target_layer(rois, gt_boxes, gt_ishard, dontcare_areas, self.n_classes)
  File "/data/code/faster_rcnn_pytorch/faster_rcnn/faster_rcnn.py", line 287, in proposal_target_layer
    proposal_target_layer_py(rpn_rois, gt_boxes, gt_ishard, dontcare_areas, num_classes)
  File "/data/code/faster_rcnn_pytorch/faster_rcnn/rpn_msr/proposal_target_layer.py", line 66, in proposal_target_layer
    np.hstack((zeros, np.vstack((gt_easyboxes[:, :-1], jittered_gt_boxes[:, :-1]))))))
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/shape_base.py", line 234, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

I traced the bug and found that conv3 in faster_rcnn/vgg16.py outputs an all-zero array, so the feature map coming out of vgg16 is all zeros as well. Do you have any clue why? Thanks.

kduy avatar May 25 '17 20:05 kduy

Same problem. Any solution? Any help would be appreciated. @acgtyrant @kduy

abhiML avatar Jun 05 '17 10:06 abhiML

@longcw

abhiML avatar Jun 06 '17 08:06 abhiML

@abhiML I am refactoring the program and it's still ongoing, so I have not got it working so far.

acgtyrant avatar Jun 09 '17 11:06 acgtyrant

Do you load the pretrained npy for vgg16?

acgtyrant avatar Jun 26 '17 13:06 acgtyrant

Yeah. First it gives a runtime warning:

RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]

abhiML avatar Jun 26 '17 13:06 abhiML
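For readers who land on this warning: it typically means that ws or hs already contain NaN when the comparison runs, so the proposals were invalid before this line. The sketch below is a hypothetical standalone filter (filter_boxes and the sample boxes are assumptions for illustration, not the repository's code) that drops non-finite boxes explicitly instead of letting NaN flow into the comparison.

```python
import numpy as np


def filter_boxes(boxes, min_size):
    """Return indices of boxes whose width and height are finite and >= min_size.

    Hypothetical helper for illustration; boxes are rows of (x1, y1, x2, y2).
    """
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    finite = np.isfinite(ws) & np.isfinite(hs)
    # Comparisons against NaN are always False; silence the warning locally
    # because the non-finite rows are already excluded by the mask above.
    with np.errstate(invalid="ignore"):
        keep = np.where(finite & (ws >= min_size) & (hs >= min_size))[0]
    return keep


boxes = np.array(
    [[0, 0, 20, 20],       # valid, 21 x 21
     [np.nan, 0, 5, 5],    # NaN coordinate, e.g. from a diverged network
     [3, 3, 4, 4]],        # too small, 2 x 2
    dtype=np.float32,
)
keep = filter_boxes(boxes, min_size=16)
print(keep)
```

Filtering like this only hides the symptom. If every proposal is NaN, nothing is kept, so the source of the NaN still has to be found.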

Do you use Python 2? I do not encounter this error.

acgtyrant avatar Jun 26 '17 13:06 acgtyrant

Yeah I am using 2.7. You are running it on your own dataset?

abhiML avatar Jun 26 '17 13:06 abhiML

I ran it for a few steps on the PASCAL VOC 2007 trainval dataset with no problem. If you want to run it on a new dataset, you must adjust the source code yourself.

acgtyrant avatar Jun 26 '17 13:06 acgtyrant

Yeah, but what exactly do I have to adjust? I just changed the classes in pascal_voc.py and prepared the dataset in the Pascal VOC 2007 format.

abhiML avatar Jun 26 '17 13:06 abhiML

I have not trained the model on a new dataset yet, wait.

acgtyrant avatar Jun 26 '17 13:06 acgtyrant

Okay

abhiML avatar Jun 26 '17 14:06 abhiML

https://github.com/rbgirshick/py-faster-rcnn/issues/65 Could you take a look at this issue?

abhiML avatar Jun 26 '17 14:06 abhiML

@acgtyrant going by https://github.com/longcw/faster_rcnn_pytorch/blob/master/faster_rcnn/network.py#L109, as far as I understand, if totalnorm becomes very large, then norm gets really small and underflow occurs? Is that correct?

abhiML avatar Jun 27 '17 07:06 abhiML

No, it is used to prevent overflow.

acgtyrant avatar Jun 27 '17 07:06 acgtyrant
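To make the exchange above concrete, here is a minimal NumPy sketch of norm-based gradient clipping. It is an illustration of the general technique under assumed names (clip_gradients, grads, clip_norm), not a copy of the function at the linked line.

```python
import numpy as np


def clip_gradients(grads, clip_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is <= clip_norm.

    Illustrative sketch only; returns the rescaled gradients and the
    original total norm.
    """
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # If total_norm <= clip_norm the scale is 1 and nothing changes;
    # otherwise every gradient shrinks by the same factor.
    scale = clip_norm / max(total_norm, clip_norm)
    return [g * scale for g in grads], total_norm


grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, total = clip_gradients(grads, clip_norm=1.0)
print(total)    # joint norm before clipping
print(clipped)  # same direction, joint norm equal to clip_norm
```

A large total norm gives a small scale factor, but the rescaled gradients still have norm clip_norm, so the step stays bounded rather than vanishing. Clipping cannot repair a gradient that is already NaN or infinite, which may be why the error persists even with this function in place.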

But I am using that function and I am still getting the error.

abhiML avatar Jun 27 '17 07:06 abhiML

I had the issue described, and I now seem to be able to train without this error when using SGD; with Adam, the loss becomes NaN. I would suggest checking the gt_boxes values of any image that causes this error. For me, reading the XML files assigned some negative values, which were then transformed into huge numbers. Also, the PASCAL VOC loader applies a -1 to xmin and ymin, so if your bounding boxes start at 0 they become -1, and this caused issues as well. I fixed this in my _load_AFLW_annotation function by taking the absolute value and skipping the subtraction when a value equals 0. This may help.

gls81 avatar Jul 04 '17 18:07 gls81
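One plausible mechanism for negative values turning into huge numbers is an unsigned integer cast. The sketch below assumes the boxes end up in a uint16 array, as in the py-faster-rcnn pascal_voc.py loader; whether a given fork does the same is an assumption to check against its own loader.

```python
import numpy as np

# A 0-based annotation whose box touches the top-left corner of the image.
raw = np.array([[0, 0, 50, 60]], dtype=np.int64)

# VOC-style conversion from 1-based to 0-based pixel indices.
shifted = raw - 1

# Casting to an unsigned type wraps negative values around instead of failing.
as_uint16 = shifted.astype(np.uint16)
print(as_uint16)  # xmin and ymin are now huge

# One possible repair: clamp at zero before the cast.
clamped = np.maximum(shifted, 0).astype(np.uint16)
print(clamped)
```

Clamping keeps training from crashing, but it also shifts xmax and ymax by one pixel for data that was 0-based to begin with. Skipping the subtraction for such datasets is the cleaner fix.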

Yeah, I was making a similar mistake. Some of the annotations in the dataset were wrong (xmin > xmax). Once I corrected those and set the negative values to 0, it worked fine.

abhiML avatar Jul 05 '17 10:07 abhiML
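A quick way to catch both kinds of bad annotation mentioned above before training is to scan every box. The helper below is a hypothetical sketch (find_bad_boxes and the sample data are assumptions) that flags negative, inverted, out-of-image, or non-finite boxes for one image.

```python
import numpy as np


def find_bad_boxes(boxes, width, height):
    """Return the indices of boxes that are unusable for training.

    boxes: rows of (xmin, ymin, xmax, ymax) in 0-based pixel coordinates.
    Hypothetical helper for illustration.
    """
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    x1, y1, x2, y2 = boxes.T
    bad = (
        (x1 < 0) | (y1 < 0)                  # negative coordinates
        | (x2 >= width) | (y2 >= height)     # outside the image
        | (x1 > x2) | (y1 > y2)              # inverted corners
        | ~np.isfinite(boxes).all(axis=1)    # NaN or inf
    )
    return np.where(bad)[0]


boxes = [
    [10, 10, 50, 50],   # fine
    [60, 10, 40, 50],   # xmin > xmax
    [-1, 5, 20, 20],    # negative xmin
    [0, 0, 100, 30],    # xmax falls outside a 100-pixel-wide image
]
bad = find_bad_boxes(boxes, width=100, height=80)
print(bad)
```

Running this over the whole dataset and printing the image name next to each hit makes it easy to fix the annotation files themselves rather than patching values at load time.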

I have checked my annotations and they are correct, so does anyone know of any other bug that would lead to this problem?

liyuanyaun avatar Jul 14 '17 03:07 liyuanyaun

@liyuanyaun I have encountered this problem too. After disabling the shuffle operation in RoIDataLayer() and locating the image where the error occurs, I found that one of the bounding boxes has xmin=0, and the voc_pascal.py that I imitated applies a -1, so gt_boxes got a negative value. Here is a related issue: https://github.com/rbgirshick/py-faster-rcnn/issues/9 (you can search for 'based'). After removing the -1 and deleting the cached ground-truth .pkl file (needed if you created one before), the error is gone.

zhyx12 avatar Jul 28 '17 14:07 zhyx12
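The fix described above can be sketched as a loader that makes the index convention explicit. This is a hypothetical, simplified parser (load_boxes and its one_based flag are assumptions, not the repository's API) for VOC-style XML annotations.

```python
import xml.etree.ElementTree as ET


def load_boxes(xml_text, one_based=True):
    """Parse VOC-style XML and return 0-based [xmin, ymin, xmax, ymax] lists.

    one_based=True subtracts 1, as needed for genuine PASCAL VOC files.
    Use one_based=False for datasets whose coordinates already start at 0.
    Hypothetical helper for illustration.
    """
    offset = 1 if one_based else 0
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        box = [
            int(float(bndbox.find(tag).text)) - offset
            for tag in ("xmin", "ymin", "xmax", "ymax")
        ]
        boxes.append(box)
    return boxes


sample = """
<annotation>
  <object>
    <name>face</name>
    <bndbox><xmin>0</xmin><ymin>12</ymin><xmax>40</xmax><ymax>55</ymax></bndbox>
  </object>
</annotation>
"""

print(load_boxes(sample, one_based=True))   # xmin becomes negative
print(load_boxes(sample, one_based=False))  # coordinates kept as written
```

Because the ground-truth roidb is cached on disk, a change like this only takes effect after the old .pkl cache file has been deleted.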