faster_rcnn_pytorch Train new dataset: zeros after conv3 in vgg16

I am trying to train the model with my own dataset. Sometimes I get this error:

  File "train.py", line 127, in <module>
    net(im_data, im_info, gt_boxes, gt_ishard, dontcare_areas)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code/faster_rcnn_pytorch/faster_rcnn/faster_rcnn.py", line 219, in forward
    roi_data = self.proposal_target_layer(rois, gt_boxes, gt_ishard, dontcare_areas, self.n_classes)
  File "/data/code/faster_rcnn_pytorch/faster_rcnn/faster_rcnn.py", line 287, in proposal_target_layer
    proposal_target_layer_py(rpn_rois, gt_boxes, gt_ishard, dontcare_areas, num_classes)
  File "/data/code/faster_rcnn_pytorch/faster_rcnn/rpn_msr/proposal_target_layer.py", line 66, in proposal_target_layer
    np.hstack((zeros, np.vstack((gt_easyboxes[:, :-1], jittered_gt_boxes[:, :-1]))))))
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/shape_base.py", line 234, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

I traced the bug and found that the network outputs an all-zeros array after conv3 in faster_rcnn/vgg16.py, and hence returns an all-zeros feature map after the forward pass through vgg16. Do you have any clue why? Thanks.

May 25 '17 20:05 kduy

Same problem. Any solution? Any help would be appreciated. @acgtyrant @kduy

Jun 05 '17 10:06 abhiML

@longcw

Jun 06 '17 08:06 abhiML

@abhiML I am refactoring the program, and it's still ongoing, so I have not got it working so far.

Jun 09 '17 11:06 acgtyrant

Do you load the pretrained npy weights for vgg16?

Jun 26 '17 13:06 acgtyrant

Yeah first it gives a runtime warning:

RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
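That warning usually means NaN values are already present in `ws` or `hs` by the time the comparison runs. A minimal sketch for confirming this, assuming NumPy arrays shaped like the ones in the warning (`count_non_finite` and the sample values are hypothetical, not part of the repo):

```python
import numpy as np

def count_non_finite(arr):
    """Return how many entries of arr are NaN or +/-inf."""
    arr = np.asarray(arr, dtype=np.float64)
    return int((~np.isfinite(arr)).sum())

# Hypothetical proposal widths/heights; one width is already NaN.
ws = np.array([12.0, np.nan, 3.0])
hs = np.array([10.0, 8.0, 9.0])
min_size = 4.0

# NaN compares as False, so the NaN proposal is silently dropped here.
with np.errstate(invalid="ignore"):
    keep = np.where((ws >= min_size) & (hs >= min_size))[0]

print(count_non_finite(ws), keep)
```

If the count is non-zero, the NaNs come from upstream (network outputs or the ground-truth boxes), not from the filtering line itself.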

Jun 26 '17 13:06 abhiML

Do you use Python 2? I do not encounter this error.

Jun 26 '17 13:06 acgtyrant

Yeah I am using 2.7. You are running it on your own dataset?

Jun 26 '17 13:06 abhiML

I ran it for a few steps on the PASCAL VOC 2007 trainval dataset with no problem. If you want to run it on a new dataset, you must adjust the source code yourself.

Jun 26 '17 13:06 acgtyrant

Yeah but what all do I have to adjust? I just changed the classes in pascal_voc.py and prepared the dataset according to the Pascal VOC 2007 set.

Jun 26 '17 13:06 abhiML

I have not trained the model on a new dataset yet, wait.

Jun 26 '17 13:06 acgtyrant

Okay

Jun 26 '17 14:06 abhiML

https://github.com/rbgirshick/py-faster-rcnn/issues/65 Could you take a look at this issue ?

Jun 26 '17 14:06 abhiML

@acgtyrant Going by https://github.com/longcw/faster_rcnn_pytorch/blob/master/faster_rcnn/network.py#L109, as far as I understand, if the totalnorm becomes very large, then the norm gets really small and underflow occurs? Is that correct?

Jun 27 '17 07:06 abhiML

No, it is used to prevent overflow.
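For reference, a generic sketch of clipping by global norm, written in NumPy rather than copied from network.py (`clip_by_global_norm` is a hypothetical name, and the repo's function may differ in detail). Large gradients are scaled down to `clip_norm`, small ones are left alone:

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # The scale is 1 when total_norm <= clip_norm, and clip_norm / total_norm otherwise.
    scale = clip_norm / max(total_norm, clip_norm)
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([np.array([3.0, 4.0])], clip_norm=1.0)
print(norm, clipped[0])
```

Note that clipping only rescales: if a gradient is already NaN, it stays NaN after clipping, so this cannot repair bad inputs.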

Jun 27 '17 07:06 acgtyrant

But I am using that function. Still I am getting the error.

Jun 27 '17 07:06 abhiML

I had the issue described, and I now seem to be able to train without this error when using SGD (with Adam, the loss becomes NaN). I would suggest you check the values in gt_boxes for any image that causes this error. For me, reading the XML files assigned some negative values, which were being transformed into huge numbers. Also, the PASCAL VOC loader subtracts 1 from XMIN and YMIN, so if your bounding boxes start at 0 they will be set to -1, and this caused issues as well. I fixed this in my _load_AFLW_annotation function by taking the absolute value and skipping the subtraction when a value is equal to 0. This may help.
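A minimal sketch of both points, assuming the boxes end up in an unsigned integer array (np.uint16 is used here as an assumption about the loader; `sanitize_boxes` is a hypothetical helper, not a function from the repo). It shows how -1 can turn into a huge number, and one way to clamp boxes and drop degenerate ones before training:

```python
import numpy as np

# Assumption: coordinates are stored as unsigned 16-bit integers.
raw = np.array([0, 12, 40, 60], dtype=np.int64) - 1   # xmin = 0 becomes -1
wrapped = raw.astype(np.uint16)                        # -1 wraps around to 65535

def sanitize_boxes(boxes, width, height):
    """Clamp (xmin, ymin, xmax, ymax) boxes to the image and drop empty ones.

    Returns the cleaned boxes and a boolean mask of the rows that were kept.
    """
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4).copy()
    boxes[:, [0, 2]] = np.clip(boxes[:, [0, 2]], 0, width - 1)
    boxes[:, [1, 3]] = np.clip(boxes[:, [1, 3]], 0, height - 1)
    keep = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
    return boxes[keep], keep

clean, keep = sanitize_boxes([[-1, 5, 10, 20], [30, 30, 20, 40]], width=100, height=100)
print(wrapped[0], clean, keep)
```

Clamping is used here instead of taking the absolute value, since it keeps the box inside the image; either choice is a judgment call for your dataset.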

Jul 04 '17 18:07 gls81

Yeah, I was making a similar mistake. In the dataset, some of the annotations were wrong (xmin > xmax). Once I corrected those and set the negative values to 0, it worked fine.
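A rough sketch of how such annotations could be found automatically, assuming VOC-style XML with object/bndbox/xmin... tags (`find_bad_boxes`, `min_coord`, and the sample annotation are hypothetical):

```python
import xml.etree.ElementTree as ET

def find_bad_boxes(xml_text, min_coord=0):
    """Return (name, xmin, ymin, xmax, ymax) for each suspicious object.

    Pass min_coord=1 if your loader subtracts 1 from every coordinate.
    """
    root = ET.fromstring(xml_text)
    bad = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(bb.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        if xmin < min_coord or ymin < min_coord or xmin >= xmax or ymin >= ymax:
            bad.append((obj.find("name").text, xmin, ymin, xmax, ymax))
    return bad

SAMPLE = """<annotation>
  <object><name>face</name><bndbox><xmin>50</xmin><ymin>10</ymin><xmax>20</xmax><ymax>40</ymax></bndbox></object>
  <object><name>face</name><bndbox><xmin>5</xmin><ymin>5</ymin><xmax>30</xmax><ymax>30</ymax></bndbox></object>
</annotation>"""

print(find_bad_boxes(SAMPLE))
```

Running this over every annotation file before training should flag inverted or out-of-range boxes without waiting for the crash.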

Jul 05 '17 10:07 abhiML

I have checked my annotations and they are correct for my experiment, so does anyone know of any other bug that would lead to this problem?

Jul 14 '17 03:07 liyuanyaun

@liyuanyaun I have encountered this problem too. After disabling the shuffle operation in RoIDataLayer() and locating the image where the error occurs, I found that one of the bounding boxes had xmin=0, and pascal_voc.py, which I imitated, subtracts 1, so gt_boxes got a negative value. Here is a related issue: https://github.com/rbgirshick/py-faster-rcnn/issues/9 (you can search for 'based'). After removing the -1 and deleting the cached ground-truth .pkl file (needed if you created one before), the error was gone.

Jul 28 '17 14:07 zhyx12
