
Training suddenly terminates after the first epoch. Looking for help, please

Open · KevinQian97 opened this issue 6 years ago • 5 comments

Here is my traceback:

[session 1][epoch 1][iter 0] loss: 4.0006, lr: 1.00e-02
            fg/bg=(128/384), time cost: 7.218862
            rpn_cls: 0.6919, rpn_box: 0.1386, rcnn_cls: 2.8319, rcnn_box 0.3382
Traceback (most recent call last):
  File "trainval_net.py", line 330, in <module>
    roi_labels = FPN(im_data, im_info, gt_boxes, num_boxes)
  File "/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 73, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 83, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
RuntimeError: invalid argument 2: Input tensor must have same size as output tensor apart from the specified dimension at /opt/conda/conda-bld/pytorch_1518238409320/work/torch/lib/THC/generic/THCTensorScatterGather.cu:29

KevinQian97 · Sep 02 '18 21:09

I found that the code runs fine with faster-rcnn, but it fails with the fpn code, so I suspect the problem is in fpn.py, though I still can't see why. Also, I am training this model on my own data; if I switch back to the original VOC2007 data, it works. That's strange, since I converted my data into the VOC2007 format. Here is one of my annotation files (header fields first, then one object per line: class name, a flag, and the box coordinates):

train VIRAT_S_000000.mp4_0
C:/Users/Kevin Qian/Downloads/images/train/VIRAT_S_000000.mp4_0.jpg
Unknown 1920 1080 3 0
Other  0  636 723  655 787
Other  0  411 618  438 703
Person 0  349 709  410 850
Other  0  760 758  778 831
Person 0 1386 245 1432 354
Person 0  276 688  345 845
Other  0  512 687  541 747

And here is an annotation file from the original VOC2007 (same layout, with pose and truncated fields per object):

VOC2007 009962.jpg
The VOC2007 Database PASCAL VOC2007 flickr 246788553
Tool - Wroclaw  Milosz J.
500 375 3 0
chair       Right       1 0 211 192 324 326
person      Unspecified 1 0 162  72 273 248
person      Right       1 0 250  68 473 312
person      Right       1 0   4   2 253 374
diningtable Unspecified 1 1 358 216 500 375
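Since the crash only shows up with converted data, it is worth sanity-checking the boxes before training. Below is a minimal sketch of such a check, assuming VOC-style XML annotations sitting in an Annotations/ directory; the directory name and tag names are the standard VOC ones, not anything specific to this repo:

import glob
import xml.etree.ElementTree as ET

# Flag annotations whose boxes are degenerate (non-positive width or
# height) or fall outside the image -- the kind of dirty data that can
# poison downstream computations during training.
for path in glob.glob("Annotations/*.xml"):
    root = ET.parse(path).getroot()
    size = root.find("size")
    img_w = int(size.find("width").text)
    img_h = int(size.find("height").text)
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = [
            float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        ]
        if xmax <= xmin or ymax <= ymin:
            print("{}: degenerate box ({}, {}, {}, {})".format(
                path, xmin, ymin, xmax, ymax))
        if xmin < 0 or ymin < 0 or xmax > img_w or ymax > img_h:
            print("{}: box outside the {}x{} image".format(path, img_w, img_h))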

KevinQian97 · Sep 06 '18 15:09

@KevinQian97 I have encountered the same problem. Have you found out how to solve it?

WangTianYuan · Sep 26 '18 04:09

@KevinQian97 @WangTianYuan Did you solve this issue?

Karthik-Suresh93 · Nov 25 '18 01:11

Have you solved the problem? I got the same error. @KevinQian97 @WangTianYuan

krushi1992 · Apr 12 '19 13:04

> Have you solved the problem? I got the same error. @KevinQian97 @WangTianYuan

I found that if you train the model on your own dataset and it contains dirty data, you get NaN values in roi_level in fpn.py. You can try the following modification. Change

roi_level[roi_level < 2] = 2
roi_level[roi_level > 5] = 5

to

roi_level[roi_level < 2] = 2
roi_level[roi_level > 5] = 5
roi_level[roi_level != roi_level] = 5

(NaN != NaN evaluates to true, so the added line reassigns any NaN entries to level 5.)
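For context, here is a minimal sketch of what that level assignment typically looks like with the NaN guard in place. The function name and exact expressions are illustrative, based on the FPN paper's assignment k = k0 + log2(sqrt(w*h)/224), not copied from this repo:

import math
import torch

def assign_pyramid_levels(rois, k0=4):
    # rois: (N, 5) tensor of (batch_idx, x1, y1, x2, y2).
    w = rois[:, 3] - rois[:, 1]
    h = rois[:, 4] - rois[:, 2]
    # A dirty box (e.g. x2 < x1) can give a negative area, whose sqrt
    # is NaN. NaN fails every comparison, so the two clamps below leave
    # it untouched; the RoI then belongs to no pyramid level, the
    # per-level RoI counts stop matching, and you get the
    # scatter/gather size-mismatch error above.
    roi_level = torch.log(torch.sqrt(w * h) / 224.0) / math.log(2)
    roi_level = torch.round(roi_level + k0)
    roi_level[roi_level < 2] = 2
    roi_level[roi_level > 5] = 5
    roi_level[roi_level != roi_level] = 5  # NaN != NaN, so this catches them
    return roi_level

Clamping NaNs to a level keeps training alive, but the cleaner fix is to remove or repair the offending boxes in the dataset itself, e.g. with a check like the one sketched earlier in this thread.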

Complicateddd · Jul 30 '20 13:07