cascade-rcnn

how to train it on my own dataset

Open derek-zr opened this issue 7 years ago • 48 comments

Hi! I want to train cascade-rcnn on my own dataset (three classes), but I don't know how to modify the files (e.g. examples/voc/). Can you give me some instructions? Thank you!

derek-zr avatar Apr 11 '18 03:04 derek-zr

Hi, training models without FPN, such as res50-12s-600-rfcn-cascade, on my own dataset works fine. But when I try to train res50-15s-800-fpn-cascade on my own dataset, decode_bbox_layer cannot get any valid bbox: after the code commented "screen out high IoU boxes, to remove redundant gt boxes", the number of valid_bbox_ids is 0. What might the problem be? Thanks. @zhaoweicai

makefile avatar Apr 26 '18 04:04 makefile

@makefile If you don't want to remove the redundant gt boxes, you can simply set gt_iou_thr=1.0 or higher. But a more important problem is that you might not have enough proposals. In your error case there are only gt boxes and no negative boxes. You can try lowering the proposal threshold in the "BoxGroupOutput" layer to get more proposals. It is also possible that your training is diverging and crashed, in which case you can try a lower learning rate.

zhaoweicai avatar Apr 26 '18 19:04 zhaoweicai
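
To illustrate the point above about the proposal threshold, here is a minimal Python sketch. It is not the repository's BoxGroupOutput implementation: the function name `keep_proposals` and the sample scores are hypothetical, and only the parameter name `fg_thr` comes from this thread. It shows how a score cut-off that is too high for the current model can leave no proposals at all.

```python
def keep_proposals(scores, fg_thr):
    """Return indices of proposals whose foreground score reaches fg_thr.

    Hypothetical helper for illustration; the real layer's logic may differ.
    """
    return [i for i, s in enumerate(scores) if s >= fg_thr]


# Made-up scores, as an early or poorly converged model might produce.
scores = [0.004, 0.02, 0.0005, 0.3]

strict = keep_proposals(scores, fg_thr=0.5)   # nothing survives the cut-off
loose = keep_proposals(scores, fg_thr=0.001)  # most proposals survive
print(len(strict), len(loose))
```

If the stricter setting returns an empty list on real scores, the later stages would see only gt boxes, which matches the symptom described.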

@zhaoweicai Thanks! Following your advice, I lowered fg_thr in the BoxGroupOutput layer and the problem disappeared.

makefile avatar Apr 27 '18 04:04 makefile

@zhaoweicai @makefile I am trying to train cascade rcnn on my own dataset and I got this problem. I tried lowering iou_thr in the "BoxGroupOutput" layer, but the problem is still there. Can you give me any suggestions? [image: wenti]

Peng-wei-Yu avatar Jun 02 '18 12:06 Peng-wei-Yu

The error seems related to multiple GPUs. With a single GPU, training proceeds (though not with every GPU id: GPU id 1 is fine, but GPU id 2 hits the same error as above). With 2 GPUs, I hit the same error.

jwnsu avatar Jun 03 '18 01:06 jwnsu

@Peng-wei-Yu Try lowering the fg_thr score instead of the NMS threshold.

makefile avatar Jun 03 '18 06:06 makefile

@jwnsu @makefile Thank you for your help. I tried lowering fg_thr and using only GPU 1, but the problem is still there. Have you tried changing the --weights in train_detection? I have decided to change the caffemodel and give it a try.

Peng-wei-Yu avatar Jun 03 '18 13:06 Peng-wei-Yu

FYI, the coco model seems to work fine (e.g. coco/res50-15s-800-fpn-cascade is fine; res101 runs out of GPU memory on a 1080 Ti). I suggest you switch from the voc flavor to the coco one.

jwnsu avatar Jun 03 '18 16:06 jwnsu

@Peng-wei-Yu when you change the number of GPUs, you should change the learning rate at the same time, as described in the paper.

zhaoweicai avatar Jun 03 '18 19:06 zhaoweicai
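
One common heuristic for this kind of adjustment is the linear scaling rule, where the learning rate changes in proportion to the number of GPUs (and hence the effective batch size). The sketch below assumes that rule applies here; the function name and the example numbers are hypothetical and are not taken from the paper or from the repository's solver files.

```python
def scale_learning_rate(base_lr, base_gpus, new_gpus):
    """Scale base_lr linearly with the GPU count (linear scaling heuristic)."""
    if base_gpus <= 0 or new_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    return base_lr * new_gpus / base_gpus


# Hypothetical example: a schedule tuned for 4 GPUs, rerun on a single GPU.
single_gpu_lr = scale_learning_rate(0.004, base_gpus=4, new_gpus=1)
print(single_gpu_lr)
```

The paper's own recommendation should take precedence over this heuristic wherever the two differ.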

@jwnsu The code should have no problem with multi-GPU training or the VOC dataset. Try running the script a couple of times to see if the problem still happens. If it does, try lowering the learning rate a little. If that still does not fix it, maybe there is something wrong.

zhaoweicai avatar Jun 03 '18 19:06 zhaoweicai

@makefile @zhaoweicai When you trained cascade rcnn on your own data, which caffemodel did you use: your own caffemodel or ResNet-50-model-merge.caffemodel? The pictures in my own data are 1600*1200 in size; should I change short_size and long_size in train.prototxt?

Peng-wei-Yu avatar Jun 04 '18 08:06 Peng-wei-Yu

@Peng-wei-Yu If you use the author's prototxt, you should use the corresponding ResNet-50-model-merge.caffemodel, since it merges the BN layers into the scale layers to reduce memory use and speed things up. You can increase the input image size if you have enough memory, but the results may not improve much.

makefile avatar Jun 04 '18 09:06 makefile
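
The merge mentioned above relies on the fact that batch normalization with fixed statistics is an affine map, so it can be folded into a single per-channel scale and shift. Below is a minimal sketch of that arithmetic for one channel, assuming frozen statistics. The function name and the numbers are illustrative; this is not the tool that produced ResNet-50-model-merge.caffemodel.

```python
import math


def fold_bn_into_scale(mean, var, gamma, beta, eps=1e-5):
    """Fold y = gamma * (x - mean) / sqrt(var + eps) + beta into y = a * x + b."""
    a = gamma / math.sqrt(var + eps)
    b = beta - a * mean
    return a, b


# Check on one made-up channel that both forms give the same output.
mean, var, gamma, beta = 0.5, 4.0, 2.0, 0.1
a, b = fold_bn_into_scale(mean, var, gamma, beta)
x = 3.0
bn_out = gamma * (x - mean) / math.sqrt(var + 1e-5) + beta
folded_out = a * x + b
print(abs(folded_out - bn_out) < 1e-9)
```

Because the folded weights already contain the statistics, a prototxt written for the merged model would presumably not load correctly from an unmerged caffemodel, which is consistent with the advice to pair the two.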

@makefile Thank you very much. I'll have a try by using ResNet-50-model-merge.caffemodel.

Peng-wei-Yu avatar Jun 04 '18 10:06 Peng-wei-Yu

@makefile @Peng-wei-Yu In the BoxGroupOutput layer the original setting is 0.001. What value did you finally set it to?

GuoxingYan avatar Jun 07 '18 14:06 GuoxingYan

@makefile @Peng-wei-Yu When I was training, the batch size was 1. There is at least one sample in each of my training pictures, so why is total positive equal to 0 in many iterations during training? My rpn loss is also 0. Have you encountered such a problem? [image: default]

GuoxingYan avatar Jun 08 '18 03:06 GuoxingYan

@GuoxingYan I set fg_thr: 0.01 or 0 in all BoxGroupOutput layers. If your number of positive rois is always 0, maybe your dataset has some problem.

makefile avatar Jun 08 '18 04:06 makefile

@makefile Did you try changing short_size and long_size in train.prototxt? When I changed only short_size or long_size, I got an error.

GuoxingYan avatar Jun 20 '18 03:06 GuoxingYan

@GuoxingYan I did not try changing that. Since a Deconvolution layer is used to upsample, the size may need to be a multiple of 32, 64 or larger.

makefile avatar Jun 20 '18 07:06 makefile
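
If the input size does need to be a multiple of some stride, as suggested above, a candidate size can be snapped to the next valid value before editing the prototxt. This is a minimal sketch; the helper name is hypothetical, and the stride of 32 or 64 is the guess from this thread, not a value confirmed from the network definition.

```python
def round_up_to_multiple(size, stride=32):
    """Round size up to the nearest multiple of stride."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return ((size + stride - 1) // stride) * stride


# Hypothetical candidate sizes for short_size / long_size.
print(round_up_to_multiple(600))       # next multiple of 32
print(round_up_to_multiple(800, 64))   # next multiple of 64
```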

@makefile thank you very much!!

GuoxingYan avatar Jun 20 '18 08:06 GuoxingYan

@makefile Did you run into the following problem when training fpn? [image: default]

GuoxingYan avatar Jun 21 '18 01:06 GuoxingYan

@GuoxingYan I didn't run into that. The integer seems to be abnormally big.

makefile avatar Jun 21 '18 13:06 makefile

@Peng-wei-Yu @zhaoweicai My own data size is 960*1280. I tried using ResNet-50-model-merge.caffemodel, but I also get this problem. [image: wx20180624-154016 2x]

licy5152 avatar Jun 24 '18 07:06 licy5152

@makefile @zhaoweicai @Peng-wei-Yu When I was training, I found that short_size in detection_data_param in train.prototxt is 800, which is exactly equal to img_width and img_height in proposal_target_param. So my question is: when I change short_size to 320, do img_width and img_height also need to be changed to 320?

GuoxingYan avatar Jun 30 '18 04:06 GuoxingYan

@GuoxingYan I think it needs to be.

makefile avatar Jun 30 '18 05:06 makefile
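
Assuming these fields do need to stay in sync (which is only a supposition in this thread), a quick script can flag mismatches after an edit. The sketch below uses a plain regular expression over prototxt-style text rather than Caffe's own parser. The sample fragment is hypothetical and far shorter than a real train.prototxt; only the field names come from the discussion above.

```python
import re


def read_int_fields(prototxt_text, names):
    """Collect integer values for the given field names from prototxt-style text."""
    found = {}
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\s*:\s*(\d+)"
        found[name] = [int(v) for v in re.findall(pattern, prototxt_text)]
    return found


# Hypothetical fragment where short_size was changed but the other fields were not.
sample = """
detection_data_param { short_size: 320 long_size: 512 }
proposal_target_param { img_width: 800 img_height: 800 }
"""
fields = read_int_fields(sample, ["short_size", "img_width", "img_height"])
short_size = fields["short_size"][0]
mismatched = [name for name in ("img_width", "img_height")
              if any(v != short_size for v in fields[name])]
print(mismatched)
```

A non-empty list points at the fields that still carry the old value.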

@makefile I used it to train on my own dataset. How can I get the output for every picture?

licy5152 avatar Jun 30 '18 06:06 licy5152

@licy5152 I wrote a Python script, CascadeRCNN-demo.py, that imitates the MATLAB code; you can modify it for your use.

makefile avatar Jul 02 '18 01:07 makefile

@makefile The link to your demo.py shows up as invalid.

GuoxingYan avatar Jul 02 '18 06:07 GuoxingYan

@GuoxingYan It's probably a problem with your network.

makefile avatar Jul 02 '18 08:07 makefile

@makefile @zhaoweicai When I was training on my own dataset, the following issue happened. However, I have already checked that no box in window_file.txt has xmin = 1664 and xmax = 636. I also could not find a bbox_util.cpp file under the workspace directory. Could you help me solve this issue? Thanks a lot. [image]

PacteraKun avatar Jul 08 '18 06:07 PacteraKun
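
A generic sanity check for annotation boxes can help narrow down a report like the one above. This is only a sketch: the function name and the sample boxes are made up, the (xmin, ymin, xmax, ymax) tuple layout is an assumption, and parsing of window_file.txt is omitted because its exact layout is not shown in this thread.

```python
def find_invalid_boxes(boxes, img_width, img_height):
    """Return boxes with inverted or out-of-range coordinates.

    Each box is assumed to be (xmin, ymin, xmax, ymax) in pixels.
    """
    bad = []
    for box in boxes:
        xmin, ymin, xmax, ymax = box
        inverted = xmin >= xmax or ymin >= ymax
        outside = xmin < 0 or ymin < 0 or xmax > img_width or ymax > img_height
        if inverted or outside:
            bad.append(box)
    return bad


# Made-up boxes; the second one reproduces the xmin = 1664, xmax = 636 pattern.
boxes = [(10, 20, 200, 300), (1664, 50, 636, 400), (5, 5, 2000, 100)]
invalid = find_invalid_boxes(boxes, img_width=1600, img_height=1200)
print(invalid)
```

Whether a coordinate equal to the image width counts as out of range depends on the annotation convention, so the comparison operators may need adjusting.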

@PacteraKun The situation you encountered is unusual; check your data carefully.

makefile avatar Jul 09 '18 02:07 makefile