mx-maskrcnn

Training is too slow. How to be more effective, adjust mini-batch size?

Open ypflll opened this issue 7 years ago • 13 comments

I am running train_alternate.sh, and it is really slow when training Mask R-CNN (step 2): reaching Epoch[0] Batch [2740] takes more than 4 hours, and there are 20 epochs in total!

I am using 2 Titan X GPUs with 12 GB of memory each, so the effective mini-batch size is 4 for me. How do I change it?

ypflll avatar Nov 10 '17 09:11 ypflll

Hi, @ypflll https://github.com/TuSimple/mx-maskrcnn/blob/master/rcnn/tools/train_maskrcnn.py#L29 https://github.com/TuSimple/mx-maskrcnn/blob/master/rcnn/tools/train_rpn.py#L25

Zehaos avatar Nov 10 '17 09:11 Zehaos

Tried it. 12G mem is not enough for two images...

ypflll avatar Nov 10 '17 10:11 ypflll

@ypflll You can set batch_rois to 128, if you want to verify the training quickly.
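For context: in R-CNN-style training, a setting like batch_rois typically caps how many region proposals are sampled for the detection and mask heads at each step, so a smaller value means less work per iteration. The following is a minimal sketch of that kind of subsampling, not the repository's code. The function name `sample_rois`, the `fg_fraction` parameter, its 0.25 default, and the proposal counts are all hypothetical and chosen only for illustration.

```python
import random

def sample_rois(fg_indices, bg_indices, batch_rois=128, fg_fraction=0.25, seed=None):
    """Pick at most `batch_rois` proposals, with up to `fg_fraction` foreground.

    Hypothetical helper for illustration; not taken from mx-maskrcnn.
    """
    rng = random.Random(seed)
    fg_indices = list(fg_indices)
    bg_indices = list(bg_indices)
    # Cap the number of foreground ROIs by the requested fraction.
    num_fg = min(int(round(batch_rois * fg_fraction)), len(fg_indices))
    # Fill the remainder with background ROIs, as many as are available.
    num_bg = min(batch_rois - num_fg, len(bg_indices))
    fg = rng.sample(fg_indices, num_fg)
    bg = rng.sample(bg_indices, num_bg)
    return fg, bg

# Hypothetical proposal pool: 60 foreground and 1940 background candidates.
fg, bg = sample_rois(range(60), range(60, 2000), batch_rois=128, seed=0)
print(len(fg), len(bg))  # prints: 32 96
```

Halving the cap roughly halves the number of ROIs the heads process per image, which is why a lower value is a quick way to check that training runs end to end. The effect on final accuracy is a separate question that this sketch does not address.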

Zehaos avatar Nov 10 '17 10:11 Zehaos

I think the problem may be ROIAlign. I tried TuSimple's implementation of ROIAlign in place of ROIPooling in my Faster R-CNN, and both the training and inference stages were quite slow.

ysfalo avatar Nov 13 '17 03:11 ysfalo

Yes, it accelerates the training. Thanks.

ypflll avatar Nov 13 '17 04:11 ypflll

Hi, @ysfalo @ypflll A new roialign is on the way. I will make a PR soon.

Zehaos avatar Nov 15 '17 02:11 Zehaos

@Zehaos Waiting for it.

ypflll avatar Nov 16 '17 01:11 ypflll

I use 4 Titan Xp GPUs (a single image per GPU) to train Mask R-CNN alternately on Cityscapes. In step 1 (training the RPN), one epoch takes 1.2 hours and GPU memory usage is almost 11 GB. What is your runtime during step 1? A snapshot of the training is below: (image) Can you give me some advice?

wenhe-jia avatar Nov 17 '17 07:11 wenhe-jia

@LeonJWH It takes 8000s/epoch when I train with 2 titanx, so I think your time cost is reasonable. You can try to set batch_rois to 128, as Zehaos says.
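One rough way to compare the two figures in this thread is to multiply epoch time by GPU count. This is only a back-of-the-envelope sketch: it assumes both runs cover the same training step and dataset, and that a Titan X and a Titan Xp are comparable, none of which the thread confirms. The helper name `gpu_seconds_per_epoch` is made up for illustration.

```python
def gpu_seconds_per_epoch(epoch_seconds, num_gpus):
    """Total GPU time spent on one epoch (a hypothetical comparison metric)."""
    return epoch_seconds * num_gpus

# Figures quoted in this thread.
two_titanx = gpu_seconds_per_epoch(8000, 2)          # 8000 s/epoch on 2 GPUs
four_titanxp = gpu_seconds_per_epoch(1.2 * 3600, 4)  # 1.2 h/epoch on 4 GPUs

print(two_titanx, four_titanxp, round(four_titanxp / two_titanx, 2))
```

Under those assumptions the two runs come to about 16000 and 17280 GPU-seconds per epoch, which is within roughly 8% of each other.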

ypflll avatar Nov 19 '17 06:11 ypflll

@ypflll All right, it seems my time cost is acceptable.

wenhe-jia avatar Nov 20 '17 01:11 wenhe-jia

@ypflll Did you also see this during your training? (screenshot: 2017-11-26 7 45 25) Is this what slows down the training process? I am not very familiar with it.

wenhe-jia avatar Nov 26 '17 11:11 wenhe-jia

@LeonJWH Ignore it.

ypflll avatar Nov 27 '17 05:11 ypflll

@ypflll What is your testing time? My inference time is:

testing 123/500 data 0.1469s net 1.8130s post 0.0143s
testing 124/500 data 0.1440s net 1.7960s post 0.0160s
testing 125/500 data 0.1389s net 1.7581s post 0.0172s
testing 126/500 data 0.1382s net 1.7488s post 0.0180s
testing 127/500 data 0.1387s net 1.9848s post 0.0172s
testing 128/500 data 0.1352s net 1.8763s post 0.0187s
testing 129/500 data 0.1407s net 1.8345s post 0.0182s
testing 130/500 data 0.1158s net 1.7482s post 0.0171s

I use resnet101 as the base network.
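To see where the inference time goes, the per-image timings in the log above can be averaged. The snippet below is a small sketch that parses lines in that format with a regular expression. The pattern and the `summarize` helper are written for this example only and assume the log format stays exactly as quoted.

```python
import re
from statistics import mean

# The eight log lines quoted in this comment.
LOG = """\
testing 123/500 data 0.1469s net 1.8130s post 0.0143s
testing 124/500 data 0.1440s net 1.7960s post 0.0160s
testing 125/500 data 0.1389s net 1.7581s post 0.0172s
testing 126/500 data 0.1382s net 1.7488s post 0.0180s
testing 127/500 data 0.1387s net 1.9848s post 0.0172s
testing 128/500 data 0.1352s net 1.8763s post 0.0187s
testing 129/500 data 0.1407s net 1.8345s post 0.0182s
testing 130/500 data 0.1158s net 1.7482s post 0.0171s
"""

# Captures the data, net, and post timings (in seconds) from each entry.
PATTERN = re.compile(r"testing \d+/\d+ data ([\d.]+)s net ([\d.]+)s post ([\d.]+)s")

def summarize(log_text):
    """Return the mean data/net/post time per image and the number of entries."""
    rows = [tuple(float(x) for x in m.groups()) for m in PATTERN.finditer(log_text)]
    data, net, post = zip(*rows)
    return {"n": len(rows), "data": mean(data), "net": mean(net), "post": mean(post)}

stats = summarize(LOG)
print(stats)
```

On these eight entries the network forward pass averages about 1.82 s per image, compared with about 0.14 s for data loading and about 0.02 s for post-processing, so the network itself accounts for most of the test time.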

wenhe-jia avatar Dec 11 '17 08:12 wenhe-jia