mx-maskrcnn
Training is too slow. How can I train more efficiently, e.g. by adjusting the mini-batch size?
I am running train_alternate.sh and it is really slow when training Mask R-CNN (step 2): reaching epoch[0] Batch [2740] already takes more than 4 hours (and there are 20 epochs in total!).
I am using 2 Titan X GPUs with 12 GB of memory each, so the effective minibatch size is 4 for me. How can I change it?
Hi, @ypflll https://github.com/TuSimple/mx-maskrcnn/blob/master/rcnn/tools/train_maskrcnn.py#L29 https://github.com/TuSimple/mx-maskrcnn/blob/master/rcnn/tools/train_rpn.py#L25
Tried it. 12G mem is not enough for two images...
@ypflll You can set batch_rois to 128, if you want to verify the training quickly.
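For reference, here is a minimal sketch of where these two knobs usually live in mx-rcnn-style code. The field names (config.TRAIN.BATCH_IMAGES, config.TRAIN.BATCH_ROIS) are an assumption based on the upstream MXNet RCNN example, so check the lines linked above for the exact names used in this repo:

```python
# Sketch only: field names assumed from the upstream mx-rcnn config layout;
# verify against rcnn/tools/train_maskrcnn.py#L29 and train_rpn.py#L25.
from rcnn.config import config

# Images processed per GPU. With 2 GPUs and 2 images each, the effective
# minibatch size is 4; raising this needs more GPU memory.
config.TRAIN.BATCH_IMAGES = 2

# RoIs sampled per image for the RCNN/mask head. Lowering this (e.g. to 128)
# trades accuracy for faster iterations when you just want to verify training.
config.TRAIN.BATCH_ROIS = 128
```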
I think it may be a problem with ROIAlign. I have tried TuSimple's implementation of ROIAlign, replacing ROIPooling in my Faster R-CNN, and both the training and inference stages are quite slow.
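If you want to check whether the RoI operator itself is the bottleneck, a rough micro-benchmark like the one below can help. This is only a sketch: it assumes a recent MXNet build that exposes the built-in mx.nd.ROIPooling and mx.nd.contrib.ROIAlign operators, whereas the custom ROIAlign shipped in this repo may have a different interface; the shapes and iteration count are arbitrary.

```python
import time
import mxnet as mx

ctx = mx.gpu(0)

# Fake C4-style feature map and 256 RoIs in [batch_index, x1, y1, x2, y2] format.
feat = mx.nd.random.uniform(shape=(1, 256, 64, 128), ctx=ctx)
num_rois = 256
x1 = mx.nd.random.uniform(0, 900, shape=(num_rois,), ctx=ctx)
y1 = mx.nd.random.uniform(0, 400, shape=(num_rois,), ctx=ctx)
rois = mx.nd.stack(mx.nd.zeros(num_rois, ctx=ctx), x1, y1, x1 + 100, y1 + 100, axis=1)

def bench(name, fn, iters=50):
    fn()                 # warm-up
    mx.nd.waitall()
    start = time.time()
    for _ in range(iters):
        fn()
    mx.nd.waitall()      # MXNet executes asynchronously, so sync before timing
    print('%s: %.2f ms/iter' % (name, 1000.0 * (time.time() - start) / iters))

bench('ROIPooling',
      lambda: mx.nd.ROIPooling(feat, rois, pooled_size=(14, 14), spatial_scale=1.0 / 16))
bench('contrib.ROIAlign',
      lambda: mx.nd.contrib.ROIAlign(feat, rois, pooled_size=(14, 14), spatial_scale=1.0 / 16))
```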
Yes, it accelerates the training. Thanks.
Hi, @ysfalo @ypflll A new roialign is on the way. I will make a PR soon.
@Zehaos Waiting for it.
I use 4 Titan Xp GPUs (a single image per GPU) to alternately train Mask R-CNN on Cityscapes. In step 1 (training the RPN), it takes 1.2 hours to run 1 epoch and GPU memory usage is almost 11 GB. What was your runtime during step 1?
The training snapshot is as below:
Can you give me some advice?
@LeonJWH It takes about 8000 s/epoch when I train with 2 Titan X GPUs, so I think your time cost is reasonable. You can try setting batch_rois to 128, as Zehaos suggests.
@ypflll All right, it seems my time cost is acceptable.
@ypflll I want to know whether you have seen this in your training process:
Is this the reason the training process slows down? I am not very familiar with this.
@LeonJWH Ignore it.
@ypflll What is your testing time? The time cost of my inference process is:
testing 123/500 data 0.1469s net 1.8130s post 0.0143s
testing 124/500 data 0.1440s net 1.7960s post 0.0160s
testing 125/500 data 0.1389s net 1.7581s post 0.0172s
testing 126/500 data 0.1382s net 1.7488s post 0.0180s
testing 127/500 data 0.1387s net 1.9848s post 0.0172s
testing 128/500 data 0.1352s net 1.8763s post 0.0187s
testing 129/500 data 0.1407s net 1.8345s post 0.0182s
testing 130/500 data 0.1158s net 1.7482s post 0.0171s
I use ResNet-101 as the base network.
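In case it helps to compare numbers, a small script like the following can average the per-image timings from a log in the format shown above (the file name test.log is just a placeholder for wherever your test output is saved):

```python
import re

# Matches lines like: "testing 123/500 data 0.1469s net 1.8130s post 0.0143s"
LINE_RE = re.compile(r'data ([\d.]+)s net ([\d.]+)s post ([\d.]+)s')

def average_times(lines):
    data, net, post = [], [], []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            data.append(float(m.group(1)))
            net.append(float(m.group(2)))
            post.append(float(m.group(3)))
    count = max(len(net), 1)
    return sum(data) / count, sum(net) / count, sum(post) / count

# 'test.log' is a placeholder path; point it at your captured test output.
with open('test.log') as f:
    d, n, p = average_times(f)
print('avg data %.4fs  net %.4fs  post %.4fs' % (d, n, p))
```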