FastMaskRCNN
Has anyone successfully trained the model yet?
With a GTX 1080, it takes 24 hours to train 22k iterations, but it seems the model needs 1500k iterations, which means it will take more than TWO MONTHS to train!!!
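For reference, the estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch that assumes throughput stays constant over the whole run; the function name is made up for illustration.

```python
def estimated_training_days(iters_done, hours_elapsed, target_iters):
    """Extrapolate total training time, assuming constant iterations per hour."""
    iters_per_hour = iters_done / hours_elapsed
    total_hours = target_iters / iters_per_hour
    return total_hours / 24.0


# Numbers reported above: 22k iterations in 24 hours, 1500k iterations needed.
days = estimated_training_days(22_000, 24, 1_500_000)
print(f"{days:.1f} days")  # roughly 68 days, i.e. a bit over two months
```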
INTOLERABLE!!!
Hi @maxenceliu , has the total_loss explosion problem been solved with the newest version?
Yes, but I don't know which commit corrected this problem... @CharlesShang, would you explain this for us? Just for study. FYI, MobileNets is simply amazing!
This is addressed in the original Mask R-CNN paper. They trained for one to two days on an 8-GPU machine. Since parallelism does not speed training up linearly, a single GPU may well take this long. Even if scaling were linear, two days on 8 GPUs would correspond to 16 days on a single GPU. This was done on the COCO dataset, so other datasets may differ. The more memory you have available, the larger the batch size you can use (with 8 GPUs it could be 8 times larger), so more work runs concurrently on the GPUs, which results in a shorter training time.
"Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine." (original Mask R-CNN paper)
One way to speed it up would be to increase the batch size to the maximum the GPU memory allows, to increase parallelism.
I couldn't finish the training process. One of the losses becomes NaN after ~25k iters. I read that from TensorBoard, so I don't know which loss it is, but I suspect it is the rpn_loss.
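One way to find out which loss goes bad is to check the individual loss values at every step instead of only the total. Below is a minimal sketch in plain Python; the loss names in the dictionary are hypothetical placeholders, not necessarily the names used in this repository, and the values would have to come from your own training loop.

```python
import math


def non_finite_losses(losses):
    """Return the names of all losses whose value is NaN or +/-inf."""
    return [name for name, value in losses.items() if not math.isfinite(value)]


# Hypothetical loss values fetched for one training step.
step_losses = {
    "rpn_box_loss": 0.17,
    "rpn_cls_loss": float("nan"),
    "mask_loss": 0.01,
}

bad = non_finite_losses(step_losses)
if bad:
    print("non-finite losses at this step:", bad)
```

Stopping the run as soon as this list is non-empty keeps the last good checkpoint from being overwritten by diverged weights.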
So far I have trained 130k iterations on a GTX 1080 in ~20h. Still far from what is needed, but the loss is decreasing very slowly. The network always predicts the classes with the highest mean occurrence, i.e.
1.) Unlabeled
2.) Person
3.) Chair
4.) Car
5.) This one is finally varying a bit
Do you think that batchsize=1 is enough for training this network?
Since the original paper uses the Faster R-CNN hyperparameters, mini-batching was used, which means that a batch size greater than 1 must be used. A larger batch size should shorten the training time; however, you may run out of memory very quickly if only one GPU is available (the current implementation does not support multi-GPU training).
After 29683 iters, it gives warnings:
train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
Then, in iter 29684, the loss becomes unusual:
iter 29684: image-id:0094949, time:0.605(sec), regular_loss: 0.179757, total-loss 85438849024.0000(163221872.0000, 73605677056.0000, 30830362.000000, 11639122944.0000, 3.2994), instances: 8, batch:(125|524, 8|12, 8|8)
iter 29685: image-id:0357095, time:0.688(sec), regular_loss: 10989575769948160.000000, total-loss 2035863.0000(0.0033, 0.1700, 0.000137, 2035862.8750, 0.0118), instances: 2, batch:(32|152, 2|32, 2|2)
iter 29686: image-id:0094952, time:0.764(sec), regular_loss: nan, total-loss 5372209.0000(0.0358, 0.2918, 0.000548, 5372208.5000, 0.0244), instances: 9, batch:(312|1256, 9|46, 9|9)
NaN happens...
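The overflow warnings come from np.exp being applied to very large predicted deltas. A common mitigation is to clamp dw and dh before exponentiating. The sketch below is not the repository's code: the function name and the cap value log(1000/16) are assumptions chosen for illustration, and clamping only hides the symptom if the regression itself is diverging.

```python
import numpy as np

# Assumed upper bound on the log-scale deltas; the exact value is a free choice.
MAX_DELTA = np.log(1000.0 / 16.0)


def decode_sizes(dw, dh, widths, heights, max_delta=MAX_DELTA):
    """Decode predicted box widths/heights, clamping deltas so exp cannot overflow."""
    dw = np.minimum(dw, max_delta)
    dh = np.minimum(dh, max_delta)
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]
    return pred_w, pred_h


# Two boxes, one class each; the first delta would overflow without the clamp.
dw = np.array([[1000.0], [0.0]])
dh = np.array([[0.0], [1000.0]])
widths = np.array([10.0, 20.0])
heights = np.array([30.0, 40.0])

pred_w, pred_h = decode_sizes(dw, dh, widths, heights)
print(np.isfinite(pred_w).all() and np.isfinite(pred_h).all())
```

If the warnings still appear shortly before the loss explodes, a lower learning rate or gradient clipping is probably the more fundamental fix.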
Does anyone know when or how to stop the training process? What is max_iters? So far I have trained 90k iterations.
@YYfangzi I think the maximum number of iterations is set to 2500K. You can check it here: https://github.com/CharlesShang/FastMaskRCNN/blob/fe9c0dc3ec487aa032f57cadb68b5514b285ed46/libs/configs/config_v1.py#L70 Up to now I have trained about 180K, still far from completion.
@boycebai I just used the GitHub code and didn't modify any lines. What errors did you encounter?
I'm glad to hear from you. I've run this code. The problem was probably the configuration of my machine. Thank you very much!
Best Wishes
At 2017-06-02 10:45:18, "handong1587" wrote:
@boycebai I just use the github code, didn't modify any code lines. What errors did you encounter?
I am new to this field. I have the questions below.
- I want to find only persons and their masks in the image. Is it possible to train on the person class only? Would that reduce the training time?
- Is it possible to train this model on a Windows machine with CPU only?