FastMaskRCNN

Has anyone successfully trained the model yet?

Open maxenceliu opened this issue 7 years ago • 12 comments

With a GTX 1080, it takes 24 hours to train 22k iterations, but it seems the model needs 1500k iterations, which means it will take more than TWO MONTHS to train the model!!!

INTOLERABLE!!!
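For reference, the estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the two figures quoted in this post; it assumes throughput stays constant for the whole run.

```python
# Figures quoted in the post above.
iters_per_day = 22_000       # measured: 22k iterations in 24 hours on a GTX 1080
target_iters = 1_500_000     # the poster's reading of the required iteration count

# Assumes the iteration rate does not change over the run.
days_needed = target_iters / iters_per_day
print(f"Estimated training time: {days_needed:.1f} days")  # a little over two months
```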

maxenceliu avatar Apr 25 '17 02:04 maxenceliu

Hi @maxenceliu, has the total_loss explosion problem been solved with the newest version?

lihungchieh avatar Apr 25 '17 03:04 lihungchieh

Yes, but I don't know which commit corrected this problem... @CharlesShang, would you explain this for us? Just for study. FYI, MobileNets is simply amazing!

maxenceliu avatar Apr 25 '17 03:04 maxenceliu

This is addressed in the original Mask R-CNN paper. They trained for 2 days on an 8-GPU machine. Since parallelism does not give a linear speed-up, training on a single GPU may well take this long. Even if the speed-up were linear, it would be 16 days on a single GPU. This was done on the COCO dataset, so other datasets may differ. The more memory you have available, the larger you can make the batch size (with 8 GPUs it could be 8 times larger), so the CUDA streams can process more tasks in parallel, which results in a shorter training time.

"Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine." (original Mask R-CNN paper)

One way to speed it up would be to increase the batch size to the maximum that fits in GPU memory, which increases parallelism.
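The single-GPU figure above can be sketched as a small helper. The `scaling_efficiency` parameter is a hypothetical knob added for illustration (it is not something the paper reports): 1.0 means the 8-GPU run scaled perfectly linearly, and lower values mean the extra GPUs helped less than linearly.

```python
def single_gpu_days(multi_gpu_days, num_gpus, scaling_efficiency=1.0):
    """Estimate single-GPU training time from a multi-GPU run.

    scaling_efficiency is an assumed value in (0, 1]; 1.0 corresponds to a
    perfectly linear multi-GPU speed-up.
    """
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    return multi_gpu_days * num_gpus * scaling_efficiency


# Two days on 8 GPUs with perfectly linear scaling: 16 days on one GPU.
print(single_gpu_days(2, 8))
# If the 8 GPUs only delivered 75% of the ideal speed-up (an assumed figure).
print(single_gpu_days(2, 8, scaling_efficiency=0.75))
```

Under this model the 16-day figure is an upper bound for a two-day run: the less efficiently the 8-GPU run scaled, the shorter the equivalent single-GPU run.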

kevinkit avatar Apr 25 '17 11:04 kevinkit

I couldn't finish the training process. One of the losses becomes NaN after ~25k iters. I read that from TensorBoard, so I don't know which loss it is, but I suspect it is the rpn_loss.
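One way to find out which term diverges is to check each loss value individually at every step. The following is a minimal sketch, not code from this repository: the loss names and values are hypothetical placeholders, and it assumes the individual losses are available as plain Python floats after each training step.

```python
import math


def find_non_finite(losses):
    """Return the names of loss terms that are NaN or infinite."""
    return [name for name, value in losses.items() if not math.isfinite(value)]


# Hypothetical loss names and values, for illustration only.
step_losses = {
    "rpn_box_loss": float("nan"),
    "rpn_cls_loss": 0.17,
    "mask_loss": 0.012,
}

bad = find_non_finite(step_losses)
if bad:
    print("non-finite losses:", ", ".join(bad))
```

Stopping the run as soon as the list is non-empty makes it easier to inspect the offending image and iteration.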

Nikasa1889 avatar Apr 30 '17 08:04 Nikasa1889

So far I trained 130k iterations on a GTX 1080 in ~20h. Still far from what is needed, but the loss is decreasing very slowly. The network always predicts the classes with the highest mean occurrence, i.e.

  1. Unlabeled
  2. Person
  3. Chair
  4. Car
  5. This one is finally varying a bit

Do you think that batchsize=1 is enough for training this network?

MartinSmeyer avatar May 03 '17 12:05 MartinSmeyer

Since the original paper uses the Faster R-CNN hyperparameters, mini-batching was used, which means that a batch size greater than 1 must be used. A larger batch size will shorten the training time, but you may run out of memory very soon if only one GPU is available (the current implementation does not support multi-GPU training).

kevinkit avatar May 03 '17 14:05 kevinkit

After 29683 iters, it gives warnings:

train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]

Then, in iter 29684, the loss becomes unusual:

iter 29684: image-id:0094949, time:0.605(sec), regular_loss: 0.179757, total-loss 85438849024.0000(163221872.0000, 73605677056.0000, 30830362.000000, 11639122944.0000, 3.2994), instances: 8, batch:(125|524, 8|12, 8|8)
iter 29685: image-id:0357095, time:0.688(sec), regular_loss: 10989575769948160.000000, total-loss 2035863.0000(0.0033, 0.1700, 0.000137, 2035862.8750, 0.0118), instances: 2, batch:(32|152, 2|32, 2|2)
iter 29686: image-id:0094952, time:0.764(sec), regular_loss: nan, total-loss 5372209.0000(0.0358, 0.2918, 0.000548, 5372208.5000, 0.0244), instances: 9, batch:(312|1256, 9|46, 9|9)

NaN happens...
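The overflow warnings come from exponentiating very large predicted deltas. A common guard in detection codebases is to clip `dw` and `dh` before calling `np.exp`. The sketch below is an illustration under stated assumptions, not the repository's code: `decode_sizes` and `MAX_DELTA` are hypothetical names, and the cap `log(1000 / 16)` is a value used elsewhere, not a setting taken from this project.

```python
import numpy as np

# Assumed cap on the log-space size deltas (not taken from this repository).
MAX_DELTA = np.log(1000.0 / 16.0)


def decode_sizes(dw, dh, widths, heights, max_delta=MAX_DELTA):
    """Decode predicted box widths and heights, clipping the deltas first."""
    dw = np.minimum(dw, max_delta)  # keeps np.exp from overflowing
    dh = np.minimum(dh, max_delta)
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]
    return pred_w, pred_h


widths = np.array([32.0, 64.0])
heights = np.array([16.0, 128.0])
dw = np.array([[0.1], [1000.0]])  # the second delta would overflow np.exp
dh = np.array([[0.0], [1000.0]])

pred_w, pred_h = decode_sizes(dw, dh, widths, heights)
print(np.isfinite(pred_w).all(), np.isfinite(pred_h).all())
```

This only guards the decoding step. It does not explain why the regression outputs grew so large in the first place, so the loss can still diverge for other reasons.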

Kongsea avatar May 04 '17 09:05 Kongsea

Does anyone know when or how to stop the training process? What is max_iters? So far I have trained 90k iterations.

YYfangzi avatar May 12 '17 13:05 YYfangzi

@YYfangzi I think the maximum number of iterations is set to 2500K. You can check it here: https://github.com/CharlesShang/FastMaskRCNN/blob/fe9c0dc3ec487aa032f57cadb68b5514b285ed46/libs/configs/config_v1.py#L70 Up to now I have trained about 180K, still far from complete.

handong1587 avatar May 28 '17 05:05 handong1587

@boycebai I just used the GitHub code and didn't modify any lines. What errors did you encounter?

handong1587 avatar Jun 02 '17 02:06 handong1587

I'm glad to hear from you. I've run this code. The problem must have been the configuration of my machine. Thank you very much!

Best Wishes

boycebai avatar Jun 02 '17 04:06 boycebai

I am new to this field and have the following questions.

  1. I want to find only persons and their masks in an image. Is it possible to train on the person class only? Would that reduce the training time?
  2. Is it possible to train this model on a Windows machine with CPU only?

vps62 avatar Jan 21 '18 15:01 vps62