Mask_RCNN
overall loss is not the sum of the 5 losses in train_shapes sample
I'm running Mask R-CNN on TF 2.5.0, CUDA 11.1, and cuDNN 8.1.1, using the akTwelve implementation for TF 2.x.
The first epoch is completely fine, and the sum of the 5 main losses (rpn_class_loss, rpn_bbox_loss, mrcnn_class_loss, mrcnn_bbox_loss, and mrcnn_mask_loss) equals the overall loss for both training and validation.
In the second epoch, however, the sum of the 5 losses is only 1/2 of the overall loss, for both training and validation. I call model.train again to train the second epoch.
I then added another model.train call after this to train for 2 additional epochs (4 epochs in total). This time the sum of the 5 losses was 1/3 of the overall loss!
Epoch1/1: 100/100 [==============================] - 66s 594ms/step - batch: 49.5000 - size: 8.0000 - loss: 1.3710 - rpn_class_loss: 0.0236 - rpn_bbox_loss: 0.5378 - mrcnn_class_loss: 0.2624 - mrcnn_bbox_loss: 0.3398 - mrcnn_mask_loss: 0.2073 - val_loss: 0.7835 - val_rpn_class_loss: 0.0143 - val_rpn_bbox_loss: 0.4560 - val_mrcnn_class_loss: 0.1051 - val_mrcnn_bbox_loss: 0.1279 - val_mrcnn_mask_loss: 0.0802
Epoch2/2: 100/100 [==============================] - 31s 184ms/step - batch: 49.5000 - size: 8.0000 - loss: 1.3859 - rpn_class_loss: 0.0139 - rpn_bbox_loss: 0.3914 - mrcnn_class_loss: 0.0920 - mrcnn_bbox_loss: 0.0965 - mrcnn_mask_loss: 0.0991 - val_loss: 1.2546 - val_rpn_class_loss: 0.0115 - val_rpn_bbox_loss: 0.4192 - val_mrcnn_class_loss: 0.0752 - val_mrcnn_bbox_loss: 0.0570 - val_mrcnn_mask_loss: 0.0645
Epoch3/4: 100/100 [==============================] - 32s 187ms/step - batch: 49.5000 - size: 8.0000 - loss: 1.8418 - rpn_class_loss: 0.0133 - rpn_bbox_loss: 0.3781 - mrcnn_class_loss: 0.0772 - mrcnn_bbox_loss: 0.0624 - mrcnn_mask_loss: 0.0830 - val_loss: 2.0209 - val_rpn_class_loss: 0.0127 - val_rpn_bbox_loss: 0.4384 - val_mrcnn_class_loss: 0.0951 - val_mrcnn_bbox_loss: 0.0625 - val_mrcnn_mask_loss: 0.0649
Epoch4/4: 100/100 [==============================] - 15s 148ms/step - batch: 49.5000 - size: 8.0000 - loss: 1.5852 - rpn_class_loss: 0.0124 - rpn_bbox_loss: 0.3374 - mrcnn_class_loss: 0.0592 - mrcnn_bbox_loss: 0.0508 - mrcnn_mask_loss: 0.0686 - val_loss: 1.7954 - val_rpn_class_loss: 0.0114 - val_rpn_bbox_loss: 0.3634 - val_mrcnn_class_loss: 0.0833 - val_mrcnn_bbox_loss: 0.0660 - val_mrcnn_mask_loss: 0.0743
Does anyone have an idea why this is happening?
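In case it helps with diagnosis: the pattern (components summing to 1/2 of the total after the second model.train call, 1/3 after the third) looks as if the loss tensors might be getting added once per recompile. A rough check of that guess, assuming the usual matterport-style API where the wrapped Keras model is exposed as model.keras_model and the model/dataset objects come from the train_shapes notebook:

```python
# Diagnostic sketch (assumption: duplicated add_loss calls on recompile are the cause).
# If the number of tracked loss tensors grows after every model.train() call,
# compile() is stacking the same losses and the reported "loss" becomes roughly
# N x the sum of the five component metrics.
print("tracked loss tensors before training:", len(model.keras_model.losses))

model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=1, layers='heads')
print("after 1st train call:", len(model.keras_model.losses))

model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=2, layers='heads')
print("after 2nd train call:", len(model.keras_model.losses))
```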
Hello, I'm having the same issue. Did you find out why this is happening? Thanks.
No, I did not. All the TF2 implementations I have tried had the same issue. Try a TF 1.x implementation if possible.
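One thing that may be worth trying as a workaround (not a fix): since the mismatch seems to grow by one multiple per model.train call, doing all the training in a single call, or rebuilding the model before each call, may keep the reported loss equal to the sum of the 5 components. A sketch using the names from the train_shapes notebook (MODEL_DIR, COCO_MODEL_PATH, dataset_train, dataset_val, config):

```python
# Workaround sketch (assumption: the duplication comes from the recompile that
# every model.train() call triggers). Rebuild the model, load weights, and then
# train all epochs in one call instead of 1 + 1 + 2 separate calls.
model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)
model.load_weights(COCO_MODEL_PATH, by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=4,            # total epochs in a single call
            layers='heads')
```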