Mask_RCNN
Validation loss keeps fluctuating
Hi all,
I am using this Mask R-CNN library for detection and segmentation. My class distribution is: Class_Occurrences = { 0:189 , 1:22, 2:1, 3:40, 4:28, 5:85, 6:40, 7:63, 8:42, 9:5 } (key: class_id, value: number of occurrences). The first class, with key 0, is the background.
The dataset contains 189 training images and 53 validation images.
- Training process 1: 100 epochs, pre-trained COCO weights, without augmentation. Resulting mAP: 0.17
- Training process 2: 100 epochs, pre-trained COCO weights, with online augmentation. Resulting mAP: 0.29
Augmentation Config:
augmentation = iaa.SomeOf((0, 3), [ iaa.Fliplr(0.5), iaa.Flipud(0.5), iaa.OneOf([iaa.Affine(rotate=90), iaa.Affine(rotate=180), iaa.Affine(rotate=270)]), iaa.Multiply((0.8, 1.5)), iaa.GaussianBlur(sigma=(0.0, 5.0)) ])
Below you can see the training and validation loss for process 2:
[Figures: training loss curves and validation loss curves for process 2; images not available]
My questions are: why is the mAP so low, and what can I do to improve performance? And why does the training loss decrease while the validation loss does not (it fluctuates)? I tried adding class_weight to work around the class imbalance, but I always get this error: Unknown entries in class_weight dictionary: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. Only expected following keys: []
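On the `class_weight` error: Keras checks the keys of a `class_weight` dictionary against the names of the model outputs that have a loss attached, and the empty list in the message suggests this model exposes none, so class IDs are rejected as keys. If you still want per-class weights, the sketch below only computes them from the occurrence counts posted above. The function name `inverse_frequency_weights` is hypothetical, and applying the weights would mean modifying the library's classification loss, which is not shown here.

```python
# Class occurrences as posted above (key: class_id, value: count; 0 is background).
class_occurrences = {0: 189, 1: 22, 2: 1, 3: 40, 4: 28, 5: 85, 6: 40, 7: 63, 8: 42, 9: 5}


def inverse_frequency_weights(occurrences, background_id=0):
    """Weights proportional to 1/count, normalised to a mean of 1 over foreground classes."""
    foreground = {cid: n for cid, n in occurrences.items()
                  if cid != background_id and n > 0}
    raw = {cid: 1.0 / n for cid, n in foreground.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {cid: w / mean_raw for cid, w in raw.items()}


weights = inverse_frequency_weights(class_occurrences)
for cid in sorted(weights):
    print(f"class {cid}: {class_occurrences[cid]:3d} instances, weight {weights[cid]:.2f}")
```

Class 2, with a single instance, receives by far the largest weight. With so few examples, collecting more data for that class is likely to help more than reweighting.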
Model Configuration:
| Name | Value |
|---|---|
| BACKBONE | resnet101 |
| BACKBONE_STRIDES | [4, 8, 16, 32, 64] |
| BATCH_SIZE | 1 |
| BBOX_STD_DEV | [0.1 0.1 0.2 0.2] |
| COMPUTE_BACKBONE_SHAPE | None |
| DETECTION_MAX_INSTANCES | 100 |
| DETECTION_MIN_CONFIDENCE | 0.9 |
| DETECTION_NMS_THRESHOLD | 0.3 |
| FPN_CLASSIF_FC_LAYERS_SIZE | 1024 |
| GPU_COUNT | 1 |
| GRADIENT_CLIP_NORM | 5.0 |
| IMAGES_PER_GPU | 1 |
| IMAGE_CHANNEL_COUNT | 3 |
| IMAGE_MAX_DIM | 1024 |
| IMAGE_META_SIZE | 22 |
| IMAGE_MIN_DIM | 800 |
| IMAGE_MIN_SCALE | 0 |
| IMAGE_RESIZE_MODE | square |
| IMAGE_SHAPE | [1024 1024 3] |
| LEARNING_MOMENTUM | 0.9 |
| LEARNING_RATE | 0.001 |
| LOSS_WEIGHTS | {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0} |
| MASK_POOL_SIZE | 14 |
| MASK_SHAPE | [28, 28] |
| MAX_GT_INSTANCES | 100 |
| MEAN_PIXEL | [123.7 116.8 103.9] |
| MINI_MASK_SHAPE | (56, 56) |
| NAME | |
| NUM_CLASSES | 10 |
| POOL_SIZE | 7 |
| POST_NMS_ROIS_INFERENCE | 1000 |
| POST_NMS_ROIS_TRAINING | 2000 |
| PRE_NMS_LIMIT | 6000 |
| ROI_POSITIVE_RATIO | 0.33 |
| RPN_ANCHOR_RATIOS | [0.5, 1, 2] |
| RPN_ANCHOR_SCALES | (32, 64, 128, 256, 512) |
| RPN_ANCHOR_STRIDE | 1 |
| RPN_BBOX_STD_DEV | [0.1 0.1 0.2 0.2] |
| RPN_NMS_THRESHOLD | 0.7 |
| RPN_TRAIN_ANCHORS_PER_IMAGE | 256 |
| STEPS_PER_EPOCH | 100 |
| TOP_DOWN_PYRAMID_SIZE | 256 |
| TRAIN_BN | False |
| TRAIN_ROIS_PER_IMAGE | 200 |
| USE_MINI_MASK | True |
| USE_RPN_ROIS | True |
| VALIDATION_STEPS | 100 |
| WEIGHT_DECAY | 0.0001 |
- It seems pretty obvious to me that your model is immediately overfitting. Your validation loss is almost double your training loss immediately. I would think that the learning rate may be too high, and would try reducing it.
I recommend this blog.
- mAP will vary based on your threshold and IoU. Try reducing the threshold and visualize some results to see if that's better.
- Your validation loss is varying wildly because your validation set is likely not representative of the whole dataset. I would recommend shuffling/resampling the validation set, or using a larger validation fraction.
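To illustrate the point about thresholds, here is a minimal sketch of how the number of correct detections changes with the confidence cut and the IoU threshold. The boxes and scores are made up, the `(y1, x1, y2, x2)` box layout is an assumption, and `count_matches` is a simplified greedy matcher, not the library's mAP code.

```python
def box_iou(a, b):
    """IoU of two boxes given as (y1, x1, y2, x2)."""
    y1, x1 = max(a[0], b[0]), max(a[1], b[1])
    y2, x2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y2 - y1) * max(0, x2 - x1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(detections, gt_boxes, min_confidence, iou_threshold):
    """Count detections that pass the score cut and overlap a not-yet-matched ground-truth box."""
    used = set()
    matches = 0
    # Highest-scoring detections get first pick of the ground-truth boxes.
    for box, score in sorted(detections, key=lambda d: -d[1]):
        if score < min_confidence:
            continue
        for i, gt in enumerate(gt_boxes):
            if i not in used and box_iou(box, gt) >= iou_threshold:
                used.add(i)
                matches += 1
                break
    return matches


# Made-up example: two ground-truth boxes and two detections as (box, score).
gt = [(0, 0, 100, 100), (200, 200, 300, 300)]
dets = [((5, 5, 100, 100), 0.95), ((210, 210, 310, 310), 0.60)]

for conf, iou in [(0.9, 0.5), (0.5, 0.5), (0.5, 0.75)]:
    print(f"min_confidence={conf}, IoU>={iou}: {count_matches(dets, gt, conf, iou)} matches")
```

With DETECTION_MIN_CONFIDENCE at 0.9, as in the configuration above, the second detection is discarded before it can be matched at all, which is one way a high threshold can depress mAP.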
Hi MahBadran93,
- It shows signs of overfitting: if you fit a trend line to the validation loss, it goes down and then up, while the training loss keeps going down.
- It also suggests that the training dataset may not be representative enough, so the model didn't learn enough to perform the task. Make sure that you feed the right images to your model.
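The trend-line idea can be sketched with a trailing moving average: smooth the noisy validation losses, then look for the epoch where the smoothed curve is lowest. The loss values below are made up for illustration, and `moving_average` and `best_epoch` are hypothetical helper names.

```python
def moving_average(values, window=5):
    """Trailing moving average; early entries average over the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def best_epoch(val_losses, window=5):
    """0-based index of the epoch with the lowest smoothed validation loss."""
    smoothed = moving_average(val_losses, window)
    return min(range(len(smoothed)), key=smoothed.__getitem__)


# Made-up noisy validation losses: falling at first, then rising again.
val_losses = [2.0, 1.6, 1.7, 1.3, 1.4, 1.1, 1.3, 1.2, 1.5, 1.4, 1.7, 1.6]
turn = best_epoch(val_losses, window=3)
print(f"smoothed validation loss is lowest at epoch index {turn}")
```

If the smoothed curve keeps rising after that point while the training loss still falls, the checkpoint saved around that epoch is usually the better candidate for evaluation.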
I have a question. Does Mask R-CNN not adjust its weights based on the validation dataset after each epoch? I have a dataset divided into train, val and test; train and val are supplied for training. If I run the model on the validation dataset, the results are quite poor, let alone on the test dataset. Does this mean the validation dataset is not used for training, and is only there for us to check the val score while training is going on?
The validation set is used to validate training.
After each epoch, the current model is evaluated on the validation set. This evaluation shows whether the last round of training improved the model or not. So the validation set is not explicitly used to train the model, but it is used during training, if that makes sense.
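A toy sketch of that distinction, assuming a one-parameter model fitted by plain gradient descent (this is not Mask R-CNN code, and the data points are made up): the gradient is computed from the training data only, while the validation loss is just measured and used to decide when to stop and which weight to keep.

```python
def fit_with_early_stopping(train, val, lr=0.05, max_epochs=200, patience=5):
    """Fit y = w * x by gradient descent; the validation set only decides when to stop."""
    w = 0.0
    best_w, best_val, bad_epochs = w, float("inf"), 0
    for _ in range(max_epochs):
        # Weight update: uses the training data only.
        grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= lr * grad
        # Validation: measured after the epoch; no gradient step is taken on it.
        val_loss = sum((w * x - y) ** 2 for x, y in val) / len(val)
        if val_loss < best_val - 1e-9:
            best_w, best_val, bad_epochs = w, val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_w, best_val


train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
val = [(1.5, 3.0), (2.5, 5.0)]
w, val_loss = fit_with_early_stopping(train, val)
print(f"kept w={w:.3f} with validation loss {val_loss:.5f}")
```

In this setup a poor validation score never feeds back into the weights, so a model can fit the training set well and still do badly on the validation set.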
@TimNagle-McNaughton, thank you for your reply. So if my validation score is not improving, does the model learn that and adjust its weights? That would mean it learns on both the train and val datasets, and if so, the resulting model should not perform that poorly on the val dataset. Am I getting this correctly? My validation score stops improving after 40 epochs, and the trained model is unable to segment most of the objects in the validation/test datasets. Any ideas on how to improve training?
I tried something. I wanted to retrain all layers of the backbone network on my custom dataset, so I set TRAIN_BN = True in config.py. Am I correct here? Does this mean no layer will be frozen while training?
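> So if my validation score is not improving, does the training model learn that and adjust its weights?

Broadly, yes, though only indirectly: no gradient step is taken on the validation data; it only informs decisions such as when to stop or when to lower the learning rate.

> the resultant model should not perform that poorly on val dataset

Correct.

> For which I set TRAIN_BN = True

I'm not familiar with that flag, sorry.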
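On the TRAIN_BN question: as far as I can tell from the library source, TRAIN_BN only controls whether the batch normalization layers are trained, and the default of False is meant for small batch sizes such as the BATCH_SIZE of 1 used here. Which layers are frozen is chosen by the `layers` argument of `model.train()` (`"heads"`, `"3+"`, `"4+"`, `"5+"` or `"all"`). A minimal sketch, where the class name and NAME value are hypothetical and a stand-in base class is used if the library is not installed:

```python
try:
    from mrcnn.config import Config  # Matterport Mask_RCNN, if installed
except ImportError:
    class Config:  # stand-in so this sketch runs without the library
        pass


class FullTrainingConfig(Config):
    NAME = "custom"       # hypothetical experiment name
    NUM_CLASSES = 1 + 9   # background + 9 classes, as in the first configuration above
    TRAIN_BN = False      # keep batch-norm layers frozen; suits very small batches


# Which layers are trainable is chosen per call to train(), for example (not run here):
# model.train(dataset_train, dataset_val,
#             learning_rate=config.LEARNING_RATE,
#             epochs=40, layers="all")
print(FullTrainingConfig.NUM_CLASSES, FullTrainingConfig.TRAIN_BN)
```

With only about 189 training images, training all layers may overfit faster than training the heads alone, so it is worth comparing both against the validation loss.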
> the resultant model should not perform that poorly on val dataset
>
> Correct.
I guess my trained model is not efficient then because it is in fact performing poorly on val set. Thanks anyway @TimNagle-McNaughton
Thank you @TimNagle-McNaughton for your answer.
You are right, the dataset was not representative enough and that was the main issue.
Hello, I am facing the same problem. Based on the previous answers I adjusted my data split: I started with 80-20 and also tried 90-10 and 70-30, but I get the same result. The epoch_loss looks great, but the validation loss keeps fluctuating.
I am only training the heads, and the loss fluctuates no matter how many epochs I train for.
I read elsewhere that a possible cause is a model that is too complex, but I don't think that argument fits here.
This is the dataset I am using: https://github.com/dsmlr/Car-Parts-Segmentation/
I'd appreciate any advice on where to look next.
| Name | Value |
|---|---|
| BACKBONE | resnet101 |
| BACKBONE_STRIDES | [4, 8, 16, 32, 64] |
| BATCH_SIZE | 1 |
| BBOX_STD_DEV | [0.1 0.1 0.2 0.2] |
| COMPUTE_BACKBONE_SHAPE | None |
| DETECTION_MAX_INSTANCES | 35 |
| DETECTION_MIN_CONFIDENCE | 0.7 |
| DETECTION_NMS_THRESHOLD | 0.3 |
| FPN_CLASSIF_FC_LAYERS_SIZE | 1024 |
| GPU_COUNT | 1 |
| GRADIENT_CLIP_NORM | 5.0 |
| IMAGES_PER_GPU | 1 |
| IMAGE_CHANNEL_COUNT | 3 |
| IMAGE_MAX_DIM | 512 |
| IMAGE_META_SIZE | 32 |
| IMAGE_MIN_DIM | 512 |
| IMAGE_MIN_SCALE | 0 |
| IMAGE_RESIZE_MODE | square |
| IMAGE_SHAPE | [512 512 3] |
| LEARNING_MOMENTUM | 0.9 |
| LEARNING_RATE | 0.001 |
| LOSS_WEIGHTS | {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0} |
| MASK_POOL_SIZE | 14 |
| MASK_SHAPE | [28, 28] |
| MAX_GT_INSTANCES | 100 |
| MEAN_PIXEL | [123.7 116.8 103.9] |
| MINI_MASK_SHAPE | (56, 56) |
| NAME | car_parts |
| NUM_CLASSES | 20 |
| POOL_SIZE | 7 |
| POST_NMS_ROIS_INFERENCE | 1000 |
| POST_NMS_ROIS_TRAINING | 2000 |
| PRE_NMS_LIMIT | 6000 |
| ROI_POSITIVE_RATIO | 0.33 |
| RPN_ANCHOR_RATIOS | [0.5, 1, 2] |
| RPN_ANCHOR_SCALES | (32, 64, 128, 256, 512) |
| RPN_ANCHOR_STRIDE | 1 |
| RPN_BBOX_STD_DEV | [0.1 0.1 0.2 0.2] |
| RPN_NMS_THRESHOLD | 0.7 |
| RPN_TRAIN_ANCHORS_PER_IMAGE | 256 |
| STEPS_PER_EPOCH | 500 |
| TOP_DOWN_PYRAMID_SIZE | 256 |
| TRAIN_BN | False |
| TRAIN_ROIS_PER_IMAGE | 200 |
| USE_MINI_MASK | False |
| USE_RPN_ROIS | True |
| VALIDATION_STEPS | 100 |
| WEIGHT_DECAY | 0.0001 |
UPDATE: It was fluctuating because my dataset already has a background annotation. When I created my custom Dataset, this produced two background classes, which caused problems during training. Now my training is not fluctuating any more.
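The library's Dataset base class already registers a background class with ID 0, so a background category in the annotations has to be dropped before the classes are registered. A minimal sketch of that clean-up, assuming a COCO-style category list; the category names and the helper `foreground_classes` are hypothetical:

```python
def foreground_classes(categories, background_names=("background", "bg", "_background_")):
    """Drop background entries and renumber the remaining classes from 1."""
    kept = [c for c in categories if c["name"].lower() not in background_names]
    # Map original annotation IDs to new contiguous IDs; 0 stays reserved for background.
    id_map = {c["id"]: new_id for new_id, c in enumerate(kept, start=1)}
    names = {id_map[c["id"]]: c["name"] for c in kept}
    return id_map, names


# Hypothetical category list in the style of a COCO annotation file.
categories = [
    {"id": 0, "name": "_background_"},
    {"id": 1, "name": "back_bumper"},
    {"id": 2, "name": "back_glass"},
]
id_map, names = foreground_classes(categories)
print(id_map, names)

# With the library, each entry of `names` would then be registered once, e.g.
# self.add_class("car_parts", class_id, class_name), and NUM_CLASSES = 1 + len(names).
```

Annotations that still carry the original background ID should be skipped when masks are loaded, since `id_map` has no entry for them.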
I got these results for two networks (Network 1 and Network 2; plots not available). My dataset has an imbalance problem, but is that the only reason, or is there something else?
Hello, I have this problem too. Can you tell me how to solve it? Thanks!
I couldn't come to any conclusion.
If you solve it one day, please tell me! Thank you, sys!
You need to solve the data imbalance problem. It can be the main reason for the bad results. You want to make sure that you have an equal distribution for each class across train, val and test. You can try augmentation.
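One way to check the class distribution across splits is to compare the per-class share of instances in each one. A minimal sketch with made-up label lists; the helper names and the 10% tolerance are arbitrary choices for illustration:

```python
from collections import Counter


def class_fractions(labels):
    """Fraction of instances per class in one split."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cid: n / total for cid, n in counts.items()}


def missing_or_skewed(train_labels, val_labels, tolerance=0.10):
    """Classes absent from either split, or whose share differs by more than `tolerance`."""
    train_frac = class_fractions(train_labels)
    val_frac = class_fractions(val_labels)
    flagged = []
    for cid in sorted(set(train_frac) | set(val_frac)):
        diff = abs(train_frac.get(cid, 0.0) - val_frac.get(cid, 0.0))
        if cid not in train_frac or cid not in val_frac or diff > tolerance:
            flagged.append(cid)
    return flagged


# Made-up instance labels: one class ID per annotated object.
train_labels = [1] * 40 + [2] * 40 + [3] * 20
val_labels = [1] * 9 + [2] * 2 + [3] * 9
print("classes to rebalance:", missing_or_skewed(train_labels, val_labels))
```

Any flagged class is a candidate for resplitting or for targeted augmentation of the images that contain it.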
I tried data augmentation, but with a pretrained AlexNet the classification report shows that some classes are never predicted, and accuracy is very low. On the MNIST dataset it gave 98%, but on the ECG dataset it was 48%, and my classification report shows a precision/recall of 0 for a few classes.
Hi guys, I'm facing the same issue. Here is my advice:
- Check your dataset. The same preprocessing (augmentation, rescaling, etc.) should be applied to all datasets.
- Use the callback API in Keras to keep reducing the learning rate. This method helped me out.
I hope this is helpful.
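Keras ships a `ReduceLROnPlateau` callback for this. The sketch below re-implements the same idea in plain Python so the logic is visible without any dependencies. The class name, the parameter defaults and the loss values are illustrative, not the Keras implementation, and whether your version of the library's `train()` accepts extra callbacks is something to verify.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


# Made-up validation losses; the rate is halved after three epochs without a new best.
scheduler = ReduceOnPlateau(lr=0.001)
history = [scheduler.step(v) for v in [1.0, 0.8, 0.9, 0.85, 0.82, 0.7, 0.75]]
print(history)
```

A smaller learning rate late in training tends to damp the epoch-to-epoch swings in validation loss, though it does not fix a validation set that is unrepresentative.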











