
Validation loss keeps fluctuating

Open MahBadran93 opened this issue 4 years ago • 10 comments

Hi all,

I am using this Mask R-CNN library for detection and segmentation. My class distribution is `Class_Occurrences = {0: 189, 1: 22, 2: 1, 3: 40, 4: 28, 5: 85, 6: 40, 7: 63, 8: 42, 9: 5}` (key: class ID, value: number of occurrences). The first class, with key 0, is the background.

The dataset contains 189 training images and 53 validation images.

  1. Training run 1: 100 epochs, pre-trained COCO weights, no augmentation. Resulting mAP: 0.17.
  2. Training run 2: 100 epochs, pre-trained COCO weights, with online augmentation. Resulting mAP: 0.29. The exact augmentation pipeline is shown in the snippet below.
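This is the imgaug pipeline from run 2, reformatted for readability (it assumes imgaug imported as `iaa`, which is the form the repo's `model.train()` accepts through its `augmentation` argument):

```python
import imgaug.augmenters as iaa

# Apply between 0 and 3 of the listed augmenters to each training image.
augmentation = iaa.SomeOf((0, 3), [
    iaa.Fliplr(0.5),                      # horizontal flip half the time
    iaa.Flipud(0.5),                      # vertical flip half the time
    iaa.OneOf([iaa.Affine(rotate=90),     # rotate by 90, 180, or 270 degrees
               iaa.Affine(rotate=180),
               iaa.Affine(rotate=270)]),
    iaa.Multiply((0.8, 1.5)),             # brightness jitter
    iaa.GaussianBlur(sigma=(0.0, 5.0)),   # random Gaussian blur
])
```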

Below you can see the training and validation losses for run 2:

[Screenshots: six training-loss plots (loss1 to loss6) and the corresponding validation-loss plots (valloss1 to valloss6)]

My questions are: why is the mAP so low? What can I do to improve performance? And why does the training loss keep decreasing while the validation loss does not (it just fluctuates)? I tried adding `class_weight` to work around the data imbalance, but I always get this error: `Unknown entries in class_weight dictionary: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. Only expected following keys: []`

Model Configuration:

Name Value
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.9
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 22
IMAGE_MIN_DIM 800
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME
NUM_CLASSES 10
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 100
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 100
WEIGHT_DECAY 0.0001

MahBadran93 avatar Apr 27 '21 12:04 MahBadran93

  1. It seems pretty clear that your model is overfitting almost immediately: your validation loss is nearly double your training loss from the start. The learning rate may be too high; I would try reducing it.

I recommend this blog.

  2. mAP depends on your confidence threshold and IoU threshold. Try lowering the detection confidence threshold and visualizing some results to see if that looks better (a sketch follows this list).
  3. Your validation loss is varying wildly, most likely because your validation set is not representative of the whole dataset. I would recommend shuffling/resampling the validation split, or using a larger validation fraction.
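For the threshold point, a minimal sketch assuming the matterport `Config` API (the class name here is illustrative; in practice, subclass your own training config): lower `DETECTION_MIN_CONFIDENCE`, which is 0.9 in the config you posted.

```python
from mrcnn.config import Config

class InferenceConfig(Config):
    # Illustrative inference config; base it on your own training config.
    NAME = "inference"
    NUM_CLASSES = 10                 # background + 9 foreground classes, as above
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    DETECTION_MIN_CONFIDENCE = 0.5   # lowered from 0.9 to keep more detections
```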

TimNagle-McNaughton avatar Jun 10 '21 21:06 TimNagle-McNaughton

Hi MahBadran93,

  1. It shows some overfitting: if you draw a best-fit line through the validation loss, it goes down and then back up, while your training loss keeps going down.
  2. It also suggests the training dataset may not be representative enough, and that the model didn't learn enough to perform the task. Make sure you feed the right images to your model (see the spot-check sketch below).
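For the second point, a quick sanity-check sketch (assuming the repo's `mrcnn.visualize` module and a loaded `dataset_train`) that draws ground-truth masks for a few random images to confirm the annotations line up:

```python
import random
from mrcnn import visualize

# Spot-check a few training samples: do the masks line up with the images?
for image_id in random.sample(list(dataset_train.image_ids), 3):
    image = dataset_train.load_image(image_id)
    masks, class_ids = dataset_train.load_mask(image_id)
    visualize.display_top_masks(image, masks, class_ids, dataset_train.class_names)
```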

will6309 avatar Jul 14 '21 15:07 will6309

I have a question. Does Mask R-CNN adjust its weights based on the validation dataset after each epoch? I have a dataset divided into train, val, and test; train and val are supplied for training. If I run the model on the validation dataset, the results are quite poor, let alone on the test dataset. Does this mean the validation dataset is not used for training, and is just for us to check the val score while training is going on?

mansi-aggarwal-2504 avatar Jul 26 '21 11:07 mansi-aggarwal-2504

The validation set is used to validate training.

After each epoch, the current model is evaluated on the validation set. This evaluation tells you whether the last round of training improved the model or not. So the validation set is not explicitly used to train the model, but it is used in training, if that makes sense.
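Concretely, the repo hands the validation generator to Keras as `validation_data`, which is only evaluated, never trained on. A simplified (not verbatim) excerpt of what `MaskRCNN.train()` does internally:

```python
# Simplified from the repo's MaskRCNN.train(): the validation generator is
# passed as validation_data, so it is only used to compute val_loss after
# each epoch; no gradient updates come from it.
self.keras_model.fit_generator(
    train_generator,
    steps_per_epoch=self.config.STEPS_PER_EPOCH,
    epochs=epochs,
    validation_data=val_generator,
    validation_steps=self.config.VALIDATION_STEPS,
)
```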

TimNagle-McNaughton avatar Jul 26 '21 13:07 TimNagle-McNaughton

@TimNagle-McNaughton, thank you for your reply. So if my validation score is not improving, does the training model learn from that and adjust its weights? That would mean it learns from both the train and val datasets, and if that were so, the resulting model should not perform so poorly on the val dataset. Am I getting this right? My validation score stops improving after just 40 epochs, and the trained model is unable to segment most of the objects in the validation/test datasets. Any ideas on how to improve training?

I tried something. I wanted to retrain all layers of the backbone network on my custom dataset, for which I set TRAIN_BN = True in config.py. Am I correct here? Will this mean no layer is frozen during training? (My training call is sketched below for reference.)
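For reference, here is roughly how I call training; as far as I can tell, the `layers` argument is what actually selects which layers are trainable (a sketch using the standard matterport API):

```python
# Sketch of a matterport Mask_RCNN training call. The `layers` argument
# controls which layers train: 'heads' trains only the RPN, classifier,
# and mask heads, while 'all' also fine-tunes the entire backbone.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=100,
            layers='all')
```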

mansi-aggarwal-2504 avatar Jul 26 '21 13:07 mansi-aggarwal-2504

> So if my validation score is not improving, does the training model learn from that and adjust its weights?

Broadly, yes, in the sense that you can see (and react to) the validation score during training; the weight updates themselves come from the training data, though.

> the resulting model should not perform so poorly on the val dataset

Correct.

> For which I set TRAIN_BN = True

I'm not familiar with that flag, sorry.

TimNagle-McNaughton avatar Jul 26 '21 13:07 TimNagle-McNaughton

> > the resulting model should not perform so poorly on the val dataset
>
> Correct.

I guess my trained model just isn't good enough then, because it is in fact performing poorly on the val set. Thanks anyway @TimNagle-McNaughton

mansi-aggarwal-2504 avatar Jul 26 '21 14:07 mansi-aggarwal-2504

> It seems pretty clear that your model is overfitting almost immediately […] I would recommend shuffling/resampling the validation split, or using a larger validation fraction.

Thank you @TimNagle-McNaughton for your answer.

MahBadran93 avatar Aug 23 '21 08:08 MahBadran93

> It shows some overfitting […] Make sure you feed the right images to your model.

You are right, the dataset was not representative enough and that was the main issue.

MahBadran93 avatar Aug 23 '21 08:08 MahBadran93

Hello, I am facing this same problem. Based on the previous answers I adjusted my data split: I used 80-20 (the original split), then tried 90-10 and 70-30, but I get the same result: epoch_loss looks great, but validation_loss keeps fluctuating no matter how many epochs I train. I am only training the heads. I've read elsewhere that an overly complex model can cause this, but I don't think that argument fits here.

This is the dataset i am using https://github.com/dsmlr/Car-Parts-Segmentation/

I'd appreciate any advice on where to continue looking.

BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 35
DETECTION_MIN_CONFIDENCE 0.7
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 512
IMAGE_META_SIZE 32
IMAGE_MIN_DIM 512
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [512 512 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME car_parts
NUM_CLASSES 20
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 500
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK False
USE_RPN_ROIS True
VALIDATION_STEPS 100
WEIGHT_DECAY 0.0001

UPDATE: It was fluctuating because my dataset already includes a background annotation. When creating my custom Dataset, this produced two background classes, which broke training. With that fixed, my training no longer fluctuates.
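For anyone hitting the same thing, a sketch of the fix, assuming the standard matterport `Dataset` API and a COCO-style category list (the class and function names here are illustrative):

```python
from mrcnn import utils

class CarPartsDataset(utils.Dataset):
    def load_car_parts(self, coco_json):
        # utils.Dataset already registers class 0 as "BG", so a dataset that
        # ships its own background category must skip it here; adding it
        # again produces two background classes.
        for cat in coco_json["categories"]:
            if cat["name"].lower() == "background":
                continue
            self.add_class("car_parts", cat["id"], cat["name"])
```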

raulperezalejo avatar Aug 24 '22 18:08 raulperezalejo

I got these results (plots below). My dataset has a class-imbalance problem, but is that the only reason, or is something else going on?

Network 1: [loss and accuracy plots]

Network 2: [loss and accuracy plots]

jjavv avatar Apr 02 '23 14:04 jjavv

> I got these results (plots above). My dataset has a class-imbalance problem, but is that the only reason, or is something else going on?

Hello, I'm facing this problem too. Can you tell me how to solve it? Thanks!

Savant-HO avatar Apr 20 '23 09:04 Savant-HO

> > I got these results (plots above). My dataset has a class-imbalance problem […]
>
> Hello, I'm facing this problem too. Can you tell me how to solve it? Thanks!

I couldn't come to any conclusion.

jjavv avatar Apr 20 '23 11:04 jjavv

> > Hello, I'm facing this problem too. Can you tell me how to solve it? Thanks!
>
> I couldn't come to any conclusion.

If you solve it one day, please tell me! Thank you!

Savant-HO avatar Apr 20 '23 12:04 Savant-HO

> > I got these results (plots above). My dataset has a class-imbalance problem […]
>
> I couldn't come to any conclusion.

You need to address the data imbalance; it may well be the main reason for the poor results. Make sure you have a similar distribution for each class across train, val, and test (a quick way to check this is sketched below). You can also try augmentation.
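A sketch of such a check against the matterport `Dataset` API: count instance occurrences per class in each split and compare them.

```python
from collections import Counter

def class_distribution(dataset):
    # Count instance occurrences per class in a matterport-style Dataset.
    counts = Counter()
    for image_id in dataset.image_ids:
        _, class_ids = dataset.load_mask(image_id)  # (masks, per-instance IDs)
        counts.update(class_ids.tolist())
    return counts

# Large mismatches between splits suggest re-splitting or targeted augmentation:
# print(class_distribution(dataset_train)); print(class_distribution(dataset_val))
```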

MahBadran93 avatar Apr 20 '23 15:04 MahBadran93

> You need to address the data imbalance; it may well be the main reason for the poor results. […] You can also try augmentation.

I tried data augmentation, but a pretrained AlexNet skipped some classes entirely in the classification report, and accuracy is very low. On MNIST it gave 98%, but on my ECG dataset it was 48%, and the classification report shows precision/recall of 0 for a few classes.

jjavv avatar Apr 20 '23 15:04 jjavv

> I got these results (plots above). My dataset has a class-imbalance problem […]
>
> Hello, I'm facing this problem too. Can you tell me how to solve it? Thanks!

Hi guys, I'm facing the same issue. Here is my advice:

  1. Check your dataset: the same preprocessing (augmentation, rescaling, etc.) should be applied to all splits.
  2. Use the callback API in Keras to keep reducing the learning rate; this is what helped me (see the sketch below).
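For point 2, a sketch assuming a version of the repo whose `model.train()` accepts `custom_callbacks` (the factor/patience values are just examples):

```python
from keras.callbacks import ReduceLROnPlateau

# Halve the learning rate whenever val_loss stops improving for 5 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=5, min_lr=1e-6, verbose=1)

model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=100,
            layers='heads',
            custom_callbacks=[reduce_lr])
```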

I hope it was helpful.

2022kaishi avatar Jun 04 '23 14:06 2022kaishi