Mask_RCNN
Detections are extremely incorrect and mAP is always zero
Hi all,
I've been excited to apply this (slightly intimidating) project to some new data, but despite all the impressive results I've seen out there, I'm really struggling to get anything promising, so I suspect there's something fundamental I'm overlooking in my setup.
My dataset consists of aerial RGB shots of a city, with two classes: tree and background.
Images: aerial RGB photos, all 512x512; training set: 324, validation set: 36; trained on random 128x128 crops; ~46 trees per image on average.
Each training session ends up with something looking pretty similar to this:
With the following rough stats when testing on the validation set (no image cropping), using inspect_model.ipynb as a guide:
Original image shape: [512 512 3]
Processing 1 images
image shape: (512, 512, 3) min: 23.00000 max: 255.00000 uint8
molded_images shape: (1, 512, 512, 3) min: 23.00000 max: 255.00000 uint8
image_metas shape: (1, 14) min: 0.00000 max: 512.00000 int64
anchors shape: (1, 65280, 4) min: -0.17712 max: 1.11450 float32
gt_class_id shape: (12,) min: 1.00000 max: 1.00000 int32
gt_bbox shape: (12, 4) min: 20.00000 max: 512.00000 int32
gt_mask shape: (512, 512, 12) min: 0.00000 max: 1.00000 float64
AP @0.50: 0.000
AP @0.55: 0.000
AP @0.60: 0.000
AP @0.65: 0.000
AP @0.70: 0.000
AP @0.75: 0.000
AP @0.80: 0.000
AP @0.85: 0.000
AP @0.90: 0.000
AP @0.95: 0.000
AP @0.50-0.95: 0.000
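For reference, those AP numbers come from roughly the following, adapted from inspect_model.ipynb (dataset_val, config, model, and image_id are set up as in that notebook):

import mrcnn.model as modellib
from mrcnn import utils

# Load one validation image together with its ground truth (full-size masks)
image, image_meta, gt_class_id, gt_bbox, gt_mask = modellib.load_image_gt(
    dataset_val, config, image_id, use_mini_mask=False)

# Run detection on the single image (verbose=1 prints the "Processing 1 images" block above)
results = model.detect([image], verbose=1)
r = results[0]

# Compute AP over IoU thresholds 0.50-0.95; verbose=1 prints one line per threshold
utils.compute_ap_range(
    gt_bbox, gt_class_id, gt_mask,
    r["rois"], r["class_ids"], r["scores"], r["masks"],
    verbose=1)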
I keep getting the same results (seemingly high-confidence detections with zero or near-zero IoU, generally clustered at the tops of the images), even after following the advice I've found elsewhere in this repo for small datasets: training only the heads, initializing with COCO weights but not training for too long, adjusting my anchor scales to match the general sizes and aspect ratios of the annotations, and so on.
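For what it's worth, the anchor-scale check was nothing fancier than collecting ground-truth box sizes across the training set, something along these lines (a rough sketch; dataset_train is my prepared training dataset):

import numpy as np
from mrcnn import utils

# Collect ground-truth box sizes to sanity-check RPN_ANCHOR_SCALES against
# the typical tree footprint (in pixels).
sizes = []
for image_id in dataset_train.image_ids:
    mask, class_ids = dataset_train.load_mask(image_id)
    boxes = utils.extract_bboxes(mask)       # (N, 4) as y1, x1, y2, x2
    heights = boxes[:, 2] - boxes[:, 0]
    widths = boxes[:, 3] - boxes[:, 1]
    sizes.extend(np.sqrt(heights * widths))  # rough per-instance scale

print("median:", np.median(sizes), "5th/95th percentiles:", np.percentile(sizes, [5, 95]))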
So far I'm questioning:
- Is my dataset simply too small for the complexity of a ResNet101 backbone?
- Maybe something is up with my annotations?
- I'm screwing up a fundamental aspect of my config
- Unknown unknowns
Checking the losses, what stands out is the high overall loss (epoch_loss), which increases with each training stage (heads -> ResNet 4+ -> all layers):
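The schedule itself is roughly this (epoch counts and learning rates here are just placeholders, not my exact values):

# Stage 1: train only the heads (RPN, classifier, mask branches) from COCO weights
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=20, layers="heads")

# Stage 2: also fine-tune ResNet stage 4 and up
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=40, layers="4+")

# Stage 3: fine-tune all layers
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=60, layers="all")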
My config:
Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 8
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.5
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 8
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 128
IMAGE_META_SIZE 14
IMAGE_MIN_DIM 128
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE crop
IMAGE_SHAPE [128 128 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 101
MEAN_PIXEL [107. 105.2 101.5]
MINI_MASK_SHAPE (56, 56)
NAME tree
NUM_CLASSES 2
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 1.5]
RPN_ANCHOR_SCALES (16, 32, 64, 128)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.9
RPN_TRAIN_ANCHORS_PER_IMAGE 64
STEPS_PER_EPOCH 500
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK False
USE_RPN_ROIS True
VALIDATION_STEPS 50
WEIGHT_DECAY 0.005
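For completeness, those values come from a Config subclass along these lines (a sketch mirroring the dump above; the class name is just what I happened to use):

import numpy as np
from mrcnn.config import Config

class TreeConfig(Config):
    """Configuration for training on the tree dataset."""
    NAME = "tree"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 8                    # BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU = 8
    NUM_CLASSES = 1 + 1                   # background + tree
    IMAGE_RESIZE_MODE = "crop"
    IMAGE_MIN_DIM = 128
    IMAGE_MAX_DIM = 128
    MEAN_PIXEL = np.array([107.0, 105.2, 101.5])
    RPN_ANCHOR_SCALES = (16, 32, 64, 128)
    RPN_ANCHOR_RATIOS = [0.5, 1, 1.5]
    RPN_NMS_THRESHOLD = 0.9
    RPN_TRAIN_ANCHORS_PER_IMAGE = 64
    MAX_GT_INSTANCES = 101
    DETECTION_MIN_CONFIDENCE = 0.5
    USE_MINI_MASK = False
    STEPS_PER_EPOCH = 500
    VALIDATION_STEPS = 50
    WEIGHT_DECAY = 0.005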
So, any initial thoughts on where I'm going wrong?
I think the model either cannot learn from the annotations or has a problem saving weights. I don't quite understand your cropping process: if you cropped the images, did you adjust your annotations to match the sizes of the new images? Could you explain your dataset a little more?
Sure, my dataset consists of 360 512x512x3 RGB TIFF images plus their corresponding annotations (also 512x512 TIFFs, with integer labels). For each RGB image I have one corresponding label/mask image. When loading the masks, I isolate each individual label into its own array and then stack them, resulting in mask arrays of shape (512, 512, number of labels), which is how it's done in the nucleus.py sample project.
# (module-level imports assumed: import os, glob; import numpy as np; import tifffile as tiff)
def load_mask(self, image_id):
    """Generate instance masks for an image.
    Returns:
        masks: A bool array of shape [height, width, instance count] with
            one mask per instance.
        class_ids: a 1D array of class IDs of the instance masks.
    """
    info = self.image_info[image_id]
    # Get mask directory from image path
    mask_dir = os.path.join(os.path.dirname(os.path.dirname(info["path"])), "mask")
    # Read mask file from .tif image and separate classes into
    # individual boolean mask layers
    mask = tiff.imread(glob.glob(f"{mask_dir}/*.tif")[0]).astype("int")
    classes = np.unique(mask)
    masks = []
    for cl in classes:
        if cl > 0:
            m = np.zeros((mask.shape[0], mask.shape[1]))
            m[np.where(mask == cl)] = 1
            masks.append(m)
    masks = np.moveaxis(np.array(masks), 0, -1)
    # Return masks and an array of class IDs for each instance. Since we have
    # one class ID, we return an array of ones.
    return masks, np.ones([masks.shape[-1]], dtype=np.int32)
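To sanity-check that this produces sensible annotations, I've been spot-checking random image/mask pairs with the repo's visualize helpers, roughly like this (dataset_train is the prepared training dataset):

import random
from mrcnn import utils, visualize

# Pick a random training sample and overlay its masks and derived boxes.
image_id = random.choice(dataset_train.image_ids)
image = dataset_train.load_image(image_id)
mask, class_ids = dataset_train.load_mask(image_id)
bbox = utils.extract_bboxes(mask)

visualize.display_top_masks(image, mask, class_ids, dataset_train.class_names)
visualize.display_instances(image, bbox, mask, class_ids, dataset_train.class_names)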
There are, on average, 46 labeled trees per 512x512 image, with the maximum in any single image being 101. In the config, I chose the "crop" resize mode to further reduce the inputs to random 128x128 crops (a size I found appropriate when doing my own U-Net semantic segmentation on the same dataset). But perhaps I'm misunderstanding how the crop mode is meant to be used here?
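My (possibly wrong) understanding of what the crop mode does during training: resize_image() picks a random IMAGE_MIN_DIM x IMAGE_MIN_DIM window and returns it as a crop, which then has to be applied to the mask as well, i.e. something like:

from mrcnn import utils

# resize_image() with mode="crop" returns the random crop window it used...
image, window, scale, padding, crop = utils.resize_image(
    image,
    min_dim=config.IMAGE_MIN_DIM,
    max_dim=config.IMAGE_MAX_DIM,
    min_scale=config.IMAGE_MIN_SCALE,
    mode=config.IMAGE_RESIZE_MODE)
# ...and resize_mask() must be given that same window so the masks stay aligned
mask = utils.resize_mask(mask, scale, padding, crop)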
So far, since I have a small dataset, I've also tried training just the classifier heads, but I still can't seem to get loss/val_loss below 2, and the model's predictions still look very strange:

image ID: tree.393_5823_RGB_2020_04_08 (13) 393_5823_RGB_2020_04_08
Original image shape: [512 512 3]
Processing 1 images
image shape: (512, 512, 3) min: 13.00000 max: 255.00000 uint8
molded_images shape: (1, 512, 512, 3) min: 13.00000 max: 255.00000 uint8
image_metas shape: (1, 14) min: 0.00000 max: 512.00000 int64
anchors shape: (1, 65280, 4) min: -0.08856 max: 1.02594 float32
gt_class_id shape: (56,) min: 1.00000 max: 1.00000 int32
gt_bbox shape: (56, 4) min: 0.00000 max: 512.00000 int32
gt_mask shape: (512, 512, 56) min: 0.00000 max: 1.00000 float64
AP @0.50: 0.000
AP @0.55: 0.000
AP @0.60: 0.000
AP @0.65: 0.000
AP @0.70: 0.000
AP @0.75: 0.000
AP @0.80: 0.000
AP @0.85: 0.000
AP @0.90: 0.000
AP @0.95: 0.000
AP @0.50-0.95: 0.000
Maybe I just don't have enough training data, but I can't help but feel like there's something obviously flawed about my setup that I'm missing here...
Did you resize your masks when using the crop method? There are many functions related to cropping and resizing in model.py and utils.py. Beyond those, I think you should focus on load_image_gt(...) in model.py:
https://github.com/matterport/Mask_RCNN/blob/3deaec5d902d16e1daf56b62d5971d428dc920bc/mrcnn/model.py#L1186
and on the resize functions in utils.py that handle cropping and resizing.
Also, I suspect there is a mismatch between the mask sizes and the image sizes, because you use the original size of your images in the detection part. Check this issue: #396.
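A quick way to check that is to run load_image_gt() directly on a few samples and compare the shapes that come out, for example (a rough sketch; dataset_train, config, and image_id are yours):

import mrcnn.model as modellib

# With USE_MINI_MASK = False, the molded image and its masks should come out
# with the same height and width if the crop/resize was applied consistently.
image, image_meta, class_ids, bbox, mask = modellib.load_image_gt(
    dataset_train, config, image_id, use_mini_mask=config.USE_MINI_MASK)

print("image:", image.shape, "mask:", mask.shape, "boxes:", bbox.shape)
assert image.shape[:2] == mask.shape[:2], "masks are not cropped/resized with the image"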
@dluks Something similar happened to me regardless of the image format. It turned out that, as others have reported for similarly odd detection behavior in the issues, anything above TensorFlow 2.5 has this problem. So I downgraded TensorFlow to 2.5, Keras to 2.4.3, cudatoolkit to 11.2, and cuDNN to 8.1, and it worked out fine.
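If it helps, a quick way to confirm what your environment is actually running:

import tensorflow as tf
import keras

# Versions that worked for me: TensorFlow 2.5.x and Keras 2.4.3 (with CUDA 11.2 / cuDNN 8.1)
print("TF:", tf.__version__, "Keras:", keras.__version__)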
@nyinyinyanlin Could you specify which Mask R-CNN repository you are using? With TensorFlow 2.5 and Keras 2.4.3, I have problems with the load_weights function: the weights do not load properly.
@MatesdeSilvia Can you tell me what you mean by the weights not loading well? Could you post error logs or a screenshot? I use Lee Kun Hee's fork, but please be aware that you will have to manually install the specific versions of the libraries instead of using pip install -r requirements, as Lee Kun Hee's fork uses lower versions of TensorFlow and Keras.