Mask_RCNN
Detections are extremely incorrect and mAP is always zero
Hi all,
I've been excited to apply this (slightly intimidating) project to some new data, but despite all the impressive results I've seen out there, I'm really struggling to get anything promising, so I suspect there's something fundamental I'm overlooking in my setup.
My dataset consists of aerial RGB shots of a city, with two classes: tree and background.
Images: aerial RGB photos, all 512x512; training set: 324, validation set: 36; trained on random 128x128 crops; ~46 trees per image on average.
Each training session ends up with something looking pretty similar to this:
With the following rough stats when testing on the validation set (no image cropping), using inspect_model.ipynb as a guide:
Original image shape: [512 512 3]
Processing 1 images
image shape: (512, 512, 3) min: 23.00000 max: 255.00000 uint8
molded_images shape: (1, 512, 512, 3) min: 23.00000 max: 255.00000 uint8
image_metas shape: (1, 14) min: 0.00000 max: 512.00000 int64
anchors shape: (1, 65280, 4) min: -0.17712 max: 1.11450 float32
gt_class_id shape: (12,) min: 1.00000 max: 1.00000 int32
gt_bbox shape: (12, 4) min: 20.00000 max: 512.00000 int32
gt_mask shape: (512, 512, 12) min: 0.00000 max: 1.00000 float64
AP @0.50: 0.000
AP @0.55: 0.000
AP @0.60: 0.000
AP @0.65: 0.000
AP @0.70: 0.000
AP @0.75: 0.000
AP @0.80: 0.000
AP @0.85: 0.000
AP @0.90: 0.000
AP @0.95: 0.000
AP @0.50-0.95: 0.000
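For reference, those AP numbers come from roughly the following, adapted from inspect_model.ipynb (dataset_val, config, model, and image_id are set up as in that notebook):

import mrcnn.model as modellib
from mrcnn import utils

# Load one validation image together with its ground truth (full-size masks)
image, image_meta, gt_class_id, gt_bbox, gt_mask = modellib.load_image_gt(
    dataset_val, config, image_id, use_mini_mask=False)

# Run detection on the single image (verbose=1 prints the "Processing 1 images" block above)
results = model.detect([image], verbose=1)
r = results[0]

# Compute AP over IoU thresholds 0.50-0.95; verbose=1 prints one line per threshold
utils.compute_ap_range(
    gt_bbox, gt_class_id, gt_mask,
    r["rois"], r["class_ids"], r["scores"], r["masks"],
    verbose=1)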
I keep getting the same results (seemingly high-confidence detections with zero or near-zero IoU, generally clustered at the tops of the images), even after following the advice I've found elsewhere in this repo for small datasets: training only the heads, initializing with COCO weights but not training for too long, adjusting my anchor scales to match the general sizes and aspect ratios of the annotations, and so on.
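For what it's worth, the anchor-scale check was nothing fancier than collecting ground-truth box sizes across the training set, something along these lines (a rough sketch; dataset_train is my prepared training dataset):

import numpy as np
from mrcnn import utils

# Collect ground-truth box sizes to sanity-check RPN_ANCHOR_SCALES against
# the typical tree footprint (in pixels).
sizes = []
for image_id in dataset_train.image_ids:
    mask, class_ids = dataset_train.load_mask(image_id)
    boxes = utils.extract_bboxes(mask)       # (N, 4) as y1, x1, y2, x2
    heights = boxes[:, 2] - boxes[:, 0]
    widths = boxes[:, 3] - boxes[:, 1]
    sizes.extend(np.sqrt(heights * widths))  # rough per-instance scale

print("median:", np.median(sizes), "5th/95th percentiles:", np.percentile(sizes, [5, 95]))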
So far I'm questioning:
- Is my dataset simply too small for the complexity of a ResNet101 backbone?
- Maybe something is up with my annotations?
- I'm screwing up a fundamental aspect of my config
- Unknown unknowns
Checking the losses, what stands out is the high overall loss (epoch_loss), which increases with each training stage (heads -> ResNet 4+ -> all layers):
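The schedule itself is roughly this (epoch counts and learning rates here are just placeholders, not my exact values):

# Stage 1: train only the heads (RPN, classifier, mask branches) from COCO weights
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=20, layers="heads")

# Stage 2: also fine-tune ResNet stage 4 and up
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=40, layers="4+")

# Stage 3: fine-tune all layers
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=60, layers="all")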
My config:
Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 8
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.5
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 8
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 128
IMAGE_META_SIZE 14
IMAGE_MIN_DIM 128
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE crop
IMAGE_SHAPE [128 128 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 101
MEAN_PIXEL [107. 105.2 101.5]
MINI_MASK_SHAPE (56, 56)
NAME tree
NUM_CLASSES 2
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 1.5]
RPN_ANCHOR_SCALES (16, 32, 64, 128)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.9
RPN_TRAIN_ANCHORS_PER_IMAGE 64
STEPS_PER_EPOCH 500
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK False
USE_RPN_ROIS True
VALIDATION_STEPS 50
WEIGHT_DECAY 0.005
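For completeness, those values come from a Config subclass along these lines (a sketch mirroring the dump above; the class name is just what I happened to use):

import numpy as np
from mrcnn.config import Config

class TreeConfig(Config):
    """Configuration for training on the tree dataset."""
    NAME = "tree"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 8                    # BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU = 8
    NUM_CLASSES = 1 + 1                   # background + tree
    IMAGE_RESIZE_MODE = "crop"
    IMAGE_MIN_DIM = 128
    IMAGE_MAX_DIM = 128
    MEAN_PIXEL = np.array([107.0, 105.2, 101.5])
    RPN_ANCHOR_SCALES = (16, 32, 64, 128)
    RPN_ANCHOR_RATIOS = [0.5, 1, 1.5]
    RPN_NMS_THRESHOLD = 0.9
    RPN_TRAIN_ANCHORS_PER_IMAGE = 64
    MAX_GT_INSTANCES = 101
    DETECTION_MIN_CONFIDENCE = 0.5
    USE_MINI_MASK = False
    STEPS_PER_EPOCH = 500
    VALIDATION_STEPS = 50
    WEIGHT_DECAY = 0.005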
So, any initial thoughts on where I'm going wrong?
I think the model either cannot learn from the annotations or has a problem saving weights. I don't quite understand your cropping process: if you cropped the images, did you adjust your annotations to match the sizes of the new images? Could you explain your dataset a little more?
Sure, my dataset consists of 360 512x512x3 RGB TIFF images plus their corresponding annotations (also 512x512 TIFFs, with integer labels). For each RGB image I have one corresponding label/mask image. When loading the masks, I isolate each individual label into its own array and then stack them, resulting in mask arrays of shape (512, 512, number of labels), which is how it's done in the nucleus.py sample project.
# (module-level imports assumed: import os, glob; import numpy as np; import tifffile as tiff)
def load_mask(self, image_id):
    """Generate instance masks for an image.
    Returns:
        masks: A bool array of shape [height, width, instance count] with
            one mask per instance.
        class_ids: a 1D array of class IDs of the instance masks.
    """
    info = self.image_info[image_id]
    # Get mask directory from image path
    mask_dir = os.path.join(os.path.dirname(os.path.dirname(info["path"])), "mask")
    # Read mask file from .tif image and separate classes into
    # individual boolean mask layers
    mask = tiff.imread(glob.glob(f"{mask_dir}/*.tif")[0]).astype("int")
    classes = np.unique(mask)
    masks = []
    for cl in classes:
        if cl > 0:
            m = np.zeros((mask.shape[0], mask.shape[1]))
            m[np.where(mask == cl)] = 1
            masks.append(m)
    masks = np.moveaxis(np.array(masks), 0, -1)
    # Return masks and an array of class IDs for each instance. Since we have
    # one class ID, we return an array of ones.
    return masks, np.ones([masks.shape[-1]], dtype=np.int32)
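To sanity-check that this produces sensible annotations, I've been spot-checking random image/mask pairs with the repo's visualize helpers, roughly like this (dataset_train is the prepared training dataset):

import random
from mrcnn import utils, visualize

# Pick a random training sample and overlay its masks and derived boxes.
image_id = random.choice(dataset_train.image_ids)
image = dataset_train.load_image(image_id)
mask, class_ids = dataset_train.load_mask(image_id)
bbox = utils.extract_bboxes(mask)

visualize.display_top_masks(image, mask, class_ids, dataset_train.class_names)
visualize.display_instances(image, bbox, mask, class_ids, dataset_train.class_names)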
There are, on average, 46 labeled trees per 512x512 image, with the maximum in any single image being 101. In the config, I chose the "crop" resize mode to further reduce the inputs to random 128x128 crops (a size I found appropriate when doing my own U-Net semantic segmentation on the same dataset). But perhaps I'm misunderstanding how the crop mode is meant to be used here?
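My (possibly wrong) understanding of what the crop mode does during training: resize_image() picks a random IMAGE_MIN_DIM x IMAGE_MIN_DIM window and returns it as a crop, which then has to be applied to the mask as well, i.e. something like:

from mrcnn import utils

# resize_image() with mode="crop" returns the random crop window it used...
image, window, scale, padding, crop = utils.resize_image(
    image,
    min_dim=config.IMAGE_MIN_DIM,
    max_dim=config.IMAGE_MAX_DIM,
    min_scale=config.IMAGE_MIN_SCALE,
    mode=config.IMAGE_RESIZE_MODE)
# ...and resize_mask() must be given that same window so the masks stay aligned
mask = utils.resize_mask(mask, scale, padding, crop)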
So far, since I have a small dataset, I've also tried training just the classifier heads, but I still can't seem to get loss/val_loss below 2, and the model's predictions still look very strange:

image ID: tree.393_5823_RGB_2020_04_08 (13) 393_5823_RGB_2020_04_08
Original image shape: [512 512 3]
Processing 1 images
image shape: (512, 512, 3) min: 13.00000 max: 255.00000 uint8
molded_images shape: (1, 512, 512, 3) min: 13.00000 max: 255.00000 uint8
image_metas shape: (1, 14) min: 0.00000 max: 512.00000 int64
anchors shape: (1, 65280, 4) min: -0.08856 max: 1.02594 float32
gt_class_id shape: (56,) min: 1.00000 max: 1.00000 int32
gt_bbox shape: (56, 4) min: 0.00000 max: 512.00000 int32
gt_mask shape: (512, 512, 56) min: 0.00000 max: 1.00000 float64
AP @0.50: 0.000
AP @0.55: 0.000
AP @0.60: 0.000
AP @0.65: 0.000
AP @0.70: 0.000
AP @0.75: 0.000
AP @0.80: 0.000
AP @0.85: 0.000
AP @0.90: 0.000
AP @0.95: 0.000
AP @0.50-0.95: 0.000
Maybe I just don't have enough training data, but I can't help but feel like there's something obviously flawed about my setup that I'm missing here...
Did you resize your masks when using the crop method? There are many functions related to cropping and resizing in model.py and utils.py. Beyond those, I think you should focus on load_image_gt(...) in model.py:
https://github.com/matterport/Mask_RCNN/blob/3deaec5d902d16e1daf56b62d5971d428dc920bc/mrcnn/model.py#L1186
and on the resize functions in utils.py that handle cropping and resizing.
Also, I suspect there is a mismatch between the mask sizes and the image sizes, because you use the original size of your images in the detection part. Check this issue: #396.
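A quick way to check that is to run load_image_gt() directly on a few samples and compare the shapes that come out, for example (a rough sketch; dataset_train, config, and image_id are yours):

import mrcnn.model as modellib

# With USE_MINI_MASK = False, the molded image and its masks should come out
# with the same height and width if the crop/resize was applied consistently.
image, image_meta, class_ids, bbox, mask = modellib.load_image_gt(
    dataset_train, config, image_id, use_mini_mask=config.USE_MINI_MASK)

print("image:", image.shape, "mask:", mask.shape, "boxes:", bbox.shape)
assert image.shape[:2] == mask.shape[:2], "masks are not cropped/resized with the image"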
@dluks Something similar happened to me regardless of the image format. It turned out that, as others have reported for similarly odd detection behavior in the issues, anything above TensorFlow 2.5 has this problem. So I downgraded TensorFlow to 2.5, Keras to 2.4.3, cudatoolkit to 11.2, and cuDNN to 8.1, and it worked out fine.
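If it helps, a quick way to confirm what your environment is actually running:

import tensorflow as tf
import keras

# Versions that worked for me: TensorFlow 2.5.x and Keras 2.4.3 (with CUDA 11.2 / cuDNN 8.1)
print("TF:", tf.__version__, "Keras:", keras.__version__)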
@nyinyinyanlin Could you specify which Mask R-CNN repository you are using? With TensorFlow 2.5 and Keras 2.4.3, I have problems with the load_weights function: the weights do not load properly.
@MatesdeSilvia Can you tell me what you mean by the weights not loading well? Could you post error logs or a screenshot? I use Lee Kun Hee's fork, but please be aware that you will have to manually install the specific versions of the libraries instead of using pip install -r requirements, as Lee Kun Hee's fork uses lower versions of TensorFlow and Keras.