Mask_RCNN
Calculating mean Average Recall (mAR), mean Average Precision (mAP) and F1-Score
Hi guys!
I've been looking for a long time for the correct way to calculate the F1-score using the Mask-RCNN library. I created several issues (2178, 2165, 2187, 2189), studied for a long time, and I believe I found the right approach. Before presenting the code, here are the definitions I used:
mAP = mean Average Precision
mAR = mean Average Recall
f1-score = 2 * ((mAP * mAR) / (mAP + mAR))
Calculating mean Average Precision (mAP)
To calculate the mAP, I used the compute_ap function available in the utils.py module. For each image I call compute_ap, which returns the Average Precision (AP), and add it to a list. After going through all the images, I take the mean of the Average Precisions.
```python
from numpy import expand_dims, mean
from mrcnn.model import load_image_gt, mold_image
from mrcnn.utils import compute_ap


def evaluate_model(dataset, model, cfg):
    APs = []
    for image_id in dataset.image_ids:
        # Load the image with its ground-truth boxes, class ids and masks
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
        # Normalize the image and add a batch dimension
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        # Run detection and compute the AP of this image at IoU 0.5
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask,
                                                       r["rois"], r["class_ids"], r["scores"],
                                                       r['masks'], iou_threshold=0.5)
        APs.append(AP)
    mAP = mean(APs)
    return mAP
```
Where the parameters are:
- dataset: an object of a class that inherits from the Dataset class in utils.py (a minimal sketch of such a subclass is shown after this list);
- model: an object of the MaskRCNN class available in the module model.py;
- cfg: an object of a class that inherits from the Config class in config.py.
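For context, a minimal sketch of what such a Dataset subclass can look like (the class name, source label, and .npy mask layout below are illustrative assumptions, not part of the original post):

```python
import numpy as np
from mrcnn.utils import Dataset


class ToyDataset(Dataset):
    """Hypothetical single-class dataset whose instance masks are stored as .npy files."""

    def load_toy(self, image_paths, mask_paths):
        self.add_class("toy", 1, "object")
        for i, (img_path, msk_path) in enumerate(zip(image_paths, mask_paths)):
            self.add_image("toy", image_id=i, path=img_path, mask_path=msk_path)

    def load_mask(self, image_id):
        # Each .npy file is assumed to hold a boolean array of shape (H, W, num_instances)
        masks = np.load(self.image_info[image_id]["mask_path"])
        class_ids = np.ones(masks.shape[-1], dtype=np.int32)  # every instance is class "object"
        return masks, class_ids
```

As with any Dataset subclass, call prepare() after loading, and pass the instance (not the class) to evaluate_model.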
Calculating mean Average Recall (mAR)
To calculate the mAR I used the post An Introduction to Evaluation Metrics for Object Detection as a mathematical basis.
The calculation of the mAR is similar to the mAP, except that instead of analyzing precision vs. recall, we analyze the recall behavior at different IoU thresholds. In the post, Average Recall is defined as:
AR is the recall averaged over all IoU ∈ [0.5, 1.0] and can be computed as two times the area under the recall-IoU curve:
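The formula behind that definition can be written as

$$\mathrm{AR} = 2 \int_{0.5}^{1} \mathrm{recall}(o)\,\mathrm{d}o$$

where $o$ is the IoU threshold; the compute_ar function below approximates the integral as the area under the recall-IoU curve over a discrete list of thresholds.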
In code, what we need to do is create a function that calculates the Average Recall, and then follow an approach similar to mAP: we go through each of the images, calculate its Average Recall, add it to a list, and at the end take the mean to obtain the mAR.
```python
import numpy as np
from sklearn import metrics
from mrcnn.utils import compute_recall


def compute_ar(pred_boxes, gt_boxes, list_iou_thresholds):
    AR = []
    for iou_threshold in list_iou_thresholds:
        try:
            # Recall of the ground-truth boxes at this IoU threshold
            recall, _ = compute_recall(pred_boxes, gt_boxes, iou=iou_threshold)
            AR.append(recall)
        except Exception:
            # If recall cannot be computed (e.g. no predictions), count it as 0
            AR.append(0.0)
    # Area under the recall-IoU curve; the factor of 2 rescales the
    # [0.5, 1.0] threshold range back to [0, 1]
    AUC = 2 * metrics.auc(list_iou_thresholds, AR)
    return AUC
```
Basically, we call the compute_recall function from the utils.py module for each of the thresholds in the list.
Where: pred_boxes are the coordinates of the predicted bounding boxes; gt_boxes are the coordinates of the ground-truth bounding boxes; list_iou_thresholds is the list of IoU thresholds that will be used.
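For intuition about the factor of 2 in compute_ar: the thresholds span an IoU width of 1.0 - 0.5 = 0.5, so a detector with a constant recall of, say, 0.8 across all thresholds has an area of 0.8 × 0.5 = 0.4 under the recall-IoU curve, and 2 × 0.4 = 0.8 recovers the expected average recall.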
Now let's add mAR to our evaluate_model function.
```python
def evaluate_model(dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)

    APs = []
    ARs = []
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)
    mAP = mean(APs)
    mAR = mean(ARs)
    return mAP, mAR
```
Calculating F1-Score
Now that we know our mAP and mAR, just apply the f1-score formula. Let's add the f1-score formula to our evaluate_model function.
```python
def evaluate_model(dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)

    APs = []
    ARs = []
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)
    mAP = mean(APs)
    mAR = mean(ARs)
    f1_score = 2 * ((mAP * mAR) / (mAP + mAR))
    return mAP, mAR, f1_score
```
This was the way I found to calculate mAP, mAR and F1-score. What do you think? I believe I am on the right path, but I am not an expert in the area and I had a lot of difficulty reaching this result, so I welcome any kind of feedback. I hope to contribute in some way!
Hello, did this method work for you?
Hi @sohinimallick ! So far it has worked well
Big thanks for this! It's working on my end so far.
Edit: No it's not, whoops. I'm getting an error when calling evaluate_model. Within the utils.compute_ap function, there is a shape mismatch when calculating intersections. Here's the error dump:

```
~/project/2_MaskRCNN/mrcnn/utils.py in compute_overlaps_masks(masks1, masks2)
    109
    110     # intersections and union
--> 111     intersections = np.dot(masks1.T, masks2)
    112     union = area1[:, None] + area2[None, :] - intersections
    113     overlaps = intersections / union

<__array_function__ internals> in dot(*args, **kwargs)

ValueError: shapes (2,65536) and (3136,51) not aligned: 65536 (dim 1) != 3136 (dim 0)
```

I have a feeling that it's either the fact that I'm using a newer version of TF, or that the expand_dims function is not working correctly. What is the expected output when calling expand_dims?
Here's my code for reference 👇
```python
def evaluate_model(dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)
    APs = []
    ARs = []
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = modellib.load_image_gt(dataset, cfg, image_id)
        scaled_image = modellib.mold_image(image, cfg)
        sample = np.expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = utils.compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)
    mAP = mean(APs)
    mAR = mean(ARs)
    f1_score = 2 * ((mAP * mAR) / (mAP + mAR))
    return mAP, mAR, f1_score


evaluate_model(dataset, model, config)
```
Hi @wiktor-jurek
I'm using the Colab environment for training my models, and I run this command (magic cell):
%tensorflow_version 1.x
And it gives me an environment configured to work with TensorFlow 1.15.2 (Colab maintains stable versions of both TensorFlow 1 and 2). I believe the TensorFlow version may be the problem, but I also noticed that your compute_overlaps_masks function is slightly different from the one in my utils.py, so I'm sending you the utils.py of the Mask R-CNN that I have here: https://drive.google.com/file/d/1EWI3kVvBpKGBoBJ-f0rq_NpoURBszlrR/view?usp=sharing.
@wiktor-jurek I solved this by putting USE_MINI_MASK = False in both the inference and training configs.
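For anyone else hitting this, a minimal sketch of what that can look like (the class name and values are illustrative; adjust NAME and NUM_CLASSES to your dataset, and set the same flag in your training config too):

```python
from mrcnn.config import Config


class EvalConfig(Config):
    NAME = "my_dataset"      # hypothetical name
    NUM_CLASSES = 1 + 1      # background + your foreground classes
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1       # model.detect() expects len(images) == BATCH_SIZE
    USE_MINI_MASK = False    # keep full-size masks so shapes match in compute_overlaps_masks
```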
BTW @WillianaLeite... do you have any suggestions for outputting the number of TP/FP somehow?
Hello @WillianaLeite I have a question. Why use mold_image? What's the difference from just putting in [image]?
Hello @WillianaLeite, I tried the code you wrote in my own work. I have 5 classes in my dataset. When I computed the mAP using the method in issue #1839 it was 0.6, but when I tried yours I got 0.3 and the F1-score was 0. What should I do? How can I increase these values? I need the F1-score and mAP values for my thesis, please help. Thanks.
Results: (0.34649124363358835, 0.0, 0.0)
Bingo. Your utils.py made the difference. Thanks.
Hello @sain0722, I had the same question regarding mold_image and [image]. Have you received an answer? Or do you know why it is necessary to mold the images before detection? Thanks.
@sain0722 I believe mold_image does the normalization, i.e. it subtracts the mean pixel of the dataset from the image.
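For reference, this is roughly what mold_image does in model.py (paraphrased from memory; check your local copy, as versions differ):

```python
def mold_image(images, config):
    """Takes RGB images, converts them to float and subtracts
    the mean pixel defined in the config (it does not resize)."""
    return images.astype(np.float32) - config.MEAN_PIXEL
```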
The 0.0 values for mAR and F1-score reported above happen because the compute_ar() function is not able to find compute_recall(..), so the bare try/except keeps appending 0.0 to the AR list instead of raising an exception. To fix the 0.0 values in the mAR and F1-score, replace compute_recall(..) with utils.compute_recall(..) (or import compute_recall from mrcnn.utils) inside the compute_ar() function.
Hi, does anyone know how to calculate the mAP for bounding boxes? Most of the calculations I've found focus on instance masks; how would you do it for detection only? Thank you for the help.
The problem is that compute_ap does it with masks while compute_recall does it with bounding boxes, so the two are not directly comparable.
@marcojulioarg Do you disagree with this implementation? How did you do it?
Hey, actually, if you are using the Matterport TF 2.0 version from here, then you have to add USE_MINI_MASK = False inside the config and then do the rest.
@sain0722 @CZ2021 I found that in model.py, mold_image() is already applied inside the detect function when an image is passed to it, and there is another function called detect_molded() whose docstring says: "Runs the detection pipeline, but expect inputs that are molded already. Used mostly for debugging and inspecting the model". I don't know why mold_image() is applied here before using model.detect(). So should model.detect_molded() be used?
@WillianaLeite Hi! While doing my master's thesis I ran into the same problems! I found this issue and I have currently followed the same path as you. I am using detectron2 for my project, and the available metrics for instance segmentation tasks are AP and AR; in particular there is an evaluator that uses the standard COCO metrics. To translate AP and AR into F1 I ended up using your same F1 formula. Honestly I do not know if this is the right way to present a publication, but IMO AP@IoU=0.50 should be the standard metric. Average F1 could be appended, specifying the method used to obtain it.
I understood the same as you @felipetobars. I think it's not necessary to use mold_image before calling the detection function. I did some testing and got better results by not using the mold_image before.
I also got better results without using the function @guilhermemarim
Hello, looking at the detect() method in the model.py script, I see:

```python
def detect(self, images, verbose=0):
    assert self.mode == "inference", "Create model in inference mode."
    assert len(
        images) == self.config.BATCH_SIZE, "len(images) must be equal to BATCH_SIZE"

    if verbose:
        log("Processing {} images".format(len(images)))
        for image in images:
            log("image", image)

    # Mold inputs to format expected by the neural network
    molded_images, image_metas, windows = self.mold_inputs(images)

    # Validate image sizes
    # All images in a batch MUST be of the same size
    image_shape = molded_images[0].shape
    for g in molded_images[1:]:
        assert g.shape == image_shape,\
            "After resizing, all images must have the same size. Check IMAGE_RESIZE_MODE and image sizes."
```

Why exactly do you mold the image before giving it to the detect() method? The way you do it, the input image gets molded twice, right?
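Building on that observation, here is a sketch of an evaluation loop that skips the explicit mold_image call and lets detect() do the molding itself (a variant of the code above, not the original author's version):

```python
def evaluate_model_raw(dataset, model, cfg, list_iou_thresholds=None):
    """Same metrics as evaluate_model, but passes the raw image to detect()."""
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)
    APs, ARs = [], []
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(
            dataset, cfg, image_id, use_mini_mask=False)
        # detect() calls mold_inputs() internally, so no mold_image / expand_dims here
        r = model.detect([image], verbose=0)[0]
        AP, _, _, _ = compute_ap(gt_bbox, gt_class_id, gt_mask,
                                 r["rois"], r["class_ids"], r["scores"], r["masks"],
                                 iou_threshold=0.5)
        APs.append(AP)
        ARs.append(compute_ar(r["rois"], gt_bbox, list_iou_thresholds))
    mAP, mAR = np.mean(APs), np.mean(ARs)
    f1_score = 2 * mAP * mAR / (mAP + mAR)
    return mAP, mAR, f1_score
```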
Are you aware that the weighted-average or micro-average recall is just another name for the ordinary accuracy score? Or that the macro-average recall (equal weight per class irrespective of imbalance in number of instances) is just another name for the balanced accuracy score?
Compare them for yourself. They match exactly:
```
classification_report(y_tst, y_pred_tst, digits=15) =

              precision           recall              f1-score            support
0             0.817246835443038   0.683879510095995   0.744638673634889   3021
1             0.829678021465236   0.696708463949843   0.757401490947817   2552
2             0.770103092783505   0.630912162162162   0.693593314763231   1184
3             0.331294597349643   0.844155844155844   0.475841874084919   385
4             0.394505494505494   0.920512820512820   0.552307692307692   390

accuracy                                              0.700345193839618   7532
macro avg     0.628565608309383   0.755233760175333   0.644756609147710   7532
weighted avg  0.767319254559894   0.700345193839618   0.717240526308044   7532

balanced_accuracy_score(y_tst, y_pred_tst) = 0.755233760175333
```
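A minimal sketch of the sklearn calls behind the comparison above (assuming y_tst and y_pred_tst hold the true and predicted labels):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             classification_report)

print(classification_report(y_tst, y_pred_tst, digits=15))
# Plain accuracy equals the micro-average and weighted-average recall
print("accuracy:", accuracy_score(y_tst, y_pred_tst))
# Balanced accuracy equals the macro-average recall
print("balanced accuracy:", balanced_accuracy_score(y_tst, y_pred_tst))
```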
@WillianaLeite thank you so much, it works on my project
My question is: what needs to be passed into evaluate_model? I need to see an example of someone calling the function with its parameters.
Thank you.
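In case it helps, a minimal sketch of how the pieces are typically wired together (the dataset class, paths and weight file below are hypothetical placeholders, reusing the ToyDataset and EvalConfig sketches from earlier in the thread):

```python
import mrcnn.model as modellib

# dataset: an instance (not the class) of your Dataset subclass, loaded and prepared
test_set = ToyDataset()
test_set.load_toy(image_paths, mask_paths)
test_set.prepare()

# cfg: an instance of your Config subclass
cfg = EvalConfig()

# model: a MaskRCNN built in inference mode, with trained weights loaded
model = modellib.MaskRCNN(mode="inference", model_dir="./logs", config=cfg)
model.load_weights("path/to/mask_rcnn_weights.h5", by_name=True)

mAP, mAR, f1 = evaluate_model(test_set, model, cfg)
print(mAP, mAR, f1)
```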
Hey @andreaceruti! Facing the same issue atm - did you get it to work with detectron2? Thanks in advance :)
Hi @WillianaLeite ,
thanks for providing your code. I have a question regarding the compute_ap() function from mrcnn that you may be able to answer:
Does it compute the AP and mAP based on the boxes or based on the segmentation? I wrote a formula by hand myself that uses the segmentation, and I get different results from the compute_ap() built into mrcnn. However, I do not know whether I am doing something wrong in my function or whether they are just using different inputs.
Thanks and regards!
@WillianaLeite I'm sorry, but how is the compute_ar function correct? In compute_ap, the precision is calculated at a specific IoU threshold, but compute_ar loops through all the thresholds and calculates the AUC, which is returned as the AR. How does that work?
I mean, even if we want the AP across all the thresholds, why don't we use compute_ap_range instead?
But back to the main point: the recall formula looks off to me, and I would appreciate any response, especially if I'm looking at it the wrong way.
I also just observed that compute_recall doesn't return an average recall the way compute_ap returns the average precision, not to mention that compute_recall is implemented purely from the bounding boxes: it doesn't use any of the predicted masks or the ground-truth masks in its calculation. So how do we compare the results of this function with the result of compute_ap, which uses both the bounding boxes and the masks?
I would appreciate any response to this, and if I'm wrong, please let me know as well.
Hi @WillianaLeite,
I believe that your formula for computing F1-score is not accurate and is not applicable to your model. Mean average precision, or average precision for a single class is computed as an estimate of the area under the precision-recall curve. This unification is done because the precision and recall metrics are inversely proportional and change when you alter the IoU threshold.
Furthermore, the F1 score formula is used for binary classification tasks, not for object detection or segmentation. You are better off sticking to mAP and AR scores to compare your different models.
Hello. I am a beginner. I appreciate the code and ideas you provided; in my project I followed your code hoping to output the F1-score, mAP and mAR, but it reported an error. I am inexperienced and hope you can help me. Thank you very much. Here is my code:

```python
from tensorflow import expand_dims
from mrcnn.utils import Dataset
from tensorflow.python.keras.backend import mean
from build.lib.mrcnn.model import load_image_gt, mold_image
from build.lib.mrcnn.utils import compute_ap
from mrcnn.utils import compute_recall


def compute_ar(pred_boxes, gt_boxes, list_iou_thresholds):
    AR = []
    for iou_threshold in list_iou_thresholds:
        try:
            recall, _ = compute_recall(pred_boxes, gt_boxes, iou=iou_threshold)
            AR.append(recall)
        except:
            AR.append(0.0)
            pass
    AUC = 2 * (metrics.auc(list_iou_thresholds, AR))
    return AUC


def evaluate_model(Dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)
    APs = []
    ARs = []
    for image_id in Dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(Dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"],
                                                       r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)
    mAP = mean(APs)
    mAR = mean(ARs)
    f1_score = 2 * ((mAP * mAR) / (mAP + mAR))
    return mAP, mAR, f1_score


evaluate_model(Dataset, model, config)
```
The error it reports:

```
for image_id in Dataset.image_ids:
TypeError: 'property' object is not iterable
```
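For what it's worth, that TypeError usually means the Dataset class itself is being passed where an instance is expected (image_ids is a property, so it only resolves on an instance). A tiny sketch of the difference, reusing the hypothetical ToyDataset from earlier:

```python
# evaluate_model(Dataset, model, config)   # wrong: Dataset is the class, not an instance

test_set = ToyDataset()                    # build and prepare an instance instead
test_set.load_toy(image_paths, mask_paths)
test_set.prepare()
evaluate_model(test_set, model, config)
```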