Mask_RCNN
Calculating mean Average Recall (mAR), mean Average Precision (mAP) and F1-Score
Hi guys!
I've been looking for a long time for the correct way to calculate the F1-score using the Mask-RCNN library. I created several issues (2178, 2165, 2187, 2189), studied for a long time, and I believe I found the right approach. Before presenting the code, here are the definitions I used:
mAP = mean Average Precision
mAR = mean Average Recall
f1-score = 2 * ((mAP * mAR) / (mAP + mAR))
Calculating mean Average Precision (mAP)
To calculate the mAP, I used the compute_ap function available in the utils.py module. For each image I call compute_ap, which returns the Average Precision (AP), and add it to a list. After going through all the images, I take the mean of the Average Precisions.
```python
from numpy import expand_dims, mean
from mrcnn.model import load_image_gt, mold_image
from mrcnn.utils import compute_ap


def evaluate_model(dataset, model, cfg):
    APs = []
    for image_id in dataset.image_ids:
        # Load the image with its ground-truth boxes, class ids and masks
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
        # Normalize the image and add a batch dimension
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        # Run detection and compute the AP of this image at IoU 0.5
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask,
                                                       r["rois"], r["class_ids"], r["scores"],
                                                       r['masks'], iou_threshold=0.5)
        APs.append(AP)
    mAP = mean(APs)
    return mAP
```
Where the parameters are:
- dataset: an object of a class that inherits from the Dataset class in utils.py (a minimal sketch of such a subclass is shown after this list);
- model: an object of the MaskRCNN class available in the module model.py;
- cfg: an object of a class that inherits from the Config class in config.py.
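For context, a minimal sketch of what such a Dataset subclass can look like (the class name, source label, and .npy mask layout below are illustrative assumptions, not part of the original post):

```python
import numpy as np
from mrcnn.utils import Dataset


class ToyDataset(Dataset):
    """Hypothetical single-class dataset whose instance masks are stored as .npy files."""

    def load_toy(self, image_paths, mask_paths):
        self.add_class("toy", 1, "object")
        for i, (img_path, msk_path) in enumerate(zip(image_paths, mask_paths)):
            self.add_image("toy", image_id=i, path=img_path, mask_path=msk_path)

    def load_mask(self, image_id):
        # Each .npy file is assumed to hold a boolean array of shape (H, W, num_instances)
        masks = np.load(self.image_info[image_id]["mask_path"])
        class_ids = np.ones(masks.shape[-1], dtype=np.int32)  # every instance is class "object"
        return masks, class_ids
```

As with any Dataset subclass, call prepare() after loading, and pass the instance (not the class) to evaluate_model.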
Calculating mean Average Recall (mAR)
To calculate the mAR I used the post An Introduction to Evaluation Metrics for Object Detection as a mathematical basis.
The calculation of the mAR is similar to the mAP, except that instead of analyzing precision vs. recall, we analyze the recall behavior at different IoU thresholds. In the post, Average Recall is defined as:
AR is the recall averaged over all IoU ∈ [0.5, 1.0] and can be computed as two times the area under the recall-IoU curve:
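The formula behind that definition can be written as

$$\mathrm{AR} = 2 \int_{0.5}^{1} \mathrm{recall}(o)\,\mathrm{d}o$$

where $o$ is the IoU threshold; the compute_ar function below approximates the integral as the area under the recall-IoU curve over a discrete list of thresholds.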
In code, what we need to do is create a function that calculates the Average Recall, and then follow an approach similar to mAP: we go through each of the images, calculate its Average Recall, add it to a list, and at the end take the mean to obtain the mAR.
```python
import numpy as np
from sklearn import metrics
from mrcnn.utils import compute_recall


def compute_ar(pred_boxes, gt_boxes, list_iou_thresholds):
    AR = []
    for iou_threshold in list_iou_thresholds:
        try:
            # Recall of the ground-truth boxes at this IoU threshold
            recall, _ = compute_recall(pred_boxes, gt_boxes, iou=iou_threshold)
            AR.append(recall)
        except Exception:
            # If recall cannot be computed (e.g. no predictions), count it as 0
            AR.append(0.0)
    # Area under the recall-IoU curve; the factor of 2 rescales the
    # [0.5, 1.0] threshold range back to [0, 1]
    AUC = 2 * metrics.auc(list_iou_thresholds, AR)
    return AUC
```
Basically, we call the compute_recall function from the utils.py module for each of the thresholds in the list.
Where: pred_boxes are the coordinates of the predicted bounding boxes; gt_boxes are the coordinates of the ground-truth bounding boxes; list_iou_thresholds is the list of IoU thresholds that will be used.
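For intuition about the factor of 2 in compute_ar: the thresholds span an IoU width of 1.0 - 0.5 = 0.5, so a detector with a constant recall of, say, 0.8 across all thresholds has an area of 0.8 × 0.5 = 0.4 under the recall-IoU curve, and 2 × 0.4 = 0.8 recovers the expected average recall.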
Now let's add mAR to our evaluate_model function.
```python
def evaluate_model(dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)

    APs = []
    ARs = []
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)
    mAP = mean(APs)
    mAR = mean(ARs)
    return mAP, mAR
```
Calculating F1-Score
Now that we know our mAP and mAR, just apply the f1-score formula. Let's add the f1-score formula to our evaluate_model function.
```python
def evaluate_model(dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)

    APs = []
    ARs = []
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)
    mAP = mean(APs)
    mAR = mean(ARs)
    f1_score = 2 * ((mAP * mAR) / (mAP + mAR))
    return mAP, mAR, f1_score
```
This was the way I found to calculate mAP, mAR and F1-score. What do you think? I believe I am on the right path, but I am not an expert in the area and I had a lot of difficulty reaching this result, so I welcome any kind of feedback. I hope to contribute in some way!
Hello, did this method work for you?
Hi @sohinimallick ! So far it has worked well
Big thanks for this! It's working on my end so far.
Edit: No it's not, whoops. I'm getting an error when calling evaluate_model. Within the utils.compute_ap function, there is a shape mismatch when calculating intersections. Here's the error dump:

```
~/project/2_MaskRCNN/mrcnn/utils.py in compute_overlaps_masks(masks1, masks2)
    109
    110     # intersections and union
--> 111     intersections = np.dot(masks1.T, masks2)
    112     union = area1[:, None] + area2[None, :] - intersections
    113     overlaps = intersections / union

<__array_function__ internals> in dot(*args, **kwargs)

ValueError: shapes (2,65536) and (3136,51) not aligned: 65536 (dim 1) != 3136 (dim 0)
```

I have a feeling that it's either the fact that I'm using a newer version of TF, or that the expand_dims function is not working correctly. What is the expected output when calling expand_dims?
Here's my code for reference 👇
```python
def evaluate_model(dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)
    APs = []
    ARs = []
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = modellib.load_image_gt(dataset, cfg, image_id)
        scaled_image = modellib.mold_image(image, cfg)
        sample = np.expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = utils.compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)
    mAP = mean(APs)
    mAR = mean(ARs)
    f1_score = 2 * ((mAP * mAR) / (mAP + mAR))
    return mAP, mAR, f1_score


evaluate_model(dataset, model, config)
```
Hi @wiktor-jurek
I'm using the Colab environment for training my models, and I run this command (magic cell):
%tensorflow_version 1.x
And it gives me an environment configured to work with TensorFlow 1.15.2 (Colab maintains stable versions of both TensorFlow 1 and 2). I believe the TensorFlow version may be the problem, but I also noticed that your compute_overlaps_masks function is slightly different from the one in my utils.py, so I'm sending you the utils.py of the Mask R-CNN that I have here: https://drive.google.com/file/d/1EWI3kVvBpKGBoBJ-f0rq_NpoURBszlrR/view?usp=sharing.
@wiktor-jurek I solved this by putting USE_MINI_MASK = False in both the inference and training configs.
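For anyone else hitting this, a minimal sketch of what that can look like (the class name and values are illustrative; adjust NAME and NUM_CLASSES to your dataset, and set the same flag in your training config too):

```python
from mrcnn.config import Config


class EvalConfig(Config):
    NAME = "my_dataset"      # hypothetical name
    NUM_CLASSES = 1 + 1      # background + your foreground classes
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1       # model.detect() expects len(images) == BATCH_SIZE
    USE_MINI_MASK = False    # keep full-size masks so shapes match in compute_overlaps_masks
```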
BTW @WillianaLeite... do you have any suggestions for outputting the number of TP/FP somehow?
Hello @WillianaLeite I have a question. Why use mold_image? What's the difference from just putting in [image]?
Hello @WillianaLeite, I tried the code you wrote in my own work. I have 5 classes in my dataset. When I computed the mAP using the method in issue #1839 it was 0.6, but when I tried yours I got 0.3 and the F1-score was 0. What should I do? How can I increase these values? I need the F1-score and mAP values for my thesis, please help. Thanks.
Results: (0.34649124363358835, 0.0, 0.0)
Bingo. Your utils.py made the difference. Thanks.
Hello @sain0722, I had the same question regarding mold_image and [image]. Have you received an answer? Or do you know why it is necessary to mold the images before detection? Thanks.
@sain0722 I believe mold_image does the normalization, i.e. it subtracts the mean pixel of the dataset from the image.
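For reference, this is roughly what mold_image does in model.py (paraphrased from memory; check your local copy, as versions differ):

```python
def mold_image(images, config):
    """Takes RGB images, converts them to float and subtracts
    the mean pixel defined in the config (it does not resize)."""
    return images.astype(np.float32) - config.MEAN_PIXEL
```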
The 0.0 values for mAR and F1-score reported above happen because the compute_ar() function is not able to find compute_recall(..), so the bare try/except keeps appending 0.0 to the AR list instead of raising an exception. To fix the 0.0 values in the mAR and F1-score, replace compute_recall(..) with utils.compute_recall(..) (or import compute_recall from mrcnn.utils) inside the compute_ar() function.
Hi, does anyone know how to calculate the mAP for bounding boxes? Most of the calculations I've found focus on instance masks; how would you do it for detection only? Thank you for the help.
The problem is that compute_ap does it with masks while compute_recall does it with bounding boxes, so the two are not directly comparable.
@marcojulioarg Do you disagree with this implementation? How did you do it?
Hey, actually, if you are using the Matterport TF 2.0 version from here, then you have to add USE_MINI_MASK = False inside the config and then do the rest.
@sain0722 @CZ2021 I found that in model.py, mold_image() is already applied inside the detect function when an image is passed to it, and there is another function called detect_molded() whose docstring says: "Runs the detection pipeline, but expect inputs that are molded already. Used mostly for debugging and inspecting the model". I don't know why mold_image() is applied here before using model.detect(). So should model.detect_molded() be used?
@WillianaLeite Hi! While doing my master's thesis I ran into the same problems! I found this issue and I have currently followed the same path as you. I am using detectron2 for my project, and the available metrics for instance segmentation tasks are AP and AR; in particular there is an evaluator that uses the standard COCO metrics. To translate AP and AR into F1 I ended up using your same F1 formula. Honestly I do not know if this is the right way to present a publication, but IMO AP@IoU=0.50 should be the standard metric. Average F1 could be appended, specifying the method used to obtain it.
I understood the same as you @felipetobars. I think it's not necessary to use mold_image before calling the detection function. I did some testing and got better results by not using the mold_image before.
I also got better results without using the function @guilhermemarim
Hello, looking at the detect() method in the model.py script, I see:

```python
def detect(self, images, verbose=0):
    assert self.mode == "inference", "Create model in inference mode."
    assert len(
        images) == self.config.BATCH_SIZE, "len(images) must be equal to BATCH_SIZE"

    if verbose:
        log("Processing {} images".format(len(images)))
        for image in images:
            log("image", image)

    # Mold inputs to format expected by the neural network
    molded_images, image_metas, windows = self.mold_inputs(images)

    # Validate image sizes
    # All images in a batch MUST be of the same size
    image_shape = molded_images[0].shape
    for g in molded_images[1:]:
        assert g.shape == image_shape,\
            "After resizing, all images must have the same size. Check IMAGE_RESIZE_MODE and image sizes."
```

Why exactly do you mold the image before giving it to the detect() method? The way you do it, the input image gets molded twice, right?
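Building on that observation, here is a sketch of an evaluation loop that skips the explicit mold_image call and lets detect() do the molding itself (a variant of the code above, not the original author's version):

```python
def evaluate_model_raw(dataset, model, cfg, list_iou_thresholds=None):
    """Same metrics as evaluate_model, but passes the raw image to detect()."""
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)
    APs, ARs = [], []
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(
            dataset, cfg, image_id, use_mini_mask=False)
        # detect() calls mold_inputs() internally, so no mold_image / expand_dims here
        r = model.detect([image], verbose=0)[0]
        AP, _, _, _ = compute_ap(gt_bbox, gt_class_id, gt_mask,
                                 r["rois"], r["class_ids"], r["scores"], r["masks"],
                                 iou_threshold=0.5)
        APs.append(AP)
        ARs.append(compute_ar(r["rois"], gt_bbox, list_iou_thresholds))
    mAP, mAR = np.mean(APs), np.mean(ARs)
    f1_score = 2 * mAP * mAR / (mAP + mAR)
    return mAP, mAR, f1_score
```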
Are you aware that the weighted-average or micro-average recall is just another name for the ordinary accuracy score? Or that the macro-average recall (equal weight per class irrespective of imbalance in number of instances) is just another name for the balanced accuracy score?
Compare them for yourself. They match exactly:
```
classification_report(y_tst, y_pred_tst, digits=15) =

              precision           recall              f1-score            support
0             0.817246835443038   0.683879510095995   0.744638673634889   3021
1             0.829678021465236   0.696708463949843   0.757401490947817   2552
2             0.770103092783505   0.630912162162162   0.693593314763231   1184
3             0.331294597349643   0.844155844155844   0.475841874084919   385
4             0.394505494505494   0.920512820512820   0.552307692307692   390

accuracy                                              0.700345193839618   7532
macro avg     0.628565608309383   0.755233760175333   0.644756609147710   7532
weighted avg  0.767319254559894   0.700345193839618   0.717240526308044   7532

balanced_accuracy_score(y_tst, y_pred_tst) = 0.755233760175333
```
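A minimal sketch of the sklearn calls behind the comparison above (assuming y_tst and y_pred_tst hold the true and predicted labels):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             classification_report)

print(classification_report(y_tst, y_pred_tst, digits=15))
# Plain accuracy equals the micro-average and weighted-average recall
print("accuracy:", accuracy_score(y_tst, y_pred_tst))
# Balanced accuracy equals the macro-average recall
print("balanced accuracy:", balanced_accuracy_score(y_tst, y_pred_tst))
```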
@WillianaLeite thank you so much, it works on my project
My question is: what needs to be passed into evaluate_model? I need to see an example of someone calling the function with its parameters.
Thank you.
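In case it helps, a minimal sketch of how the pieces are typically wired together (the dataset class, paths and weight file below are hypothetical placeholders, reusing the ToyDataset and EvalConfig sketches from earlier in the thread):

```python
import mrcnn.model as modellib

# dataset: an instance (not the class) of your Dataset subclass, loaded and prepared
test_set = ToyDataset()
test_set.load_toy(image_paths, mask_paths)
test_set.prepare()

# cfg: an instance of your Config subclass
cfg = EvalConfig()

# model: a MaskRCNN built in inference mode, with trained weights loaded
model = modellib.MaskRCNN(mode="inference", model_dir="./logs", config=cfg)
model.load_weights("path/to/mask_rcnn_weights.h5", by_name=True)

mAP, mAR, f1 = evaluate_model(test_set, model, cfg)
print(mAP, mAR, f1)
```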
Hey @andreaceruti! Facing the same issue atm - did you get it to work with detectron2? Thanks in advance :)
Hi @WillianaLeite ,
thanks for providing your code. I have a question regarding the compute_ap() function from mrcnn that you may be able to answer:
Does it compute the AP and mAP based on the boxes or based on the segmentation? I wrote a formula by hand myself that uses the segmentation, and I get different results from the compute_ap() built into mrcnn. However, I do not know whether I am doing something wrong in my function or whether they are just using different inputs.
Thanks and regards!
@WillianaLeite I'm sorry, but how is the compute_ar function correct? In compute_ap, the precision is calculated at a specific IoU threshold, but compute_ar loops through all the thresholds and calculates the AUC, which is returned as the AR. How does that work?
I mean, even if we want the AP across all the thresholds, why don't we use compute_ap_range instead?
But back to the main point: the recall formula looks off to me, and I would appreciate any response, especially if I'm looking at it the wrong way.
I also just observed that compute_recall doesn't return an average recall the way compute_ap returns the average precision, not to mention that compute_recall is implemented purely from the bounding boxes: it doesn't use any of the predicted masks or the ground-truth masks in its calculation. So how do we compare the results of this function with the result of compute_ap, which uses both the bounding boxes and the masks?
I would appreciate any response to this, and if I'm wrong, please let me know as well.
Hi @WillianaLeite,
I believe that your formula for computing F1-score is not accurate and is not applicable to your model. Mean average precision, or average precision for a single class is computed as an estimate of the area under the precision-recall curve. This unification is done because the precision and recall metrics are inversely proportional and change when you alter the IoU threshold.
Furthermore, the F1 score formula is used for binary classification tasks, not for object detection or segmentation. You are better off sticking to mAP and AR scores to compare your different models.
Hello. I am a beginner. I appreciate the code and ideas you provided; in my project I followed your code hoping to output the F1-score, mAP and mAR, but it reported an error. I am inexperienced and hope you can help me. Thank you very much. Here is my code:

```python
from tensorflow import expand_dims
from mrcnn.utils import Dataset
from tensorflow.python.keras.backend import mean
from build.lib.mrcnn.model import load_image_gt, mold_image
from build.lib.mrcnn.utils import compute_ap
from mrcnn.utils import compute_recall


def compute_ar(pred_boxes, gt_boxes, list_iou_thresholds):
    AR = []
    for iou_threshold in list_iou_thresholds:
        try:
            recall, _ = compute_recall(pred_boxes, gt_boxes, iou=iou_threshold)
            AR.append(recall)
        except:
            AR.append(0.0)
            pass
    AUC = 2 * (metrics.auc(list_iou_thresholds, AR))
    return AUC


def evaluate_model(Dataset, model, cfg, list_iou_thresholds=None):
    if list_iou_thresholds is None:
        list_iou_thresholds = np.arange(0.5, 1.01, 0.1)
    APs = []
    ARs = []
    for image_id in Dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(Dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"],
                                                       r["scores"], r['masks'], iou_threshold=0.5)
        AR = compute_ar(r['rois'], gt_bbox, list_iou_thresholds)
        ARs.append(AR)
        APs.append(AP)
    mAP = mean(APs)
    mAR = mean(ARs)
    f1_score = 2 * ((mAP * mAR) / (mAP + mAR))
    return mAP, mAR, f1_score


evaluate_model(Dataset, model, config)
```
The error it reports:

```
for image_id in Dataset.image_ids:
TypeError: 'property' object is not iterable
```
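For what it's worth, that TypeError usually means the Dataset class itself is being passed where an instance is expected (image_ids is a property, so it only resolves on an instance). A tiny sketch of the difference, reusing the hypothetical ToyDataset from earlier:

```python
# evaluate_model(Dataset, model, config)   # wrong: Dataset is the class, not an instance

test_set = ToyDataset()                    # build and prepare an instance instead
test_set.load_toy(image_paths, mask_paths)
test_set.prepare()
evaluate_model(test_set, model, config)
```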