Add BenchmarkEvaluator with basic precision/recall computation
Summary
This PR introduces a utility class BenchmarkEvaluator in supervision/metrics/benchmark.py to support benchmarking object detection results across different datasets or models.
Features
- Computes basic precision and recall
- Accepts Detections objects for ground truth and predictions
- Optional support for class mapping and IoU thresholding (future extensions)
- Includes a unit test at tests/metrics/test_benchmark.py
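For orientation, here is a minimal sketch of how an evaluator like this could be driven with supervision Detections objects. The BenchmarkEvaluator constructor and evaluate method in the commented lines are assumed names for illustration, not necessarily the API introduced by this PR:

```python
import numpy as np
import supervision as sv

# Toy ground truth and predictions as supervision Detections objects.
ground_truth = sv.Detections(
    xyxy=np.array([[10, 10, 50, 50], [60, 60, 100, 100]], dtype=float),
    class_id=np.array([0, 1]),
)
predictions = sv.Detections(
    xyxy=np.array([[12, 11, 49, 52], [200, 200, 240, 240]], dtype=float),
    class_id=np.array([0, 1]),
    confidence=np.array([0.90, 0.40]),
)

# Hypothetical evaluator API (names assumed for illustration only):
# evaluator = BenchmarkEvaluator(iou_threshold=0.5)
# result = evaluator.evaluate(predictions=predictions, ground_truth=ground_truth)
# print(result.precision, result.recall)
```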
Motivation
Addresses Issue #1778: Improving object detection benchmarking process for unrelated datasets.
Let me know if you'd like me to extend this in future PRs with:
- mAP, F1, or per-class metrics
- Confusion matrix visualization
- Colab notebook example
Thanks for the opportunity to contribute!
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
Muhammed Swalihu does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
Hi @SkalskiP @onuralpszr — I've submitted this PR for the BenchmarkEvaluator (Issue #1778). Let me know if you'd like me to fix the pre-commit error or extend this further. Thanks for reviewing!
Hi @Muhammedswalihu, this seems like a really valuable feature! Could you please replace the placeholder logic with a working implementation and provide a working example and test cases? Then we can review the PR.
Hi @soumik12345 , thanks for the review!
I’ll go ahead and:
- Replace the placeholder logic in BenchmarkEvaluator with full precision/recall/mAP computation
- Add a working demo example (maybe in a Colab notebook for clarity)
- Improve the test coverage with more edge cases and per-class evaluation
Let me know if there’s anything specific you’d like to see included. Appreciate the opportunity — excited to take this further!
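For reference, one common way to compute AP (and from it mAP) is the area under an interpolated precision-recall curve. The sketch below is a generic VOC-style implementation, not necessarily the logic this PR will adopt, and the helper name is illustrative:

```python
import numpy as np

def average_precision(tp_flags: np.ndarray, confidences: np.ndarray, num_gt: int) -> float:
    """AP as the area under an all-point interpolated precision-recall curve."""
    order = np.argsort(-confidences)                 # rank detections by descending confidence
    tp = tp_flags[order].astype(float)
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # Pad the curve, then make precision monotonically non-increasing.
    recall = np.concatenate(([0.0], recall, [1.0]))
    precision = np.concatenate(([0.0], precision, [0.0]))
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate precision over the recall steps.
    steps = np.where(recall[1:] != recall[:-1])[0]
    return float(np.sum((recall[steps + 1] - recall[steps]) * precision[steps + 1]))

# mAP would then be the mean of average_precision over classes (and, for COCO-style
# mAP, additionally averaged over several IoU thresholds).
```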
Check out this pull request on ReviewNB.
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
Hi @soumik12345, I've added a Colab-style demo notebook BenchmarkEvaluator_Demo.ipynb!
It includes:
- How to import and use the BenchmarkEvaluator
- Per-class precision and recall visualization
- A visual example comparing predicted and ground truth bounding boxes
This should help users understand and adopt the module more easily.
Let me know if you'd like me to polish or extend this notebook further!
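As an illustration of the per-class visualization idea, here is a minimal matplotlib sketch; the class names and metric values are made up for the example, not results from the notebook:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical per-class results; in the notebook these would come from the evaluator.
per_class = {"person": (0.91, 0.84), "car": (0.78, 0.80), "dog": (0.65, 0.58)}  # (precision, recall)

labels = list(per_class.keys())
precision = [per_class[c][0] for c in labels]
recall = [per_class[c][1] for c in labels]
x = np.arange(len(labels))

plt.bar(x - 0.2, precision, width=0.4, label="precision")
plt.bar(x + 0.2, recall, width=0.4, label="recall")
plt.xticks(x, labels)
plt.ylim(0, 1)
plt.legend()
plt.title("Per-class precision / recall")
plt.show()
```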
Great initiative on the BenchmarkEvaluator! This addresses a crucial need for standardized evaluation metrics. I'd like to offer some technical guidance to help you complete the implementation effectively.
Key Implementation Recommendations:
- IoU-based Matching Algorithm: For proper TP/FP/FN computation, you'll need Hungarian assignment or greedy matching based on IoU thresholds (a fuller sketch follows this list):

```python
def compute_matches(pred_boxes, gt_boxes, iou_threshold=0.5):
    # Compute IoU matrix
    # Apply optimal assignment (e.g., scipy.optimize.linear_sum_assignment)
    # Return matched pairs, unmatched predictions (FP), unmatched ground truth (FN)
    ...
```

- Multi-class Support: Consider class-aware matching for per-class metrics:
  - Group detections by class_id
  - Compute metrics separately for each class
  - Aggregate for overall performance
- Confidence Thresholding: Implement confidence-based filtering for realistic evaluation scenarios
- Standard Metrics: Beyond precision/recall, consider adding:
  - F1-score
  - Average Precision (AP) at different IoU thresholds
  - Mean Average Precision (mAP)
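To make the matching recommendation concrete, here is a self-contained sketch of how compute_matches could be fleshed out with numpy and scipy's linear_sum_assignment. The helper names and the Hungarian (rather than greedy) strategy are illustrative choices, not the PR's implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou_matrix(pred_boxes: np.ndarray, gt_boxes: np.ndarray) -> np.ndarray:
    """Pairwise IoU between (N, 4) predicted and (M, 4) ground-truth xyxy boxes."""
    lt = np.maximum(pred_boxes[:, None, :2], gt_boxes[None, :, :2])   # top-left of intersection
    rb = np.minimum(pred_boxes[:, None, 2:], gt_boxes[None, :, 2:])   # bottom-right of intersection
    inter = np.clip(rb - lt, 0, None).prod(axis=2)
    area_pred = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_gt = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    union = area_pred[:, None] + area_gt[None, :] - inter
    return inter / np.maximum(union, 1e-9)

def compute_matches(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Hungarian matching on IoU; returns matched pairs, unmatched preds (FP), unmatched GT (FN)."""
    pred_boxes = np.asarray(pred_boxes, dtype=float)
    gt_boxes = np.asarray(gt_boxes, dtype=float)
    if len(pred_boxes) == 0 or len(gt_boxes) == 0:
        return [], list(range(len(pred_boxes))), list(range(len(gt_boxes)))
    iou = box_iou_matrix(pred_boxes, gt_boxes)
    pred_idx, gt_idx = linear_sum_assignment(-iou)          # maximize total IoU
    matches = [(p, g) for p, g in zip(pred_idx, gt_idx) if iou[p, g] >= iou_threshold]
    matched_pred = {p for p, _ in matches}
    matched_gt = {g for _, g in matches}
    false_positives = [p for p in range(len(pred_boxes)) if p not in matched_pred]
    false_negatives = [g for g in range(len(gt_boxes)) if g not in matched_gt]
    return matches, false_positives, false_negatives
```

Per-class metrics would then just call compute_matches once per class_id group and aggregate the resulting TP/FP/FN counts.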
Performance Considerations:
- Vectorized IoU computation using numpy/supervision utilities
- Batch processing for large evaluation sets
- Memory-efficient handling of detection arrays
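On the vectorized-IoU and memory points: supervision ships a pairwise IoU helper, box_iou_batch (in supervision.detection.utils in current releases), and large evaluation sets can be processed in chunks so the full IoU matrix is never materialized at once. A rough sketch, assuming that helper and its (num_gt, num_pred) output shape:

```python
import numpy as np
from supervision.detection.utils import box_iou_batch  # assumed import path for current supervision versions

def best_gt_iou_chunked(pred_boxes: np.ndarray, gt_boxes: np.ndarray, chunk_size: int = 1024) -> np.ndarray:
    """Best IoU against any ground-truth box for each prediction, computed in chunks
    so that the full (num_gt, num_pred) IoU matrix is never held in memory at once."""
    if len(gt_boxes) == 0:
        return np.zeros(len(pred_boxes), dtype=float)
    best = np.zeros(len(pred_boxes), dtype=float)
    for start in range(0, len(pred_boxes), chunk_size):
        chunk = pred_boxes[start:start + chunk_size]
        block = box_iou_batch(gt_boxes, chunk)       # assumed shape: (num_gt, len(chunk))
        best[start:start + chunk_size] = block.max(axis=0)
    return best
```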
This evaluator will be invaluable for the community's benchmarking needs. Happy to provide more specific implementation details if needed!
Best regards, Gabriel