PolyFuzz

Analyse precision recall curve

Open KoenLoeffen opened this issue 1 year ago • 1 comments

I have two questions:

  1. The precision-recall curve is a trade-off between the minimum similarity and the percentage matched. So in the ideal case you want both precision and recall to be as high as possible. However, I found in my results that the model with the highest precision and recall isn't always the best. Am I missing something?
  2. How would I set the optimal threshold for the similarity? Is this also based on the precision-recall curve?

KoenLoeffen avatar May 26 '23 04:05 KoenLoeffen

The precision-recall curve is a trade-off between the minimum similarity and the percentage matched. So in the ideal case you want both precision and recall to be as high as possible. However, I found in my results that the model with the highest precision and recall isn't always the best. Am I missing something?

The precision-recall curve is an approximation, because we do not have the ground truth available. Ideally we still want both values to be as high as possible, but since the curve is only an approximation, the model that scores highest on it is not guaranteed to be the best one.

How would I set the optimal threshold for the similarity? Is this also based on the precision-recall curve?

Yes, that is the main purpose of the precision-recall curve as defined in PolyFuzz. It helps you understand which similarity threshold gives you a certain number of matches, and how accurate those matches are likely to be relative to other thresholds.
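
As a rough illustration of the trade-off described above, the sketch below computes, for a few candidate thresholds, the fraction of pairs that would still count as matched, and then picks the highest threshold that keeps a chosen fraction. It is not PolyFuzz's own implementation: the helper names, the example scores, and the 50% coverage target are all hypothetical, and the scores are assumed to be similarity values such as those a fitted model reports for each pair.

```python
def matched_fraction_curve(similarities, thresholds):
    """Return (threshold, fraction of pairs with similarity >= threshold)."""
    n = len(similarities)
    return [(t, sum(s >= t for s in similarities) / n) for t in thresholds]


def highest_threshold_with_coverage(similarities, thresholds, min_fraction):
    """Pick the strictest threshold that still keeps at least min_fraction of pairs.

    Returns None if no candidate threshold reaches the requested coverage.
    """
    curve = matched_fraction_curve(similarities, thresholds)
    candidates = [t for t, fraction in curve if fraction >= min_fraction]
    return max(candidates) if candidates else None


# Hypothetical similarity scores for six matched pairs.
similarities = [0.95, 0.91, 0.84, 0.77, 0.62, 0.40]
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]

curve = matched_fraction_curve(similarities, thresholds)
best = highest_threshold_with_coverage(similarities, thresholds, min_fraction=0.5)

for threshold, fraction in curve:
    print(f"min similarity {threshold:.1f}: {fraction:.0%} of pairs matched")
print("chosen threshold:", best)
```

Because the scores are only a proxy for correctness, it is worth manually checking a sample of matches just above and just below the chosen threshold before relying on it.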

MaartenGr avatar May 28 '23 04:05 MaartenGr