PolyFuzz
Analyse precision recall curve
I have two questions:
- The precision-recall curve is a trade-off between the minimum similarity and the percentage matched. So in the ideal case you want both precision and recall to be as high as possible. However, I found in my results that the model with the highest precision and recall isn't always the best. Am I missing something?
- How would I set the optimal threshold for the similarity? Is this also based on the precision recall curve?
The precision-recall curve is a trade-off between the minimum similarity and the percentage matched. So in the ideal case you want both precision and recall to be as high as possible. However, I found in my results that the model with the highest precision and recall isn't always the best. Am I missing something?
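The precision-recall curve is an approximation, because we do not have the ground truth available. Ideally we still want it to be as high as possible, but since it is built from similarity scores rather than verified matches, the model with the highest curve is not guaranteed to produce the best matches.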
How would I set the optimal threshold for the similarity? Is this also based on the precision recall curve?
Yes, that is the main purpose of the precision-recall curve as defined in PolyFuzz. It helps you understand which threshold you would need to get a certain number of matches, and the relative accuracy of the results at that threshold.
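To make the threshold choice concrete, here is a minimal sketch in plain Python. It does not call PolyFuzz; it only assumes you already have a list of similarity scores (for example, the similarity values from your matching results). The function names `matched_fraction` and `strictest_threshold` and the example scores are hypothetical, chosen for illustration, and the grid search is just one simple way to read a threshold off the curve.

```python
def matched_fraction(similarities, threshold):
    """Fraction of pairs whose similarity is at least `threshold`."""
    if not similarities:
        return 0.0
    kept = sum(1 for s in similarities if s >= threshold)
    return kept / len(similarities)


def strictest_threshold(similarities, target_fraction, steps=100):
    """Highest threshold on an evenly spaced grid in [0, 1] that still
    keeps at least `target_fraction` of the pairs."""
    best = 0.0
    for i in range(steps + 1):
        threshold = i / steps
        # matched_fraction never increases as the threshold rises, so the
        # last threshold that passes this check is the strictest one.
        if matched_fraction(similarities, threshold) >= target_fraction:
            best = threshold
    return best


# Hypothetical similarity scores, one per matched pair.
scores = [0.95, 0.90, 0.82, 0.75, 0.60, 0.40, 0.30, 0.10]

# One point on the curve: share of pairs kept at a minimum similarity of 0.75.
print(matched_fraction(scores, 0.75))

# Strictest threshold that still keeps at least half of the pairs.
print(strictest_threshold(scores, 0.5))
```

Because the scores are not checked against a ground truth, a threshold picked this way is a starting point. Inspecting a sample of the matches just above and below it is a reasonable sanity check.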