splink
[FEAT] Cluster evaluation - summary statistics
Is your proposal related to a problem?
I want to better understand the clustering I get after estimating pairwise match probabilities, thresholding, and computing connected components.
Describe the solution you'd like
It's useful to consider a quasi-identifier such as a name, and to compute the following two metrics:
- Homonymy Rate: The proportion of clusters that share a name with another cluster.
- Name Variation Rate: The proportion of clusters with name variation within them.
For instance, if I know that names are quite clean in my data, then I want the name variation rate to be very low.
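To make the request concrete, here is a minimal sketch of the two metrics in plain Python. It follows the plain-language definitions above, not necessarily the exact formulas in the paper, and it is not Splink or er-evaluation code. The function name `cluster_name_metrics` and the `(cluster_id, name)` input format are assumptions made for illustration.

```python
from collections import defaultdict


def cluster_name_metrics(records):
    """Compute homonymy rate and name variation rate.

    `records` is an iterable of (cluster_id, name) pairs, one per record.
    This input format is hypothetical, chosen only for this sketch.
    """
    names_by_cluster = defaultdict(set)
    clusters_by_name = defaultdict(set)
    for cluster_id, name in records:
        names_by_cluster[cluster_id].add(name)
        clusters_by_name[name].add(cluster_id)

    n_clusters = len(names_by_cluster)
    if n_clusters == 0:
        return {"homonymy_rate": 0.0, "name_variation_rate": 0.0}

    # A cluster is homonymous if any of its names also appears in another cluster.
    homonymous = set()
    for cluster_ids in clusters_by_name.values():
        if len(cluster_ids) > 1:
            homonymous.update(cluster_ids)

    # A cluster has name variation if it contains more than one distinct name.
    varied = sum(1 for names in names_by_cluster.values() if len(names) > 1)

    return {
        "homonymy_rate": len(homonymous) / n_clusters,
        "name_variation_rate": varied / n_clusters,
    }


# Toy data: clusters 1 and 2 share "john smith"; cluster 1 has two spellings.
example = [
    (1, "john smith"),
    (1, "jon smith"),
    (2, "john smith"),
    (3, "mary jones"),
    (4, "ann lee"),
]
metrics = cluster_name_metrics(example)
print(metrics)
```

In Splink the same aggregation would presumably be expressed as SQL group-bys over the clustered output table, so that it runs in the backend instead of in pandas.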
Describe alternatives you've considered
The er-evaluation package implements both metrics, but it relies on pandas and is quite slow. The formulas are given in this paper (my paper): https://arxiv.org/pdf/2404.05622