[FEAT] Cluster evaluation - summary statistics

Open OlivierBinette opened this issue 9 months ago • 0 comments

Is your proposal related to a problem?

I want to better understand the clustering I get after estimating pairwise match probabilities, thresholding, and taking connected components.

Describe the solution you'd like

It's useful to consider a quasi-identifier such as a name, and to compute the following two metrics:

  • Homonymy Rate: The proportion of clusters that share a name with another cluster.
  • Name Variation Rate: The proportion of clusters with name variation within them.

For instance, if I know that names are quite clean in my data, then I want the name variation rate to be very low.
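To make the two definitions concrete, here is a minimal sketch in plain Python. It follows the one-line definitions above literally; the exact formulas in the paper may differ in details such as weighting or normalization. The function name `cluster_name_rates` and the `(cluster_id, name)` input layout are hypothetical, and this is not the API of er-evaluation or Splink.

```python
from collections import defaultdict


def cluster_name_rates(records):
    """Return (homonymy_rate, name_variation_rate).

    `records` is an iterable of (cluster_id, name) pairs, one per record.
    Assumption: names are already normalized and contain no missing values.
    """
    names_by_cluster = defaultdict(set)  # cluster_id -> distinct names in it
    clusters_by_name = defaultdict(set)  # name -> clusters containing it

    for cluster_id, name in records:
        names_by_cluster[cluster_id].add(name)
        clusters_by_name[name].add(cluster_id)

    n_clusters = len(names_by_cluster)
    if n_clusters == 0:
        raise ValueError("no records given")

    # A cluster is homonymous if any of its names also appears in another cluster.
    homonymous = set()
    for clusters in clusters_by_name.values():
        if len(clusters) > 1:
            homonymous.update(clusters)

    # A cluster has name variation if it contains more than one distinct name.
    n_varied = sum(1 for names in names_by_cluster.values() if len(names) > 1)

    return len(homonymous) / n_clusters, n_varied / n_clusters
```

For example, with records `[(1, "ann"), (1, "anne"), (2, "ann"), (3, "bob"), (4, "carl")]`, clusters 1 and 2 share the name "ann" and only cluster 1 contains two distinct names, so the function returns `(0.5, 0.25)`. Both quantities reduce to group-by counts, so the same logic could presumably be expressed in SQL rather than pandas.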

Describe alternatives you've considered

The er-evaluation package implements both metrics, but it relies on pandas and is quite slow. The formulas are given in this paper (my paper): https://arxiv.org/pdf/2404.05622

OlivierBinette, May 20 '24 13:05