splink [FEAT] Cluster evaluation

[FEAT] Cluster evaluation - summary statistics

Open OlivierBinette opened this issue 9 months ago • 0 comments

Is your proposal related to a problem?

I want to better understand the clustering I get after estimating pairwise match probabilities, thresholding, and taking connected components.

Describe the solution you'd like

It's useful to consider a quasi-identifier such as a name, and to compute the following two metrics:

Homonymy Rate: The proportion of clusters that share a name with another cluster.
Name Variation Rate: The proportion of clusters with name variation within them.

For instance, if I know that names are quite clean in my data, then I want the name variation rate to be very low.
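
To make the two definitions above concrete, here is a minimal pure-Python sketch. It assumes the clustering is available as `(cluster_id, name)` pairs, one per record. The function name `cluster_name_metrics` is hypothetical and is not part of Splink or er-evaluation. It follows the one-line definitions in this issue literally, so it may differ in detail from the formulas in the paper linked below.

```python
from collections import defaultdict


def cluster_name_metrics(records):
    """Compute homonymy and name variation rates.

    `records` is an iterable of (cluster_id, name) pairs, one per record.
    This input shape is an assumption made for illustration.
    """
    names_by_cluster = defaultdict(set)
    clusters_by_name = defaultdict(set)
    for cluster_id, name in records:
        names_by_cluster[cluster_id].add(name)
        clusters_by_name[name].add(cluster_id)

    n_clusters = len(names_by_cluster)
    if n_clusters == 0:
        return {"homonymy_rate": 0.0, "name_variation_rate": 0.0}

    # A cluster counts as homonymous if at least one of its names
    # also appears in some other cluster.
    n_homonymous = sum(
        any(len(clusters_by_name[name]) > 1 for name in names)
        for names in names_by_cluster.values()
    )
    # A cluster has name variation if it contains more than one distinct name.
    n_varied = sum(len(names) > 1 for names in names_by_cluster.values())

    return {
        "homonymy_rate": n_homonymous / n_clusters,
        "name_variation_rate": n_varied / n_clusters,
    }


# Toy data: clusters 1 and 2 share "john smith"; only cluster 1 has two spellings.
records = [
    (1, "john smith"),
    (1, "jon smith"),
    (2, "john smith"),
    (3, "mary jones"),
    (4, "ann lee"),
    (4, "ann lee"),
]
metrics = cluster_name_metrics(records)
```

On this toy data, two of the four clusters share a name with another cluster and one of the four contains more than one spelling. For large datasets the same grouping logic could be expressed as SQL aggregations instead of Python loops.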

Describe alternatives you've considered

The er-evaluation package implements both metrics, but it relies on pandas and is quite slow. The formulas are given in my paper: https://arxiv.org/pdf/2404.05622

May 20 '24 13:05 OlivierBinette
