
Explore blocking rules performance metrics

RossKen opened this issue · 1 comment

Is your proposal related to a problem?

In general, it is difficult to assess the quality of your blocking rules. Beyond counting the record comparisons generated by each rule, we largely just hope that we have caught all of the potential matches.

The ONS Data Linking Journal Club had a session exploring the paper below:

2021_Dasylva__Estimating_the_false_negatives_due_to_blocking_in_record_linkage.pdf

This looks at a method for estimating the number of true matches excluded by a set of blocking rules, which could be a useful metric when deciding how tight blocking rules should be.
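As a point of reference for what such a metric measures: where labeled ground truth is available, the matches lost to blocking can be counted directly (the paper's estimator targets the harder, unlabeled case). Below is a minimal, self-contained sketch of that direct count; all record IDs, field names, and the `blocked_pairs` helper are made up for illustration and are not part of Splink or the paper.

```python
from collections import defaultdict


def blocked_pairs(records, key):
    """Candidate pairs whose blocking-key values agree exactly."""
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        buckets[rec[key]].append(rec_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(frozenset((ids[i], ids[j])))
    return pairs


# Toy data: record 3 is a true match for record 1, but a typo in the
# surname and a different city mean no blocking rule pairs them up.
records = {
    1: {"surname": "smith", "city": "leeds"},
    2: {"surname": "smith", "city": "york"},
    3: {"surname": "smyth", "city": "hull"},
    4: {"surname": "jones", "city": "york"},
}
true_matches = {frozenset((1, 2)), frozenset((1, 3))}

# Union of candidates from two blocking rules.
candidates = blocked_pairs(records, "surname") | blocked_pairs(records, "city")

# True matches excluded by blocking, and the implied blocking recall.
false_negatives = true_matches - candidates
recall = 1 - len(false_negatives) / len(true_matches)
```

The paper's contribution is estimating something like `len(false_negatives)` without needing `true_matches` to be known in advance.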

Describe the solution you'd like

This issue is not intended to result in a specific feature; rather, it is a prompt to explore the ideas in the paper above more thoroughly. If, after investigation, it seems worth implementing, create a new issue with a specific output defined.

Questions to consider:

  • Is the methodology effective? Does it remain effective on more realistic (i.e. messy) data?
  • Can the method be expressed in SQL (and in all of the backend dialects)?
  • Is the method computationally practical? Section 5 of the paper describes an EM procedure applied to the complete dataset, which is slightly concerning given that EM estimation is already the most expensive part of Splink.
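On the SQL question: the pair-generation side of any such metric reduces to self-joins, which translate across dialects. A minimal sketch, here using Python's built-in SQLite as a stand-in backend, counts candidate comparisons per blocking rule; the table name, columns, and rules are illustrative only, not Splink's actual generated SQL.

```python
import sqlite3

# In-memory stand-in for a backend; any SQL dialect supports this join shape.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (id INTEGER, surname TEXT, city TEXT)")
con.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [(1, "smith", "leeds"), (2, "smith", "york"),
     (3, "smyth", "hull"), (4, "jones", "york")],
)

# Each blocking rule is a join condition on a self-join of the table.
blocking_rules = {
    "surname": "l.surname = r.surname",
    "city": "l.city = r.city",
}

counts = {}
for name, rule in blocking_rules.items():
    # l.id < r.id de-duplicates pairs and drops self-comparisons.
    (n,) = con.execute(
        f"SELECT count(*) FROM df AS l JOIN df AS r ON {rule} AND l.id < r.id"
    ).fetchone()
    counts[name] = n
```

Whether the estimation step itself (rather than just the counting) can be pushed down into SQL is exactly the open question above.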

RossKen · Feb 16 '23 14:02