
[FEAT] Addition to training rules topic guide to prevent violating independence assumption

Open sama-ds opened this issue 1 year ago • 1 comments

Is your proposal related to a problem?

Currently, when training m parameters using EM estimation, Splink will not train the m values for a comparison if any of its comparison levels use a column that also appears in the training blocking rule, as this would violate the independence assumption. However, Splink will still train these parameters if the blocking rule uses a column derived from the original column.

For example:

  • I have a column date_of_birth, which I split into the derived columns birth_year, birth_month and birth_day
  • I have the comparison:
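# Comparison written as a settings dictionary; cll is the comparison level library
date_of_birth_comparison = {
    "output_column_name": "date_of_birth",
    "comparison_levels": [
        cll.exact_match_level("date_of_birth"),
        cll.else_level(),
    ],
}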
  • I have a training blocking rule: birth_year_l = birth_year_r AND first_name_l = first_name_r

Splink will then incorrectly train the m values for the date_of_birth comparison, because it cannot know that birth_year was derived from date_of_birth.

Instead, I must write the blocking rule as something like LEFT(date_of_birth_l, 4) = LEFT(date_of_birth_r, 4) AND first_name_l = first_name_r, so that Splink can see that date_of_birth is being blocked on and will not train its parameters.
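To illustrate why the second form is detectable and the first is not, here is a minimal sketch in plain Python. It is not Splink's actual implementation: the helper names `columns_in_blocking_rule` and `comparisons_to_skip` are hypothetical, and it assumes rules use the `_l` / `_r` suffix notation written in this issue and that column matching is done by name only.

```python
import re


def columns_in_blocking_rule(rule: str) -> set:
    """Return base column names that appear with an _l or _r suffix in the rule.

    Hypothetical helper: a simple regex scan, not a real SQL parser.
    """
    return {m.group(1) for m in re.finditer(r"\b(\w+?)_[lr]\b", rule)}


def comparisons_to_skip(rule: str, comparison_columns: dict) -> set:
    """Return comparisons whose input columns are referenced by the rule."""
    used = columns_in_blocking_rule(rule)
    return {name for name, cols in comparison_columns.items() if used & set(cols)}


# Each comparison mapped to the input columns its levels use (illustrative).
comparison_columns = {
    "date_of_birth": ["date_of_birth"],
    "first_name": ["first_name"],
}

derived_rule = "birth_year_l = birth_year_r AND first_name_l = first_name_r"
explicit_rule = (
    "LEFT(date_of_birth_l, 4) = LEFT(date_of_birth_r, 4) "
    "AND first_name_l = first_name_r"
)

# The derived column hides the dependency on date_of_birth.
print(sorted(comparisons_to_skip(derived_rule, comparison_columns)))
# → ['first_name']

# Referencing date_of_birth directly makes the dependency visible.
print(sorted(comparisons_to_skip(explicit_rule, comparison_columns)))
# → ['date_of_birth', 'first_name']
```

A name-based check like this has no way to link birth_year back to date_of_birth, which is why the derived-column rule slips through.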

Describe the solution you'd like

This should be added to the topic guide on training blocking rules to prevent users doing this unknowingly.

sama-ds avatar Oct 11 '23 11:10 sama-ds

Further suggestion by @aliceoleary0 - could the user supply the comparison with a list of related columns to instruct it that if birth_year is used, then the date_of_birth comparison should be ignored?

This would allow the user to think about this issue once (when creating the comparisons) and then not have to worry about their blocking rules.
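A rough sketch of how this might behave, in plain Python. Nothing here is an existing Splink option: the `related_columns` key, the dictionary layout and the helper names are all hypothetical, and the rule is assumed to use the `_l` / `_r` suffix notation from the example above.

```python
import re


def referenced_columns(rule: str) -> set:
    """Base column names that appear with an _l or _r suffix in the rule."""
    return {m.group(1) for m in re.finditer(r"\b(\w+?)_[lr]\b", rule)}


def comparisons_to_skip(rule: str, comparisons: dict) -> set:
    """Comparisons that should not be trained under this blocking rule.

    A comparison is skipped if the rule references one of its own columns
    or one of the user-declared related (derived) columns.
    """
    used = referenced_columns(rule)
    skipped = set()
    for name, spec in comparisons.items():
        watched = set(spec["columns"]) | set(spec.get("related_columns", []))
        if used & watched:
            skipped.add(name)
    return skipped


# Hypothetical declaration: the user lists derived columns once, per comparison.
comparisons = {
    "date_of_birth": {
        "columns": ["date_of_birth"],
        "related_columns": ["birth_year", "birth_month", "birth_day"],
    },
    "first_name": {"columns": ["first_name"]},
}

rule = "birth_year_l = birth_year_r AND first_name_l = first_name_r"
print(sorted(comparisons_to_skip(rule, comparisons)))
# → ['date_of_birth', 'first_name']
```

With the relationship declared on the comparison, a rule that blocks on birth_year would exclude date_of_birth from training without the user rewriting the rule.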

samnlindsay avatar Nov 23 '23 09:11 samnlindsay