splink
[FEAT] Addition to training rules topic guide to prevent violating independence assumption
Is your proposal related to a problem?
Currently, when training m parameters using EM estimation, Splink will not train the m values for a comparison if any of its comparison levels use a column that is being blocked on, as this would violate the independence assumption. However, Splink will still train these parameters if the blocking rule uses a column that was derived from the original column.
For example:
- I have a column date_of_birth, and I split it into the columns birth_year, birth_month and birth_day.
- I have the comparison:
date_of_birth = {
cll.exact_match_level("date_of_birth"),
cll.else_level()
}
- I have a training blocking rule:
birth_year_l = birth_year_r AND first_name_l = first_name_r
The m values for the date_of_birth comparison will then be trained incorrectly, as Splink cannot know that birth_year was derived from date_of_birth.
Instead, I must write the blocking rule as something like
LEFT(date_of_birth_l, 4) = LEFT(date_of_birth_r, 4) AND first_name_l = first_name_r
so that Splink can see that date_of_birth is being blocked on and does not train its parameters.
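To illustrate why the derived column slips through, here is a minimal sketch of a purely name-based check. It is not Splink's actual implementation: `columns_in_rule` and `blocks_on_comparison` are hypothetical helpers, and the sketch assumes columns are referenced with `_l` / `_r` suffixes as in the rules above.

```python
import re


def columns_in_rule(rule: str) -> set[str]:
    """Return the base column names referenced as <name>_l or <name>_r in a rule."""
    return {m.group(1) for m in re.finditer(r"\b(\w+?)_[lr]\b", rule)}


def blocks_on_comparison(comparison_columns: set[str], rule: str) -> bool:
    """True if the rule references any column used by the comparison."""
    return bool(columns_in_rule(rule) & comparison_columns)


derived_rule = "birth_year_l = birth_year_r AND first_name_l = first_name_r"
explicit_rule = (
    "LEFT(date_of_birth_l, 4) = LEFT(date_of_birth_r, 4) "
    "AND first_name_l = first_name_r"
)

# The derived column hides the dependency, so the check finds no overlap
# and the date_of_birth m values would still be trained.
print(blocks_on_comparison({"date_of_birth"}, derived_rule))

# Referencing date_of_birth directly makes the overlap visible.
print(blocks_on_comparison({"date_of_birth"}, explicit_rule))
```

Any check that only looks at column names has this blind spot, which is why the guide should warn about derived columns.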
Describe the solution you'd like
This should be added to the topic guide on training blocking rules to prevent users doing this unknowingly.
Further suggestion by @aliceoleary0: could the user supply the comparison with a list of related columns, so that if birth_year is used in a blocking rule, the date_of_birth comparison is ignored during training?
This would allow the user to think about this issue once (when creating the comparisons) and then not have to worry about their blocking rules.
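A rough sketch of how the related-columns idea could behave, purely for discussion. `ComparisonSpec`, `related_columns` and `comparisons_to_train` are hypothetical names and do not exist in Splink; the `_l` / `_r` suffix parsing is also an assumption taken from the example rules above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ComparisonSpec:
    # Hypothetical container for illustration, not a Splink class.
    name: str
    columns: set[str]
    # Columns the user declares as derived from (or related to) `columns`.
    related_columns: set[str] = field(default_factory=set)


def comparisons_to_train(specs: list[ComparisonSpec], blocking_rule: str) -> list[str]:
    """Names of comparisons that share no column, direct or related, with the rule."""
    blocked = {m.group(1) for m in re.finditer(r"\b(\w+?)_[lr]\b", blocking_rule)}
    return [s.name for s in specs if not blocked & (s.columns | s.related_columns)]


specs = [
    ComparisonSpec(
        "date_of_birth",
        {"date_of_birth"},
        related_columns={"birth_year", "birth_month", "birth_day"},
    ),
    ComparisonSpec("first_name", {"first_name"}),
    ComparisonSpec("surname", {"surname"}),
]

rule = "birth_year_l = birth_year_r AND first_name_l = first_name_r"

# date_of_birth is skipped because birth_year was declared as related,
# and first_name is skipped because it is blocked on directly.
print(comparisons_to_train(specs, rule))
```

With the relationship declared once on the comparison, the blocking rule can use the derived column without the user having to rewrite it.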