Can't sensibly supply `comparison_levels_to_reverse_blocking_rule` in Splink 4

Open ADBond opened this issue 2 years ago • 1 comments

In linker.estimate_parameters_using_expectation_maximisation there is an option to manually supply comparison_levels_to_reverse_blocking_rule, which takes ComparisonLevel objects. However, in Splink 4 most users won't deal with these objects directly; instead they use ComparisonLevelCreator objects, which build them behind the scenes.

Right now, a user would have to do something like this:

...
linker = Linker(df, settings, db_api)
linker.estimate_parameters_using_expectation_maximisation(
    "l.postcode = r.postcode",
    comparison_levels_to_reverse_blocking_rule=[linker._settings_obj.comparisons[0].comparison_levels[2], ...]
)

My proposal is to introduce a (unique) name for each ComparisonLevel, which we can use to refer to these - this will be systematically created if not user-supplied. Comparison levels would have a fully unique name in the format "{comparison_name}.{comparison_level_name}". We already have output_column_name for Comparison, which works this way, but I wonder if we shouldn't also include a name there for consistency (but maybe that just complicates things, idk).

With this the above snippet would be something like:

...
linker = Linker(df, settings, db_api)
linker.estimate_parameters_using_expectation_maximisation(
    "l.postcode = r.postcode",
    comparison_levels_to_reverse_blocking_rule=["location.exact_match", ...]
)

This would also mean we can use this to get levels/comparisons directly as we may sometimes wish to do, without needing to go via gamma-values (and remember the numbering scheme).
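As a rough sketch of how such qualified names could be resolved, the snippet below builds a lookup from "{comparison_name}.{comparison_level_name}" to the level object. The Level and Comparison classes and the build_level_index helper are hypothetical stand-ins written for illustration; they are not Splink's classes or API.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for illustration only; these are not Splink's classes.
@dataclass
class Level:
    name: str


@dataclass
class Comparison:
    name: str
    levels: list = field(default_factory=list)


def build_level_index(comparisons):
    """Map '{comparison_name}.{comparison_level_name}' to its level object."""
    index = {}
    for comparison in comparisons:
        for level in comparison.levels:
            qualified = f"{comparison.name}.{level.name}"
            # Names must be unique for the lookup to be unambiguous.
            if qualified in index:
                raise ValueError(f"Duplicate comparison level name: {qualified}")
            index[qualified] = level
    return index


comparisons = [
    Comparison("location", [Level("null"), Level("exact_match"), Level("else")]),
    Comparison("first_name", [Level("null"), Level("exact_match"), Level("else")]),
]
index = build_level_index(comparisons)

# Resolve user-supplied strings to level objects, with no gamma values involved.
levels_to_reverse = [index[name] for name in ["location.exact_match"]]
```

Rejecting duplicates at index-build time is one possible design choice; it means a clash surfaces when the settings are loaded rather than when a name is looked up.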

ADBond avatar Mar 01 '24 17:03 ADBond

Whilst there are some edge cases in which this setting may be useful, I think it can probably be removed.

When Splink 3 was first written, it was assumed that the user wanted to train lambda (probability_two_random_records_match) using EM. We therefore needed to implement both an upward adjustment to probability_two_random_records_match for training, and then reverse this back out to estimate lambda. We now no longer advise this and instead suggest the use of linker.estimate_probability_two_random_records_match.

In terms of the high level purpose:

  • We have a global probability_two_random_records_match
  • When EM training, we need a probability_two_random_records_match specific to the blocking rule, which is much higher than the global probability_two_random_records_match
  • We allow probability_two_random_records_match to vary during EM training but then throw away the final value.
  • But it's desirable for the starting value for probability_two_random_records_match to be close to the true value, so some adjustment is merited
  • Assuming conditional independence we can work out the upward adjustment from the global probability_two_random_records_match by looking at the u parameter on exact match
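The adjustment described in the list above can be sketched as a prior-odds update. This is an illustration under stated assumptions, not Splink's implementation: the function names are hypothetical, the m and u values are invented, and each column in the blocking rule is treated as an independent exact-match level. Setting m to 1 reduces the adjustment to one driven by the u parameter alone.

```python
def adjust_prior_for_blocking_rule(global_prior, level_params):
    """Prior among pairs that satisfy the blocking rule.

    level_params holds one (m, u) pair per exact-match level implied by the
    rule. Multiplying the Bayes factors assumes conditional independence.
    """
    odds = global_prior / (1 - global_prior)
    for m, u in level_params:
        odds *= m / u  # Bayes factor for an exact match on this level
    return odds / (1 + odds)


def reverse_adjustment(blocked_prior, level_params):
    """Undo the adjustment to recover the global prior."""
    odds = blocked_prior / (1 - blocked_prior)
    for m, u in level_params:
        odds *= u / m
    return odds / (1 + odds)


# Made-up values: a rare global match probability and two blocked-on columns.
global_prior = 0.0001
params = [(0.9, 0.01), (0.9, 0.005)]

blocked_prior = adjust_prior_for_blocking_rule(global_prior, params)
recovered = reverse_adjustment(blocked_prior, params)
```

With these invented numbers the blocked prior is several thousand times larger than the global one, which shows why starting EM from the unadjusted global value would be a poor initialisation.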

But I'm starting to wonder whether we can get to this in a better and simpler way - namely simply computing the reduction in the number of comparisons that results from the blocking rule, i.e. how many comparisons there are with no blocking rule vs how many the EM training blocking rule generates.

This methodology would also get around the fact that the current approach assumes conditional independence, e.g. it looks separately for an exact match on first name and surname and multiplies them, but in reality these are correlated.
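A minimal sketch of the count-based idea follows. It assumes the blocking rule retains a known fraction of the true matches (all of them by default), so the expected number of matches is spread over fewer comparisons. The function name, the recall parameter, and the counts are hypothetical and chosen only for illustration.

```python
def prior_from_comparison_counts(global_prior, n_unblocked, n_blocked, recall=1.0):
    """Estimate the match probability among pairs generated by a blocking rule.

    n_unblocked: number of comparisons with no blocking rule
    n_blocked:   number of comparisons generated by the EM training rule
    recall:      assumed fraction of true matches that the rule retains
    """
    if n_blocked <= 0:
        raise ValueError("n_blocked must be positive")
    expected_matches = global_prior * n_unblocked * recall
    # A probability cannot exceed 1, so cap the estimate.
    return min(1.0, expected_matches / n_blocked)


# Made-up counts: 10 million unblocked comparisons reduced to 5,000.
starting_prior = prior_from_comparison_counts(
    global_prior=0.0001, n_unblocked=10_000_000, n_blocked=5_000
)
```

Because this uses only comparison counts, it needs no m or u values and makes no independence assumption across columns, although it does rely on the assumed recall of the blocking rule.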

RobinL avatar Mar 04 '24 12:03 RobinL

Removed in #2272

ADBond avatar Jul 25 '24 07:07 ADBond