Can't sensibly supply `comparison_levels_to_reverse_blocking_rule` in Splink 4
In linker.estimate_parameters_using_expectation_maximisation there is an option to manually supply comparison_levels_to_reverse_blocking_rule, which takes ComparisonLevel objects. However, in Splink 4 most users won't deal with these objects directly; instead they use ComparisonLevelCreator objects, which build them behind the scenes.
Right now, a user would have to do something like this:
...
linker = Linker(df, settings, db_api)
linker.estimate_parameters_using_expectation_maximisation(
    "l.postcode = r.postcode",
    comparison_levels_to_reverse_blocking_rule=[linker._settings_obj.comparisons[0].comparison_levels[2], ...]
)
My proposal is to give each ComparisonLevel a (unique) name, which we can use to refer to it. The name would be generated systematically if not user-supplied. Comparison levels would then have a fully unique name in the format "{comparison_name}.{comparison_level_name}".
We already have output_column_name for Comparison which works this way, but I wonder if we shouldn't also include a name for consistency (but maybe that just complicates things, idk).
With this the above snippet would be something like:
...
linker = Linker(df, settings, db_api)
linker.estimate_parameters_using_expectation_maximisation(
    "l.postcode = r.postcode",
    comparison_levels_to_reverse_blocking_rule=["location.exact_match", ...]
)
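To make the naming idea concrete, here is a minimal sketch of how "{comparison_name}.{comparison_level_name}" strings could be resolved back to level objects. The `Level` and `Comparison` classes and the `build_level_index` / `resolve_levels` helpers are hypothetical stand-ins invented for illustration; they are not Splink classes or an existing Splink API.

```python
from dataclasses import dataclass, field


@dataclass
class Level:
    """Hypothetical stand-in for a comparison level; only carries a name."""
    name: str  # e.g. "exact_match"


@dataclass
class Comparison:
    """Hypothetical stand-in for a comparison holding named levels."""
    name: str  # e.g. "location"
    levels: list = field(default_factory=list)


def build_level_index(comparisons):
    """Map '{comparison_name}.{level_name}' to the level object."""
    index = {}
    for comparison in comparisons:
        for level in comparison.levels:
            key = f"{comparison.name}.{level.name}"
            if key in index:
                # Names must be fully unique for the lookup to be unambiguous.
                raise ValueError(f"Duplicate comparison level name: {key}")
            index[key] = level
    return index


def resolve_levels(comparisons, names):
    """Turn a list of dotted names into the corresponding level objects."""
    index = build_level_index(comparisons)
    missing = [n for n in names if n not in index]
    if missing:
        raise KeyError(f"Unknown comparison level name(s): {missing}")
    return [index[n] for n in names]


comparisons = [
    Comparison("location", [Level("null"), Level("exact_match"), Level("else")]),
    Comparison("first_name", [Level("null"), Level("exact_match"), Level("else")]),
]
resolved = resolve_levels(comparisons, ["location.exact_match"])
```

Failing loudly on duplicate or unknown names seems preferable here, since a silently mis-resolved level would skew the training adjustment without any visible error.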
This would also mean we can use this to get levels/comparisons directly as we may sometimes wish to do, without needing to go via gamma-values (and remember the numbering scheme).
Whilst there are some edge cases in which this setting may be useful, I think it could be removed.
When Splink 3 was first written, it was assumed that the user wanted to train lambda (probability_two_random_records_match) using EM. We therefore needed to implement both an upward adjustment to probability_two_random_records_match for training, and then reverse this back out to estimate lambda. We no longer advise this and instead suggest the use of linker.estimate_probability_two_random_records_match.
In terms of the high level purpose:
- We have a global probability_two_random_records_match
- When EM training, we need a probability_two_random_records_match specific to the blocking rule, which is much higher than the global probability_two_random_records_match
- We allow probability_two_random_records_match to vary during EM training but then throw away the final value.
- But it's desirable for the starting value for probability_two_random_records_match to be close to the true value, so some adjustment is merited
- Assuming conditional independence we can work out the upward adjustment from the global probability_two_random_records_match by looking at the u parameter on exact match
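As a rough illustration of the last bullet, the upward adjustment can be sketched as a Bayes update: the global prior is converted to odds, multiplied by the Bayes factor m/u of each exact-match level implied by the blocking rule, and converted back to a probability. The function name `blocked_prior_from_u` is made up for this sketch, and defaulting m to 1 is a simplifying assumption, not necessarily what Splink does internally.

```python
def blocked_prior_from_u(global_prior, u_values, m_values=None):
    """Sketch: starting prior among pairs that satisfy the blocking rule.

    global_prior: global probability_two_random_records_match (0 < p < 1).
    u_values: u probability of the exact-match level for each blocked column.
    m_values: matching m probabilities; assumed to be 1.0 if not given.
    """
    if not 0.0 < global_prior < 1.0:
        raise ValueError("global_prior must be strictly between 0 and 1")
    if m_values is None:
        m_values = [1.0] * len(u_values)  # assumption: true matches always agree

    # Conditional independence: per-column Bayes factors simply multiply.
    bayes_factor = 1.0
    for m, u in zip(m_values, u_values):
        bayes_factor *= m / u

    prior_odds = global_prior / (1.0 - global_prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Blocking on postcode, where two random non-matching records share a
# postcode about 1 time in 1,000 (illustrative number only).
adjusted = blocked_prior_from_u(global_prior=0.0001, u_values=[0.001])
```

With these illustrative inputs the starting prior rises from 0.0001 to roughly 0.09, which shows why an unadjusted global value would be a poor starting point for EM within a blocking rule.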
But I'm starting to wonder whether we can get to this in a better and simpler way: namely, by simply computing the reduction in the number of comparisons that results from the blocking rule, i.e. how many comparisons are generated with no blocking rule versus how many are generated by the EM training blocking rule.
This methodology would also get around the fact that the current approach assumes conditional independence. For example, it looks separately for an exact match on first name and surname and multiplies them, but in reality these are correlated.
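A minimal sketch of the counting idea, for deduplicating a single dataset: count all possible pairs, count the pairs generated by the blocking rule, and scale the global prior by the ratio. The helper names are hypothetical, and the scaling assumes that essentially all true matches satisfy the blocking rule, which is an assumption of this sketch rather than something established above.

```python
from collections import Counter


def count_all_pairs(n_records):
    """Number of pairwise comparisons with no blocking rule (dedupe only)."""
    return n_records * (n_records - 1) // 2


def count_blocked_pairs(blocking_keys):
    """Number of pairs sharing a blocking key; null keys never match."""
    group_sizes = Counter(k for k in blocking_keys if k is not None)
    return sum(size * (size - 1) // 2 for size in group_sizes.values())


def blocked_prior_from_counts(global_prior, n_all_pairs, n_blocked_pairs):
    """Scale the global prior by the reduction in comparisons.

    Assumes (for this sketch) that nearly all true matches survive blocking,
    so the expected number of matches is unchanged by the blocking rule.
    """
    if n_blocked_pairs == 0:
        raise ValueError("Blocking rule generates no comparisons")
    expected_matches = global_prior * n_all_pairs
    return min(1.0, expected_matches / n_blocked_pairs)


# Toy data: one blocking key value per record.
postcodes = ["AB1", "AB1", "AB1", "CD2", "CD2", "EF3", None, "GH4"]
n_all = count_all_pairs(len(postcodes))
n_blocked = count_blocked_pairs(postcodes)
starting_prior = blocked_prior_from_counts(0.05, n_all, n_blocked)
```

For a rule on several columns, the key could be a tuple such as (first_name, surname). Counting pairs over the composite key reflects any correlation between the columns directly, with no need to multiply separate per-column u values.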
Removed in #2272