splink
M values aren't trained for a column
What happens?
Hello, I am using splink to link two datasets, using mostly custom comparisons. One of my columns, "sname", is used in a comparison and in neither of my blocking rules. However, when I use EM to calculate the m values, splink says the column is used in the blocking rules (it isn't). Yet when I print the match weights chart and the parameter estimate comparisons chart, they both show values for sname. What should I believe? Are my m values trained properly or not? Am I missing something obvious?
To Reproduce
A notebook is attached (as a .txt to allow for upload), but I cannot share the data files bugged_ipynb.txt
OS:
Debian
Splink version:
3.9.14
Have you tried this on the latest master branch?
- [X] I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] I agree
The condition used to determine whether or not parameters are estimated for a comparison is whether or not any of the data columns used in its comparison levels also appear in the blocking rule used for that training session.
In your case, the sname comparison makes reference to the columns sex and mar, which also appear in your training blocking rules, and so this comparison cannot be estimated. To train the parameters for the sname comparison you will need to use a blocking rule that does not use any of the columns sname, sex, or mar, as these are the columns that the sname comparison depends on.
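As a rough illustration of that rule, here is a minimal pure-Python sketch. It is not splink's actual implementation: the helper names `columns_in_sql` and `can_estimate`, the regex-based column extraction, the comparison-level SQL, and the `dob` column are all assumptions made up for this example.

```python
import re

def columns_in_sql(sql):
    """Collect column names written with an l. or r. table prefix.

    Hypothetical helper: a simple regex, not how splink parses SQL.
    """
    return set(re.findall(r"\b[lr]\.(\w+)", sql))

def can_estimate(comparison_level_sqls, blocking_rule_sql):
    """Return True if no column used by the comparison appears in the blocking rule."""
    comparison_cols = set()
    for sql in comparison_level_sqls:
        comparison_cols |= columns_in_sql(sql)
    blocking_cols = columns_in_sql(blocking_rule_sql)
    return comparison_cols.isdisjoint(blocking_cols)

# Hypothetical levels for a custom sname comparison that also reads sex and mar
sname_levels = [
    "l.sname = r.sname",
    "l.sex = 'F' and l.mar <> r.mar and l.sname <> r.sname",
]

# Blocking on sex and mar overlaps with the comparison's columns, so it is skipped
blocked_on_sex_mar = can_estimate(sname_levels, "l.sex = r.sex and l.mar = r.mar")

# Blocking on an unrelated (hypothetical) column leaves the comparison trainable
blocked_on_dob = can_estimate(sname_levels, "l.dob = r.dob")

print(blocked_on_sex_mar, blocked_on_dob)
```

The point of the sketch is that the check looks at every column the comparison touches, not just the column the comparison is named after.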
The match weight chart (and the m u parameters chart) will show the default m-values for any comparison that has no trained values associated to it, so those will probably be what you are seeing there.
The parameter estimates chart should not show default values, and should only be displaying values that are estimated from training sessions (expectation maximisation or estimate u from random sampling) - if you do have m-values appearing there for sname, would you be able to upload an image of it?
I think possibly the distinction here is whether you're displaying from linker.match_weights_chart() (which iirc does display default values) or the charts returned by the training session:
training_session = linker.estimate_parameters_using_expectation_maximisation(block_on(["first_name"]))
training_session.match_weights_interactive_history_chart()
(which shouldn't)
I admit, it's a bit confusing that linker.match_weights_chart() shows default values; we should probably improve that somehow!
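To make the distinction concrete, here is a toy Python sketch of the two behaviours described in this thread. It does not call splink: the function names, level names, and numbers are all invented for illustration, and the default values shown are not splink's real defaults.

```python
# Hypothetical default m values for an untrained comparison (made-up numbers)
DEFAULT_M = {"exact_match": 0.95, "all_other": 0.05}

def values_for_model_chart(trained, defaults):
    """Mimic a chart of the current model: fall back to defaults where nothing was trained."""
    return {level: trained.get(level, default) for level, default in defaults.items()}

def values_for_training_chart(trained):
    """Mimic a chart of training estimates: show only what was actually estimated."""
    return dict(trained)

# sname was never trained because its columns overlapped with the blocking rules
trained_sname = {}

model_view = values_for_model_chart(trained_sname, DEFAULT_M)
training_view = values_for_training_chart(trained_sname)

print(model_view)     # default values appear even though nothing was trained
print(training_view)  # empty: no estimates exist for sname
```

So seeing m values for sname in the model-level chart does not by itself mean they were trained; the training-level charts are the ones to check.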
Thanks both for the replies, this solves it. @ADBond apologies, there were indeed no values shown for sname in parameter_estimate_comparisons_chart()