M values aren't trained for a column

Open lamaeldo opened this issue 9 months ago • 2 comments

What happens?

Hello, I am using splink to link two datasets, using mostly custom comparisons. One of my columns, "sname", is used in a comparison but in neither of my blocking rules. However, when I use EM to calculate the m values, splink says the column is used in the blocking rules (it isn't). Yet when I print the match weights chart and the parameter estimate comparisons chart, they both show values for sname. What should I believe? Are my m values trained properly or not? Am I missing something obvious?

To Reproduce

A notebook is attached (as a .txt to allow for upload), but I cannot share the data files: bugged_ipynb.txt

OS:

Debian

Splink version:

3.9.14

Have you tried this on the latest master branch?

  • [X] I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] I agree

lamaeldo avatar Apr 28 '24 16:04 lamaeldo

The condition used to determine whether or not parameters are estimated for a comparison is whether or not any of the data columns used in its comparison levels also appear in the training blocking rule.

In your case, the sname comparison makes reference to the columns sex and mar, which also appear in your training blocking rules, and so this comparison cannot be estimated. To train the parameters for the sname comparison you will need to use a blocking rule that does not use any of the columns sname, sex, or mar, as these are the columns that the sname comparison depends on.
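
As a rough illustration of that rule, here is a sketch in plain Python. It is not Splink's actual implementation: the helper name `estimable_comparisons` and the `fname` and `dob` columns are made up for the example. The idea is only that a comparison is skipped in a training session whenever the columns it reads overlap with the columns in the training blocking rule:

```python
def estimable_comparisons(comparison_columns, blocking_columns):
    """Return the names of comparisons that share no column with the blocking rule."""
    blocked = set(blocking_columns)
    return [
        name
        for name, cols in comparison_columns.items()
        if not (set(cols) & blocked)
    ]


# Hypothetical settings: the sname comparison also reads sex and mar.
comparison_columns = {
    "sname": {"sname", "sex", "mar"},
    "fname": {"fname"},
    "dob": {"dob"},
}

# Blocking on sex and mar: sname is skipped even though sname itself is not blocked on.
trained_when_blocking_on_sex_mar = estimable_comparisons(
    comparison_columns, ["sex", "mar"]
)

# Blocking on dob only: sname can now be estimated, but dob cannot.
trained_when_blocking_on_dob = estimable_comparisons(comparison_columns, ["dob"])
```

Under these assumptions, the first session would estimate only `fname` and `dob`, and the second only `sname` and `fname`, which is why a blocking rule that avoids sname, sex, and mar is needed to train sname.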

The match weights chart (and the m u parameters chart) will show the default m-values for any comparison that has no trained values associated with it, so those will probably be what you are seeing there.

The parameter estimates chart should not show default values, and should only be displaying values that are estimated from training sessions (expectation maximisation or estimate u from random sampling) - if you do have m-values appearing there for sname, would you be able to upload an image of it?

ADBond avatar Apr 29 '24 08:04 ADBond

I think possibly the distinction here is whether you're displaying from linker.match_weights_chart() (which iirc does display default values) or the charts returned by the training session:

training_session = linker.estimate_parameters_using_expectation_maximisation(block_on(["first_name"]))
training_session.match_weights_interactive_history_chart()

(which shouldn't)

I admit it's a bit confusing that linker.match_weights_chart() shows default values. We should probably improve that somehow!
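
To make the distinction concrete, here is a toy model in plain Python. The function names and the numbers are hypothetical and are not Splink internals. A chart that falls back to defaults shows a value for every comparison, while a chart restricted to training output omits comparisons that were never estimated:

```python
# Made-up default m-values and a made-up training result (only fname was estimated).
DEFAULT_M = {"sname": 0.95, "fname": 0.95}
trained_m = {"fname": 0.88}


def values_with_default_fallback(defaults, trained):
    """Trained values override defaults; untrained comparisons still get a value."""
    return {**defaults, **trained}


def values_from_training_only(trained):
    """Only comparisons that were actually estimated appear."""
    return dict(trained)


all_values = values_with_default_fallback(DEFAULT_M, trained_m)
trained_only = values_from_training_only(trained_m)
```

In this sketch `all_values` still contains an entry for sname (the default), whereas `trained_only` does not, which mirrors seeing sname in one chart but not in the other.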

RobinL avatar Apr 29 '24 09:04 RobinL

Thanks both for the replies, this solves it. @ADBond apologies, there were indeed no values shown for sname in parameter_estimate_comparisons_chart()

lamaeldo avatar Apr 29 '24 18:04 lamaeldo