splink
Evaluation from ground truth column does not work without blocking rules specified
What happens?
I was trying to follow the "Evaluation from ground truth column" tutorial with some of my own data. My data is fairly small and has no blocking rules that are easy to express, so I set up my linker without any blocking rules. The model appeared to train fine, but when I tried to evaluate against my ground truth column, I got a SQL error that was initially opaque to me: Error was: Binder Error: Referenced column "match_key" not found in FROM clause!
It would be nice if the evaluation function did not strictly require blocking rules.
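To illustrate why this seems feasible: conceptually, the counts in a truth space table can be derived from scored record pairs plus a ground-truth label per record, with no reference to how the pairs were generated. The following is a minimal pure-Python sketch of that idea, not Splink's implementation. The function name `truth_space_rows`, the toy records, and the choice to treat unscored pairs as predicted non-matches are all assumptions made for illustration.

```python
from itertools import combinations


def truth_space_rows(records, pair_weights, thresholds):
    """Count TP/FP/FN/TN at each match-weight threshold.

    records:      dict of record_id -> ground-truth cluster label
    pair_weights: dict of (id_l, id_r) with id_l < id_r -> match weight
    thresholds:   iterable of match-weight thresholds to evaluate
    """
    all_pairs = list(combinations(sorted(records), 2))
    rows = []
    for threshold in thresholds:
        tp = fp = fn = tn = 0
        for pair in all_pairs:
            id_l, id_r = pair
            is_true_match = records[id_l] == records[id_r]
            # Assumption: a pair that was never scored counts as a
            # predicted non-match at every threshold.
            weight = pair_weights.get(pair, float("-inf"))
            is_predicted_match = weight >= threshold
            if is_true_match and is_predicted_match:
                tp += 1
            elif is_predicted_match:
                fp += 1
            elif is_true_match:
                fn += 1
            else:
                tn += 1
        rows.append(
            {"threshold": threshold, "tp": tp, "fp": fp, "fn": fn, "tn": tn}
        )
    return rows


# Hypothetical toy data: four records in two true clusters.
records = {1: "a", 2: "a", 3: "b", 4: "b"}
pair_weights = {(1, 2): 5.0, (1, 3): -2.0, (2, 4): 1.5, (3, 4): 0.5}
rows = truth_space_rows(records, pair_weights, thresholds=[0.0, 2.0])
```

Nothing in this calculation depends on which blocking rule produced a pair, which is why a hard dependency on match_key looks avoidable, at least from the outside.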
To Reproduce
This can be reproduced with the tutorial data by simply removing (here, commenting out) the entries in blocking_rules_to_generate_predictions.
from splink.datasets import splink_datasets
import altair as alt
alt.renderers.enable("html")
df = splink_datasets.fake_1000
df.head(2)
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_template_library as ctl
import splink.duckdb.comparison_library as cl
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        # block_on("first_name"),
        # block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}
linker = DuckDBLinker(df, settings, set_up_basic_logging=False)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]
linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)
session_dob = linker.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.estimate_parameters_using_expectation_maximisation(block_on("email"))
linker.truth_space_table_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1
).as_pandas_dataframe(limit=5)
This produces a long error trace, which includes the SQL generated for the calculation and ends with:
Error was: Binder Error: Referenced column "match_key" not found in FROM clause!
Candidate bindings: "__splink__df_predict_4ff203160.match_weight"
LINE 10: not (cast(match_key as int) = 0)
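A possible workaround, not verified against 3.9.14, is to make sure the blocking rule list is never empty by falling back to an always-true rule. The sketch below only manipulates the settings dictionary and does not import Splink. The helper name `with_fallback_blocking_rule` is hypothetical, and it rests on two assumptions: that Splink accepts a raw SQL string such as "1=1" as a blocking rule, and that a non-empty list is enough for the match_key column to be produced.

```python
def with_fallback_blocking_rule(settings, fallback_rule="1=1"):
    """Return a copy of `settings` whose prediction blocking rules are non-empty.

    Assumption: "1=1" is accepted as a blocking rule and compares every
    pair of records, which is only practical for small datasets.
    """
    key = "blocking_rules_to_generate_predictions"
    patched = dict(settings)  # shallow copy; the caller's dict is not modified
    if not patched.get(key):
        patched[key] = [fallback_rule]
    return patched


# Hypothetical, stripped-down settings with no blocking rules.
example_settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [],
}
patched_settings = with_fallback_blocking_rule(example_settings)
```

The patched dictionary would then be passed to the linker in place of the original settings. Existing rules are left untouched, so the helper only changes behaviour in the empty case.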
OS:
macOS 13.5
Splink version:
3.9.14
Have you tried this on the latest master branch?
- [X] I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] I agree