splink
splink copied to clipboard
threshold_match_weight in linker.predict() not checked when equal to 0
What happens?
Weird edge case, but when calling linker.predict with a threshold_match_weight
of 0 to remove negative weighted records, the threshold isn't applied and it returns all records. There is a logical check on threshold_match_weight
but it should check that threshold_match_weight
is not None, to handle the case it may be set to zero.
https://github.com/moj-analytical-services/splink/blob/868fefba022483f4cb0d05822aa55d62ee9b92fa/splink/predict.py#L58
To Reproduce
From the rough and ready example, last two lines of interest.
from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl
df = splink_datasets.historical_50k
settings = {
"link_type": "dedupe_only",
"blocking_rules_to_generate_predictions": [
block_on("full_name"),
block_on(["substr(full_name,1,6)", "dob", "birth_place"]),
block_on(["dob", "birth_place"]),
block_on("postcode_fake"),
],
"comparisons": [
cl.jaro_at_thresholds("full_name", [0.9, 0.7], term_frequency_adjustments=True),
cl.levenshtein_at_thresholds("dob", [1, 2]),
cl.levenshtein_at_thresholds("postcode_fake", 2),
cl.jaro_winkler_at_thresholds("birth_place", 0.9, term_frequency_adjustments=True),
cl.exact_match("occupation", term_frequency_adjustments=True),
],
}
linker = DuckDBLinker(df, settings)
deterministic_rules = [
"l.full_name = r.full_name",
"l.postcode_fake = r.postcode_fake and l.dob = r.dob",
]
linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)
linker.estimate_u_using_random_sampling(max_pairs=2e7)
results = linker.predict(threshold_match_weight=0) # set threshold weight to 0
assert results.as_pandas_dataframe()['match_weight'].min() >= 0 # Fails if there are still negative weights
OS:
Windows 10
Splink version:
3.9.8
Have you tried this on the latest master
branch?
- [X] I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] I agree
Thanks for the report! If you are happy to feel free to put in a PR with the fix for this. Otherwise someone from the team will have a look into it when there's some space to do so