splink icon indicating copy to clipboard operation
splink copied to clipboard

threshold_match_weight in linker.predict() not checked when equal to 0

Open medwar99 opened this issue 1 year ago • 1 comments

What happens?

Weird edge case, but when calling linker.predict with a threshold_match_weight of 0 to remove negative weighted records, the threshold isn't applied and it returns all records. There is a logical check on threshold_match_weight but it should check that threshold_match_weight is not None, to handle the case it may be set to zero.

https://github.com/moj-analytical-services/splink/blob/868fefba022483f4cb0d05822aa55d62ee9b92fa/splink/predict.py#L58

To Reproduce

From the rough and ready example, last two lines of interest.

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl

df = splink_datasets.historical_50k

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("full_name"),
        block_on(["substr(full_name,1,6)", "dob", "birth_place"]),
        block_on(["dob", "birth_place"]),
        block_on("postcode_fake"),
    ],
    "comparisons": [
        cl.jaro_at_thresholds("full_name", [0.9, 0.7], term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("dob", [1, 2]),
        cl.levenshtein_at_thresholds("postcode_fake", 2),
        cl.jaro_winkler_at_thresholds("birth_place", 0.9, term_frequency_adjustments=True),
        cl.exact_match("occupation",  term_frequency_adjustments=True),
    ],
}

linker = DuckDBLinker(df, settings)
deterministic_rules = [
    "l.full_name = r.full_name",
    "l.postcode_fake = r.postcode_fake and l.dob = r.dob",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)
linker.estimate_u_using_random_sampling(max_pairs=2e7)

results = linker.predict(threshold_match_weight=0) # set threshold weight to 0

assert results.as_pandas_dataframe()['match_weight'].min() >= 0 # Fails if there are still negative weights

OS:

Windows 10

Splink version:

3.9.8

Have you tried this on the latest master branch?

  • [X] I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] I agree

medwar99 avatar Nov 09 '23 10:11 medwar99

Thanks for the report! If you are happy to feel free to put in a PR with the fix for this. Otherwise someone from the team will have a look into it when there's some space to do so

ADBond avatar Nov 09 '23 10:11 ADBond