`ECMClassifier` returns almost all candidate pairs

Open Evnsn opened this issue 1 year ago • 2 comments

```python
import recordlinkage
from recordlinkage.index import Block
from recordlinkage.compare import String
from recordlinkage.datasets import load_febrl3

# Load the FEBRL3 dataset together with its known true links
df, true_links = load_febrl3(return_links=True)

# Generate candidate pairs by blocking on date of birth
indexer = recordlinkage.Index([
    Block("date_of_birth")
])

candidate_pairs = indexer.index(df)

print(len(candidate_pairs))  # prints 5966

# Generate comparison vectors (Jaro-Winkler similarity per field)
comparer = recordlinkage.Compare([
    String("given_name", "given_name", method="jarowinkler", label="given_name"),
    String("surname", "surname", method="jarowinkler", label="surname"),
    String("soc_sec_id", "soc_sec_id", method="jarowinkler", label="soc_sec_id"),
    String("address_1", "address_1", method="jarowinkler", label="address_1"),
])

comparison_vector = comparer.compute(candidate_pairs, df)

# Match entities
ecm = recordlinkage.ECMClassifier(binarize=0.1)

pred_links = ecm.fit_predict(comparison_vector)

print(len(pred_links))  # prints 5836
```

I attempted to replicate my problem in the code snippet above. There are 5966 candidate pairs and my ECM classifier returns 5836 of them as matches.

Problem: I want to use `ECMClassifier` for entity matching. However, when I apply it to my own dataset, all of the candidate pairs are classified as matches.

Is there some parameter I can set to tweak the threshold for match vs non-match, or am I missing something else here?

Evnsn, May 24 '23 11:05

I think the threshold for binarizing is too low, so you are converting nearly every value in the comparison vectors to 1 and therefore getting all matches. Try increasing the `binarize` threshold.
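
To illustrate the point about the threshold, here is a minimal sketch in plain Python. It assumes that binarizing maps every similarity strictly above the threshold to 1 and everything else to 0. The scores and the `binarize_row` helper are made up for illustration; they are not computed from the FEBRL data and are not part of recordlinkage.

```python
# Hypothetical Jaro-Winkler-style similarity scores for three candidate pairs.
# These values are illustrative only, not computed from the FEBRL3 data.
scores = [
    {"given_name": 1.00, "surname": 0.96, "soc_sec_id": 1.00, "address_1": 0.91},
    {"given_name": 0.52, "surname": 0.47, "soc_sec_id": 0.61, "address_1": 0.44},
    {"given_name": 0.40, "surname": 0.55, "soc_sec_id": 0.48, "address_1": 0.39},
]


def binarize_row(row, threshold):
    """Map each similarity to 1 if it is strictly above the threshold, else 0.

    Assumption: this mirrors what a `binarize` threshold does conceptually.
    """
    return {field: int(value > threshold) for field, value in row.items()}


# With a low threshold, every pair ends up with the same all-ones vector.
low = [binarize_row(row, 0.1) for row in scores]

# With a higher threshold, only the first (very similar) pair keeps its ones.
high = [binarize_row(row, 0.85) for row in scores]

for label, rows in (("threshold=0.1", low), ("threshold=0.85", high)):
    ones_per_pair = [sum(row.values()) for row in rows]
    print(label, ones_per_pair)
```

If all binarized vectors look identical, the classifier has nothing to separate matches from non-matches, which is why a threshold as low as 0.1 may be worth ruling out first.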

konsbn, Jun 15 '23 13:06

Thank you for the suggestion. Unfortunately, it does not seem to make any significant difference. I tried both lowering and raising the threshold.

Evnsn, Jun 27 '23 13:06