recordlinkage
`ECMClassifier` returns almost all candidate pairs
import recordlinkage
from recordlinkage.index import Block
from recordlinkage.compare import String
from recordlinkage.datasets import load_febrl3
df, true_links = load_febrl3(return_links=True)
# Generate candidate pairs
indexer = recordlinkage.Index([
    Block("date_of_birth")
])
candidate_pairs = indexer.index(df)
print(len(candidate_pairs))  # prints 5966
# Generate comparison vectors
comparer = recordlinkage.Compare([
    String("given_name", "given_name", method="jarowinkler", label="given_name"),
    String("surname", "surname", method="jarowinkler", label="surname"),
    String("soc_sec_id", "soc_sec_id", method="jarowinkler", label="soc_sec_id"),
    String("address_1", "address_1", method="jarowinkler", label="address_1"),
])
comparison_vector = comparer.compute(candidate_pairs, df)
# Match entities
ecm = recordlinkage.ECMClassifier(binarize=0.1)
pred_links = ecm.fit_predict(comparison_vector)
print(len(pred_links))  # prints 5836
I attempted to replicate my problem in the code snippet above. There are 5966 candidate pairs and my ECM classifier returns 5836 of them as matches.
Problem: I want to use `ECMClassifier` for entity matching. However, when I apply it to my dataset, ALL the candidate pairs are identified as matches, which is unfortunate.
Is there some parameter I can set to tweak the threshold for match vs non-match, or am I missing something else here?
I think the threshold for binarizing is too low: with `binarize=0.1` you are probably converting almost all the feature values to 1, and thus getting almost all pairs back as matches. Try increasing the `binarize` threshold.
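To illustrate what I mean, here is a minimal pure-Python sketch of the binarization step. It assumes that values strictly above the threshold become 1 and everything else becomes 0 (the convention of scikit-learn's `binarize`; I have not checked that `ECMClassifier` does exactly this internally). The similarity scores are made up for illustration and are not taken from the FEBRL data, and `binarize_rows` and `share_of_ones` are hypothetical helper names.

```python
def binarize_rows(rows, threshold):
    """Map each similarity score to 1 if it is strictly above the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]


def share_of_ones(rows, threshold):
    """Fraction of all feature values that end up as 1 after binarizing."""
    flat = [value for row in binarize_rows(rows, threshold) for value in row]
    return sum(flat) / len(flat)


# Made-up Jaro-Winkler-style scores for three candidate pairs
# (columns: given_name, surname, soc_sec_id, address_1).
scores = [
    [1.00, 0.96, 0.93, 0.90],  # looks like a true match
    [0.48, 0.52, 0.61, 0.44],  # looks like a non-match
    [0.00, 0.45, 0.55, 0.39],  # looks like a non-match
]

low = share_of_ones(scores, 0.1)    # low threshold: nearly every feature becomes 1
high = share_of_ones(scores, 0.85)  # higher threshold: only the strong agreements stay 1

print(f"share of ones at 0.10: {low:.2f}")
print(f"share of ones at 0.85: {high:.2f}")
```

With a low threshold the two non-matching rows become nearly indistinguishable from the matching one, which leaves the classifier little to separate them on. On the real data, the same check could be run on `comparison_vector` before fitting.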
Thank you for the suggestion. Unfortunately, it does not seem to make any significant difference. I tried both lowering and increasing the threshold.