hmni
hmni copied to clipboard
๐ Fuzzy Name Matching with Machine Learning
HMNI
Fuzzy name matching with machine learning. Perform common fuzzy name matching tasks including similarity scoring, record linkage, deduplication and normalization.
HMNI is trained on an internationally-transliterated Latin firstname dataset, where precision is afforded priority.
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
HMNI-Latin | 0.9393 | 0.9255 | 0.7548 | 0.8315 |
For an introduction to the methodology and research behind HMNI, please refer to my blog post.
Requirements
Python 3.5โ3.8
- tensorflow
- scikit-learn
- fuzzywuzzy
- abydos
- unidecode
QUICK USAGE GUIDE
Installation
Using PIP via PyPI
pip install hmni
Initialize a Matcherย Object
import hmni
matcher = hmni.Matcher(model='latin')
Single Pair Similarity
matcher.similarity('Alan', 'Al')
# 0.6838303319889133
matcher.similarity('Alan', 'Al', prob=False)
# 1
matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133
Record Linkage
import pandas as pd
df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold']})
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})
merged = matcher.fuzzymerge(df1, df2, how='left', on='name')
Name Deduplication and Normalization
names_list = ['Alan', 'Al', 'Al', 'James']
matcher.dedupe(names_list, keep='longest')
# ['Alan', 'James']
matcher.dedupe(names_list, keep='frequent')
# ['Al, 'James']
matcher.dedupe(names_list, keep='longest', replace=True)
# ['Alan, 'Alan', 'Alan', 'James']
Matcher Parameters
hmni.Matcher(model='latin', prefilter=True, allow_alt_surname=True, allow_initials=True, allow_missing_components=True)
- model (str) -- HMNI statistical model (latin by default)
- prefilter (bool) -- Should the matcher prefilter unlikely candidates (True by default)
- allow_alt_surname (bool) -- Should the matcher consider phonetic matching surnames e.g. Smith, Schmidt (True by default)
- allow_initials (bool) -- Should the matcher consider names with initials (True by default)
- allow_missing_components (bool) -- Should the matcher consider names with missing components (True by default)
Matcher Methods
similarity(name_a, name_b, prob=True, surname_first=False)
- name_a (str) -- First name for comparison
- name_b (str) -- Second name for comparison
- prob (bool) -- If True return a predicted probability, else binary class label
- threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
- surname_first (bool) -- If name strings start with surname (False by default)
fuzzymerge(df1, df2, how='inner', on=None, left_on=None, right_on=None, indicator=False, limit=1, threshold=0.5, allow_exact_matches=True, surname_first=False)
- df1 (pandas DataFrame or named Series) -- First/Left object to merge with
- df2 (pandas DataFrame or named Series) -- Second/Right object to merge with
-
how (str) -- Type of merge to be performed
-
inner
(default): Use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys -
left
: Use only keys from left frame, similar to a SQL left outer join; preserve key order -
right
: Use only keys from right frame, similar to a SQL right outer join; preserve key order -
outer
: Use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
-
- on (label or list) -- Column or index level names to join on. These must be found in both DataFrames
- left_on (label or list) -- Column or index level names to join on in the left DataFrame
- right_on (label or list) -- Column or index level names to join on in the right DataFrame
- indicator (bool) -- If True, adds a column to output DataFrame called โ_mergeโ with information on the source of each row (False by default)
- limit (int) -- Top number of name matches to consider (1 by default)
- threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
- allow_exact_matches (bool) -- If True allow merging on exact name matches, else do not consider exact matches (True by default)
- surname_first (bool) -- If name strings start with surname (False by default)
dedupe(names, threshold=0.5, keep='longest', reverse=True, limit=3, replace=False, surname_first=False)
- names (list) -- List of names to dedupe
- threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
-
keep (str) -- Specifies method for keeping one of multiple alternative names
-
longest
(default): Keeps longest name -
frequent
: Keeps most frequent name in names list
-
- reverse (bool) -- If True will sort matches descending order, else ascending (True by default)
- limit (int) -- Top number of name matches to consider (3 by default)
- replace (bool) -- If True return normalized name list, else return deduplicated name list (False by default)
- surname_first (bool) -- If name strings start with surname (False by default)
assign_similarity(name_a, name_b, score)
- name_a (str) -- First name for similarity score assignment
- name_b (str) -- Second name for similarity score assignment
- score (float) -- Assigned similarity score for pair of names
Contributing
Pull requests are welcome.
For developers wishing to build a model using Latin or non-Latin writing systems (Chinese, Cyrillic, Arabic),
jupyter notebooks are shared in the dev
folder to build models using similar methods.