recordlinkage
A powerful and modular toolkit for record linkage and duplicate detection in Python
-> New method for string comparison: Jaccard Similarity -> Ran all tests, 207 passed 90 failed, 2 errors, 768 warnings, None related to the Jaccard Similarity calculation -> Tested on...
The library documentation does not provide much guidance on train/test splits and cross-validation. See below an implementation using the KFold object in scikit-learn. How does the blocking strategy used...
Amazing scripts you've got, thanks a lot for sharing. I'm trying to match payment records, but I couldn't find an option to "enforce" that one of the set is present,...
Hi, all string algorithms compute the similarity as 1 - distance / max_length_string. This puts short strings at a disadvantage compared to long strings, and in some cases, a...
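To make the length effect concrete, here is a minimal sketch of the normalization described above, using a plain Levenshtein distance. The function names `levenshtein` and `normalized_similarity` are illustrative and are not taken from the library; the library's own implementation may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Similarity as 1 - distance / max_length_string (hypothetical helper)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(a, b) / longest


# One substitution in a 2-character string versus a 10-character string.
short_score = normalized_similarity("ab", "ac")
long_score = normalized_similarity("abcdefghij", "abcdefghik")
print(short_score, long_score)
```

Both pairs differ by a single edit, yet the short pair scores much lower than the long pair, which is the disadvantage the issue describes.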
Hello, I am currently using this module to do some record linking stuff, I am thinking of contributing some string matching algorithms that are implemented in [textdistance](https://github.com/life4/textdistance), I'm currently using...
Hi, I'm considering writing an extension that makes it possible to use Spark DataFrames with this tool, as they are pretty similar to pandas DataFrames, but do not necessarily have...
Hi, thank you for making this awesome library! I am a bit confused about the parameters of the numeric comparison function, specifically offset and scale. [Documentation for numeric](https://recordlinkage.readthedocs.io/en/latest/ref-compare.html) The graph arrow...
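One way to read offset and scale is sketched below for a linear decay. It assumes that the similarity stays at 1 while the absolute difference is within offset, drops to 0.5 at a distance of offset + scale, and reaches 0 at offset + 2 * scale. That reading is an assumption about the documented behaviour, and `linear_similarity` is a hypothetical standalone helper, not the library's function.

```python
def linear_similarity(a: float, b: float,
                      offset: float = 0.0, scale: float = 1.0) -> float:
    """Linear decay sketch (assumed semantics, not the library's code).

    - |a - b| <= offset          -> similarity 1.0
    - |a - b| == offset + scale  -> similarity 0.5
    - |a - b| >= offset + 2*scale -> similarity 0.0
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    distance = abs(a - b)
    beyond_offset = max(distance - offset, 0.0)  # distance past the free zone
    return max(1.0 - beyond_offset / (2.0 * scale), 0.0)


# With offset=2 and scale=4: full score up to a difference of 2,
# half score at a difference of 6, zero from a difference of 10.
for other in (12, 16, 20):
    print(other, linear_similarity(10, other, offset=2, scale=4))
```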
Specifying the number of cores (n_jobs) appears to make the algorithm run slower. dupe_indexer = rl.Index() dupe_indexer.block(['first_name_clean','last_name_clean']) dupe_candidate_links = dupe_indexer.index(df) compare_dupes = rl.Compare(n_jobs=12).
Hi, I'm just wondering if there is an example of using the current version of this package with the geographic method? If I try to add Haversine distance to my...
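For reference, the Haversine distance itself can be sketched in plain Python as below. This does not show how the package wires the distance into a comparison; `haversine_km` and the mean Earth radius constant are choices made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption for this sketch


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lng2 - lng1)
    h = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    # Guard against rounding pushing h marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


# Two points on the equator, half a world apart.
print(haversine_km(0.0, 0.0, 0.0, 180.0))
```

A distance like this would then typically be turned into a similarity score with a decay function before being combined with other comparison features.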
I have 2 dataframes - df1 = pd.DataFrame() df2 = pd.DataFrame() df1['company_name'] = ['Crysagi Systems Pvt','Coreview.'] df2['company_name'] = ['Crysagi Systems Pvt Ltd','Coreview','sadadas'] I am trying to do a fuzzy search...
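As a rough illustration of fuzzy matching on these company names, the sketch below scores every pair with the standard library's `difflib.SequenceMatcher` and keeps the best candidate above a cut-off. It deliberately does not use recordlinkage; the helper names and the 0.8 threshold are hypothetical choices for this example.

```python
from difflib import SequenceMatcher

# The company names from the question, as plain lists instead of DataFrames.
left_names = ['Crysagi Systems Pvt', 'Coreview.']
right_names = ['Crysagi Systems Pvt Ltd', 'Coreview', 'sadadas']


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(left, right, threshold=0.8):
    """Map each left name to its best-scoring right name, or None if below threshold."""
    result = {}
    for name in left:
        score, candidate = max((similarity(name, other), other) for other in right)
        result[name] = candidate if score >= threshold else None
    return result


matches = best_matches(left_names, right_names)
print(matches)
```

Comparing every pair is fine for a handful of rows; on larger tables an indexing or blocking step would be needed to keep the number of candidate pairs manageable.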