recordlinkage
A powerful and modular toolkit for record linkage and duplicate detection in Python
It is a little frustrating that I cannot find any explicit way to solve this in the record linkage documentation, though seemingly it would be a very commonplace...
Pandas datatypes, such as `pd.Int64Dtype` (see [here](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes)), do not seem to be supported:

```python
import recordlinkage
from recordlinkage.datasets import load_febrl4

dfA, dfB = load_febrl4()
# Convert column types to pandas...
```
Hi, I am using the ECM classifier as the unsupervised classifier for my problem, but I keep getting an error when calling it that I do not understand: ecm.fit(df_feature_vectors) log_m_probablity...
import recordlinkage
indexer = recordlinkage.Index()
indexer = recordlinkage.SortedNeighbourhoodIndex(on='label', window=9)
candidate_links = indexer.index(featuresfinal, targetfinal)
comp = recordlinkage.Compare()
comp.string('label', 'label', method='jarowinkler', label='labels')
mymatches = comp.compute(candidate_links, featuresfinal, targetfinal)
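For readers unfamiliar with the two steps in the snippet above (sorted neighbourhood indexing, then string comparison on the candidate pairs), the following is a minimal standard-library sketch of the general idea. It is not recordlinkage's implementation and may differ from it in detail. The function names `sorted_neighbourhood_pairs` and `similarity` and the sample records are hypothetical, and `difflib.SequenceMatcher` is swapped in as a stand-in similarity measure; it is not Jaro-Winkler.

```python
from difflib import SequenceMatcher


def sorted_neighbourhood_pairs(left, right, window=3):
    """Pair records whose keys lie within `window` positions of each other
    in the sorted list of distinct key values (illustrative sketch only).

    `left` and `right` map record id -> key string.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")

    # Rank every distinct key value across both datasets.
    keys = sorted(set(left.values()) | set(right.values()))
    rank = {key: pos for pos, key in enumerate(keys)}

    # Bucket the right-hand records by the rank of their key.
    right_by_rank = {}
    for rid, key in right.items():
        right_by_rank.setdefault(rank[key], []).append(rid)

    half = window // 2
    pairs = []
    for lid, key in left.items():
        centre = rank[key]
        # Only look at keys inside the window around this record's key.
        for r in range(centre - half, centre + half + 1):
            for rid in right_by_rank.get(r, []):
                pairs.append((lid, rid))
    return pairs


def similarity(a, b):
    """Stand-in string similarity in [0, 1] (difflib ratio, NOT Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical sample data: record id -> value of the 'label' column.
features = {"a1": "smith", "a2": "jones", "a3": "brown"}
targets = {"b1": "smyth", "b2": "jonas", "b3": "white"}

candidate_links = sorted_neighbourhood_pairs(features, targets, window=3)
scores = {
    (lid, rid): similarity(features[lid], targets[rid])
    for lid, rid in candidate_links
}
```

A wider window yields more candidate pairs and therefore more comparisons, so the window size trades recall against the cost of the comparison step.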
I've been developing some data corruption algorithms (inspired by the documentation from https://dmm.anu.edu.au/geco/flex-data-gen-manual.pdf but without looking at the source code, since it has an unusual license), and I wonder if your...
Python 3.9.11, fastparquet 0.8.1: writing a dataframe to a parquet file from a table data field with RTF document content fails with a TypeError exception. fp.write(fpath, rows, compression='GZIP', row_group_offsets=row_group_offsets) fails with traceback: TypeError:...
Hi, just wondering whether the EM algorithm for frequency-based estimates, or any other algorithm that takes value frequencies into account, is or will be included in the package? Thanks!!
Hi, I am linking two datasets. Both of them contain unique IDs as identifiers. After reading the two datasets into pandas data frames, I set those IDs as their indexes. So...
Hello. I have around 0.3 million records and I have to pair them on at least 3 columns, so after doing that I have 40 million index records, and when I'm...
The docs do not mention which languages are supported.