recordlinkage
How do I perform deduplication on large data sets with the Python Record Linkage Toolkit?
I am deduplicating a single dataset of 1M records on one machine (m5.4xlarge, 16 cores and 64 GB RAM). I set up the following matching configuration, but it runs out of memory:
- Indexing sortedneighbourhood for AddressTypeDescription with window=3
- Indexing block for ['Designation', 'Department', 'City', 'Gender', 'Country', 'Region']
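A quick way to check whether a blocking key is too coarse is to count the candidate pairs it would produce before running the indexer. The sketch below is plain Python and does not use recordlinkage. The helper name `candidate_pairs_per_block` and the toy records are hypothetical, and it assumes exact blocking pairs every record with every other record that shares the same key values, so a block of size n yields n(n-1)/2 pairs.

```python
from collections import Counter


def candidate_pairs_per_block(records, keys):
    """Count the pairs exact blocking on `keys` would generate for deduplication.

    Hypothetical helper: each block of n records contributes n * (n - 1) / 2 pairs.
    """
    block_sizes = Counter(tuple(rec[k] for k in keys) for rec in records)
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Toy records (made up for illustration, not taken from the real dataset).
records = [
    {"City": "Pune", "Gender": "M"},
    {"City": "Pune", "Gender": "M"},
    {"City": "Pune", "Gender": "M"},
    {"City": "Delhi", "Gender": "F"},
    {"City": "Delhi", "Gender": "F"},
    {"City": "Delhi", "Gender": "M"},
]

pairs_two_keys = candidate_pairs_per_block(records, ["City", "Gender"])
pairs_one_key = candidate_pairs_per_block(records, ["City"])
print(pairs_two_keys, pairs_one_key)
```

Because the count grows with the square of the block size, a few very large blocks (for example, many records sharing the same low-cardinality values) can dominate the total.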
Error (running out of memory):
Unable to allocate 165. GiB for an array with shape (22179322464,) and data type int64
Unable to allocate 14.2 GiB for an array with shape (1906374956,) and data type int64
Unable to allocate 23.0 GiB for an array with shape (1, 3092850189) and data type object
Unable to allocate 23.0 GiB for an array with shape (3092850193, 1) and data type object
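The sizes in these messages follow directly from the array shapes. As a rough check, assuming 8 bytes per element (true for int64, and for object arrays on a 64-bit build, where each element is a pointer), the element counts convert to GiB as sketched below. The function name `gib` is just for illustration.

```python
BYTES_PER_ELEMENT = 8  # int64, or one object pointer on a 64-bit build (assumption)
GIB = 2 ** 30


def gib(n_elements, itemsize=BYTES_PER_ELEMENT):
    """Memory needed for a flat array of n_elements, in GiB."""
    return n_elements * itemsize / GIB


# Element counts copied from the error messages.
for n in (22179322464, 1906374956, 3092850189):
    print(f"{n:>14,d} elements -> {gib(n):6.2f} GiB")
```

Each of these arrays has billions of elements, so any one of the larger allocations alone is a sizeable fraction of, or exceeds, the 64 GB available on the machine.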
In short, the process stalls or stops at the indexing step on the large dataset.
Could you please suggest how to overcome this?
Regards Sid