
How do I perform deduplication with the Python Record Linkage Toolkit on large data sets?

Open sidhugithub1 opened this issue 6 months ago • 0 comments

I am deduplicating a single dataset of 1M records on one machine (M5.4xlarge, 16 cores, 64 GB RAM). I set up the following indexing configuration, but it runs out of memory.

  1. Indexing sortedneighbourhood for AddressTypeDescription with window=3
  2. Indexing block for ['Designation', 'Department', 'City', 'Gender', 'Country', 'Region']
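For reference, the number of candidate pairs that exact blocking produces in a deduplication can be estimated before building the index: every group of n records sharing the same key values contributes n(n-1)/2 pairs. The following is a minimal standard-library sketch of that count. The helper name `estimate_block_pairs` and the toy records are hypothetical, it is not part of recordlinkage, and it does not model the sorted neighbourhood step.

```python
from collections import Counter


def estimate_block_pairs(records, keys):
    """Estimate candidate pairs from exact blocking on `keys` within one dataset.

    Hypothetical helper for illustration; not a recordlinkage function.
    """
    # Count how many records share each combination of blocking key values.
    group_sizes = Counter(tuple(rec[k] for k in keys) for rec in records)
    # A group of n records yields n * (n - 1) / 2 unordered pairs.
    return sum(n * (n - 1) // 2 for n in group_sizes.values())


# Toy data (made up) using two of the column names from the configuration.
records = [
    {"City": "Pune", "Gender": "F"},
    {"City": "Pune", "Gender": "F"},
    {"City": "Pune", "Gender": "F"},
    {"City": "Pune", "Gender": "M"},
    {"City": "Delhi", "Gender": "M"},
    {"City": "Delhi", "Gender": "M"},
]

pairs_city = estimate_block_pairs(records, ["City"])
pairs_city_gender = estimate_block_pairs(records, ["City", "Gender"])
print(pairs_city, pairs_city_gender)
```

Running the same count on the real columns would show which key combinations have few distinct values and therefore generate very large groups.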

Errors when running out of memory:

Unable to allocate 165. GiB for an array with shape (22179322464,) and data type int64

Unable to allocate 14.2 GiB for an array with shape (1906374956,) and data type int64

Unable to allocate 23.0 GiB for an array with shape (1, 3092850189) and data type object

Unable to allocate 23.0 GiB for an array with shape (3092850193, 1) and data type object
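The sizes in these messages follow directly from the array shapes: element count times bytes per element. The sketch below reproduces the arithmetic with the standard library only. It assumes 8 bytes per `int64` element and 8 bytes per `object` element (a pointer on a 64-bit build); the names `ITEMSIZE` and `array_gib` are made up for illustration.

```python
GIB = 2**30  # bytes in one GiB

# Bytes per element. The object entry is an assumption: 8-byte pointers
# on a 64-bit Python build, not counting the objects they point to.
ITEMSIZE = {"int64": 8, "object": 8}


def array_gib(n_elements, dtype):
    """Return the memory needed for n_elements of dtype, in GiB."""
    return n_elements * ITEMSIZE[dtype] / GIB


# Element counts and dtypes copied from the error messages above.
failed_allocations = [
    (22179322464, "int64"),
    (1906374956, "int64"),
    (3092850189, "object"),
    (3092850193, "object"),
]

sizes = [round(array_gib(n, dtype), 1) for n, dtype in failed_allocations]
print(sizes)
```

The first allocation alone is larger than the 64 GB of RAM on the machine, so the index of roughly 22 billion entries cannot be held in memory in one piece.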

In short, the process gets stuck or stops at the indexing step for a large dataset.

Could you please suggest how to overcome this?

Regards, Sid

sidhugithub1, Aug 01 '24 05:08