recordlinkage

Generating Pairs

Open thbeh opened this issue 4 years ago • 2 comments

Hi, this is not an issue per se; I just need to understand the process.

I previously wrote some code for dedup using RL. That process had about 1 million rows and 6 columns, mainly people attributes, e.g. FirstName, LastName, etc. With Block('last'), the number of generated pairs was around 185,224,110.

I am now using the same code on about 110K rows with the same schema, and the number of generated pairs was 703,701,657. Could anyone help explain the criteria by which the pairs are generated, and why there is such a huge jump even though there are fewer rows?

Thanks in advance. Cheers

thbeh, Dec 06 '20 20:12

The default blocking behavior is a union of all possible matches for each indexer. If you are only blocking left/right on last_name, there is a chance that many of the rows have the same last name. Is this a dedup process? You only mention one dataset, as opposed to two.
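
To make the arithmetic behind this concrete: when deduplicating with a single blocking key, every group of n rows sharing the same key value contributes n(n-1)/2 candidate pairs, so the total depends on how the rows are distributed across key values rather than on the row count alone. The sketch below is plain Python and does not call recordlinkage; the helper names `count_block_pairs` and `largest_blocks` and the sample data are made up for illustration, and it assumes exact matching on one key.

```python
from collections import Counter


def count_block_pairs(keys):
    """Count candidate pairs for dedup when blocking on a single key.

    Each block of n rows with the same key yields n * (n - 1) // 2 pairs.
    """
    block_sizes = Counter(keys)
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


def largest_blocks(keys, top=3):
    """Return the most frequent key values with their block sizes."""
    return Counter(keys).most_common(top)


# Hypothetical data: 8 rows spread evenly over 4 last names.
spread = ["smith", "jones", "beh", "lee"] * 2
# Hypothetical data: only 6 rows, but all share one last name.
skewed = ["smith"] * 6

print(count_block_pairs(spread))  # 4 blocks of 2 rows -> 4 pairs
print(count_block_pairs(skewed))  # 1 block of 6 rows -> 15 pairs
print(largest_blocks(skewed))
```

Here the smaller dataset produces more pairs because all of its rows land in one block. By the same formula, a single block of about 37,500 rows would already account for roughly 703 million pairs, so it may be worth checking whether one value of the blocking column (a very common name, an empty string, or a placeholder) dominates the 110K-row dataset.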



utah-vabrandon, Dec 07 '20 15:12

Yes, this is a dedup process that I am testing on one dataset.

thbeh, Dec 07 '20 20:12