recordlinkage
Generating Pairs
Hi, this is not an issue per se; I just need to understand the process.
I previously wrote some code for dedup using RL. That dataset had about 1 million rows and 6 columns, mainly people attributes, e.g. FirstName, LastName, etc. With Block('last'), the number of generated pairs was around 185,224,110.
I am now using the same code on about 110K rows with the same schema, and the number of generated pairs was 703,701,657. Could anyone help explain the criteria by which the pairs are generated, and why there is such a huge jump even though there are fewer rows?
Thanks in advance. Cheers
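The default blocking behavior is a union of all possible matches for each indexer. If you are only blocking left/right on last_name, there is a chance that many of the rows have the same last name. Every row in such a group is paired with every other row in it, so the pair count depends on how the rows are spread across last names rather than on the total row count; a few very common values can dominate. Is this a dedup process? You only mention one dataset, as opposed to two.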
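To make the effect of block sizes concrete, here is a minimal sketch in plain Python (it does not call recordlinkage itself). The helper name `dedup_pair_count` and the sample values are hypothetical, and it assumes that records with a missing key form no pairs. It counts the pairs that exact blocking on one key yields in a dedup: a block of n records contributes n * (n - 1) / 2 pairs.

```python
from collections import Counter


def dedup_pair_count(block_keys):
    """Count candidate pairs for a dedup with exact blocking on one key.

    Records sharing a key form one block; a block of size n contributes
    n * (n - 1) // 2 unordered pairs. Assumption: records whose key is
    None are skipped and form no pairs.
    """
    sizes = Counter(key for key in block_keys if key is not None)
    return sum(n * (n - 1) // 2 for n in sizes.values())


# Hypothetical data: both lists hold 10 records.
even = [f"name{i % 5}" for i in range(10)]   # 5 blocks of 2 records each
skewed = ["smith"] * 8 + ["lee", "wong"]     # 1 block of 8, 2 singletons

pairs_even = dedup_pair_count(even)      # 5 * 1 = 5 pairs
pairs_skewed = dedup_pair_count(skewed)  # 8 * 7 // 2 = 28 pairs
pairs_full = 10 * 9 // 2                 # no blocking at all: 45 pairs
```

Both lists have the same number of records, yet the skewed one produces far more pairs. Checking the frequency of the most common blocking values in the 110K-row data (for example placeholder or empty last names) may show whether a few large blocks account for the jump.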
-- Vincent Brandon, Data Coordinator, Utah Data Research Center
Yes, this is a dedup process that I am testing on one dataset.