fastLink

dedupeMatches does not consider exact matches

Open jw2249a opened this issue 1 year ago • 2 comments

The deduplication appears to use the match pattern and the matched record's index and keep the match with the highest zeta value, but it does not account for zeta values that are exactly equal. This leads to inconsistent behavior.

Setup: The issue can be reproduced by appending the first row of dfA (where firstname is "daniel") to both dfA and dfB. The appended record is then an exact match to a row in dfA and to a row in dfB.

Issue 1: With the setup above, the dedupe algorithm returns all of the matched values. However, if you change the firstname in the first row to NA, that row is removed.

Issue 2: If you change the lastname "secuya" to "secuyas" while leaving the firstname as NA, the row is still removed by the dedupe function. But if you set the firstname back to "daniel", it is not deduped.
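The tie behavior described in this report can be sketched with a toy example. This is not fastLink's code (fastLink is an R package); it is a minimal Python illustration, and the function name `keep_best`, the tuple layout, and the scores are all hypothetical. It assumes deduplication keeps, for each record in one data set, every candidate pair whose score equals the maximum for that record.

```python
def keep_best(pairs, key_index):
    """For each record id found at key_index, keep every pair whose
    score equals the highest score seen for that record."""
    best = {}
    for pair in pairs:
        key = pair[key_index]
        best[key] = max(best.get(key, float("-inf")), pair[2])
    # An equality filter cannot break ties: all tied pairs survive.
    return [p for p in pairs if p[2] == best[p[key_index]]]

# Hypothetical (index_in_A, index_in_B, score) candidate pairs.
# Records 1 and 2 in A are exact duplicates, so they tie on record 1 in B.
pairs = [(1, 1, 0.99), (2, 1, 0.99), (3, 2, 0.80), (4, 2, 0.95)]
deduped = keep_best(pairs, key_index=1)
```

Under these assumptions, both tied pairs for record 1 in B are kept, so the result is not one-to-one even after "deduplication". An explicit tie-breaking rule would be needed to guarantee a single match per record.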

jw2249a · Jan 04 '24 03:01

Thanks for letting us know! I will try to reproduce what you describe and report back.

tedenamorado · Jan 04 '24 03:01

I found the issue with the deduplication. The order of the data frames matters because the duplicate row IDs are removed before they are checked for again in dfB.
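A toy example of why order can matter when deduplication is applied to one side and then the other. Again, this is a hypothetical Python sketch rather than fastLink's implementation; `keep_best`, the record labels, and the scores are invented for illustration.

```python
def keep_best(pairs, key_index):
    """For each record id found at key_index, keep the pair(s) with the
    highest score for that record."""
    best = {}
    for pair in pairs:
        key = pair[key_index]
        best[key] = max(best.get(key, float("-inf")), pair[2])
    return [p for p in pairs if p[2] == best[p[key_index]]]

# Hypothetical (record_in_A, record_in_B, score) candidate pairs.
pairs = [("a1", "b1", 0.9), ("a1", "b2", 0.8), ("a2", "b2", 0.7)]

# Dedupe on the A side first, then on the B side.
a_then_b = keep_best(keep_best(pairs, 0), 1)

# Dedupe on the B side first, then on the A side.
b_then_a = keep_best(keep_best(pairs, 1), 0)
```

In this sketch, `a_then_b` keeps two pairs while `b_then_a` keeps only one, because the first pass discards rows that the second pass would otherwise have compared. A sequential dedupe of this kind is therefore sensitive to which data frame is processed first.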

jw2249a · Jan 26 '24 20:01