Using reweight.names in fastLink() returns only completely NA rows

Open brittlh opened this issue 2 years ago • 5 comments

I've run fastLink() both with and without the reweight.names option to confirm that the data are otherwise matched without issue.

Code:

fastLink(dfA = dfA, dfB = dfB,
         varnames = c("first", "last", "company"),
         stringdist.match = c("first", "last", "company"),
         stringdist.method = "lv",
         return.df = TRUE,
         reweight.names = TRUE, firstname.field = "first",
         dedupe.matches = FALSE, verbose = TRUE)

The matched data output includes NA cases; in each of those cases, every field is "NA":

[image: matched data output]

Any idea what's gone wrong here? Thank you for looking into this.
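To put a number on the symptom described above, the all-NA rows can be counted directly. This is a minimal base R sketch on toy data; `matched` is a hypothetical stand-in for whichever matched data frame fastLink returned, not the reporter's actual output.

```r
# Hypothetical stand-in for a matched data frame (toy data only).
matched <- data.frame(
  first   = c("ann", NA, "bob", NA),
  last    = c("lee", NA, "kim", NA),
  company = c("acme", NA, NA, NA),
  stringsAsFactors = FALSE
)

# A row is "completely NA" when every one of its columns is missing.
all_na <- rowSums(is.na(matched)) == ncol(matched)

n_all_na <- sum(all_na)   # rows where every field is NA
n_usable <- sum(!all_na)  # rows with at least one non-missing field
print(c(all_na = n_all_na, usable = n_usable))
```

If the fields hold the literal string "NA" rather than a true missing value, `is.na()` will not flag them, so that case would need a comparison against `"NA"` instead.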

brittlh, Jul 20 '22 15:07

Hi,

Your code looks OK. Do you happen to have a reproducible example you could share with us? More than happy to take a look.

All my best,

Ted

tedenamorado, Jul 20 '22 15:07

I wasn't able to create a reproducible scaled-down example, so I took a simple random sample (SRS) of 10% of each of the two datasets I'm working with and tried again. This time I got 18 rows back, of which 8 were NA rows and 10 were match rows. Is it possible the issue is linked to the size of the datasets? (dfA has about 1k rows, dfB about 220k.)
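For anyone trying to reproduce this, a 10% simple random sample of each dataset might be drawn as follows. This is only a sketch in base R: the toy `dfA` and `dfB`, the helper `srs()`, and the seed are all assumptions for illustration, not the data or code used in this thread.

```r
set.seed(2022)  # arbitrary seed so the sample is reproducible

# Toy stand-ins for the two datasets (sizes chosen for illustration).
dfA <- data.frame(id = 1:1000, stringsAsFactors = FALSE)
dfB <- data.frame(id = 1:5000, stringsAsFactors = FALSE)

# Hypothetical helper: simple random sample without replacement.
srs <- function(df, frac) {
  n <- round(nrow(df) * frac)
  df[sample.int(nrow(df), n), , drop = FALSE]
}

dfA_small <- srs(dfA, 0.10)
dfB_small <- srs(dfB, 0.10)
print(c(A = nrow(dfA_small), B = nrow(dfB_small)))
```

One thing to keep in mind: when the two files are sampled independently, a true match survives only if both of its records are drawn, so with 10% of each roughly 1% of true matched pairs remain. The sampled files can therefore overlap far less than the full ones.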

brittlh, Jul 21 '22 16:07

Hi,

Are there NAs in the name variable?

All my best,

Ted

tedenamorado, Aug 04 '22 00:08

Ted,

I did the check: no NAs. There were 2 blank strings (""). After filtering those out for testing, I reran fastLink and got the same result as I described above.

Appreciate your help. I'm going to keep looking into this in my spare time and see if any other data anomalies catch my attention that might trigger this issue.
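A check along the lines described above (NAs versus blank strings in the first-name field) could look like the following base R sketch. The toy `dfA` and the helper `is_blank()` are hypothetical; whitespace-only values are treated as blank here, which is an assumption about what should count.

```r
# Toy stand-in for dfA (not the real data).
dfA <- data.frame(
  first = c("ann", "", "bob", NA, "  "),
  last  = c("lee", "kim", "roe", "poe", "day"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for NA, "", or whitespace-only values.
is_blank <- function(x) is.na(x) | trimws(x) == ""

n_na    <- sum(is.na(dfA$first))
n_blank <- sum(!is.na(dfA$first) & trimws(dfA$first) == "")

# Keep only rows with a usable first name.
dfA_clean <- dfA[!is_blank(dfA$first), , drop = FALSE]
print(c(na = n_na, blank = n_blank, kept = nrow(dfA_clean)))
```

The same check would need to be run on the corresponding field of dfB as well.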

brittlh, Aug 12 '22 22:08

Disclaimer: I am a regular fastLink user, not a fastLink developer.

Is the scaled-down dfA about 1K rows or about 100 rows? Do the datasets look fine after being read in? Approximately how much missingness is there? How many exact matches are there? Can you show the linkage patterns for the 18 returned rows? Little or no overlap could be the cause...
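Two of the questions above (how much missingness, how many exact matches) can be answered without fastLink at all. Here is a rough base R sketch on toy data; the data frames and the `vars` vector are illustrative assumptions that simply reuse the three field names from the original call.

```r
vars <- c("first", "last", "company")

# Toy stand-ins for the two datasets.
dfA <- data.frame(
  first   = c("ann", "bob", "cat", NA),
  last    = c("lee", "kim", "roe", "poe"),
  company = c("acme", "acme", NA, "zeta"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  first   = c("ann", "bob", "dan"),
  last    = c("lee", "kim", "roe"),
  company = c("acme", "apex", "acme"),
  stringsAsFactors = FALSE
)

# Share of missing values per linkage field in dfA.
miss_A <- colMeans(is.na(dfA[vars]))

# Exact matches on all three fields; rows with any missing field are
# dropped first so that NA never counts as an exact match.
exact   <- merge(na.omit(dfA[vars]), na.omit(dfB[vars]), by = vars)
n_exact <- nrow(exact)

print(miss_A)
print(n_exact)
```

A very small `n_exact` relative to the size of dfA would be consistent with the low-overlap explanation, although it does not prove it.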

Anders

aalexandersson, Aug 19 '22 12:08