fuzzyjoin

A couple questions and a request

Open carlganz opened this issue 8 years ago • 7 comments

Hey thanks for the wonderful package.

I find that there are situations where I need to merge by multiple variables, but only one of those variables is fuzzy. Let's say I have a list of school names and their ZIP codes, plus a list of individuals who state the name of their school and their ZIP code. I expect there to be some error in the school name, but the ZIP code should be correct, so I try something like this:

library(fuzzyjoin)
library(dplyr)

# correct names
official <- data.frame(stringsAsFactors=FALSE,
                name = c("School One", "School Two", "School Three"),
                ZIP = c("91427", "91428", "01427"))
# slightly wrong names
response <- data.frame(stringsAsFactors=FALSE,
                name = c("School Oune", "School Two", "School Thee"),
                ZIP = c("91427"))

# inner join them
stringdist_inner_join(official, response, by =c("name","ZIP"),
                      max_dist = 2)
# don't understand why there are duplicate rows

stringdist_inner_join(official, response, by =c("name","ZIP"),
                      max_dist = 2, distance_col = "dist")
# don't understand why distance_col changes # of rows but this is # of rows I expect

# ideally I'd like to match only if the ZIPs are the same
# I can't specify different max distances for different columns though
stringdist_inner_join(official, response, by =c("name","ZIP"),
                      max_dist = c(2,0), distance_col = "dist")

So a few things:

  • Why did the inner join generate duplicate rows?

  • Why does specifying a distance_col change the number of rows in the inner join?

  • How do I approach cases where I want a different max distance for different variables?

Would it be possible to make the max_dist parameter accept a vector of distances for each by variable?

Regards, Carl

carlganz avatar Jan 19 '17 18:01 carlganz

Thanks for your report! This was simply a (serious) bug that occurred when there were multiple matching columns but no distance_col. It's now fixed.

Currently there's no way to specify different max distances, and while it would be nice to have, it's not trivial to add. I'll leave this open until I can add it!

dgrtwo avatar Jan 19 '17 23:01 dgrtwo
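
Until max_dist accepts a vector, the "fuzzy on name, exact on ZIP" case from the original post can probably be handled in one of two ways. The sketch below is only a possible workaround, not the maintainer's recommended method. Option A joins on name alone and then keeps rows whose ZIPs agree. Option B assumes that fuzzy_inner_join's match_fun argument takes a list with one function per by column, as its documentation describes. The helper within_two and the result names by_filter and by_match_fun are made up for illustration, and the stringdist package is assumed to be installed (fuzzyjoin depends on it).

```r
library(fuzzyjoin)
library(dplyr)

# same example data as in the original post
official <- data.frame(stringsAsFactors = FALSE,
                name = c("School One", "School Two", "School Three"),
                ZIP = c("91427", "91428", "01427"))
response <- data.frame(stringsAsFactors = FALSE,
                name = c("School Oune", "School Two", "School Thee"),
                ZIP = c("91427"))

# Option A: fuzzy join on name only, then require identical ZIPs.
# Columns present in both tables come back with .x / .y suffixes.
by_filter <- stringdist_inner_join(official, response, by = "name",
                                   max_dist = 2) %>%
  filter(ZIP.x == ZIP.y)

# Option B: one match function per column (assumed list form of match_fun).
# within_two is a hypothetical helper: TRUE when two strings are within
# distance 2 under stringdist's default method.
within_two <- function(a, b) stringdist::stringdist(a, b) <= 2

by_match_fun <- fuzzy_inner_join(official, response,
                                 by = c("name", "ZIP"),
                                 match_fun = list(within_two, `==`))

by_filter
by_match_fun
```

Option A builds every fuzzy name match before filtering, so it may be slow on large tables. Option B applies both conditions inside the join itself, and the exact-match function could be swapped for another distance threshold if a second column also needs to be fuzzy.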

Hello, I have been using your very useful package in my script and came across the duplicated data bug too. Do you have any intention of pushing the fixed package to CRAN? Kind regards, Chris

hally166 avatar Apr 07 '17 10:04 hally166

@hally166 Good news; the bug fix is finally on CRAN.

dgrtwo avatar Jun 20 '17 00:06 dgrtwo

Was looking to use multiple max distances today. So I'm going to +1 this.

jaredlander avatar Jun 14 '18 23:06 jaredlander

And here the same ;) +1 from my side.

culpinnis avatar Sep 18 '19 10:09 culpinnis

Same idea. A vector of max_dist. +1

celomf avatar Nov 12 '20 01:11 celomf

I agree with the previous comments; a vector of max_dist would be ideal. +1

paugrau avatar Apr 27 '23 07:04 paugrau