fuzzyjoin
A couple of questions and a request
Hey thanks for the wonderful package.
I find that there are situations where I need to merge by multiple variables, but only one of those variables is fuzzy. Let's say I have a list of school names and their zip codes, plus a list of individuals who state the name of their school and their zip code. I expect there to be some error in the school name, but the ZIP code should be correct, so I try something like this:
library(fuzzyjoin)
library(dplyr)
# correct names
official <- data.frame(stringsAsFactors = FALSE,
                       name = c("School One", "School Two", "School Three"),
                       ZIP = c("91427", "91428", "01427"))
# slightly wrong names
response <- data.frame(stringsAsFactors = FALSE,
                       name = c("School Oune", "School Two", "School Thee"),
                       ZIP = c("91427"))  # single value, recycled to all three rows
# inner join them
stringdist_inner_join(official, response, by = c("name", "ZIP"),
                      max_dist = 2)
# don't understand why there are duplicate rows
stringdist_inner_join(official, response, by = c("name", "ZIP"),
                      max_dist = 2, distance_col = "dist")
# don't understand why distance_col changes the number of rows, but this is the number of rows I expect
# ideally I'd like to match only if the ZIPs are the same
# I can't specify different max distances for different columns though
stringdist_inner_join(official, response, by = c("name", "ZIP"),
                      max_dist = c(2, 0), distance_col = "dist")
So a few things:
- Why did the inner join generate duplicate rows?
- Why does specifying a distance_col change the number of rows in the inner join?
- How do I approach cases where I want a different max distance for different variables?
Would it be possible to make the max_dist parameter accept a vector of distances for each by variable?
Regards, Carl
Thanks for your report! This was simply a (serious) bug that occurred when there were multiple matching columns but no distance_col. It's now fixed.
Currently there's no way to specify different max distances, and while it would be nice to have, it's not trivial to add. I'll leave this open until I can add it!
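Until max_dist accepts a vector, one possible workaround is the more general fuzzy_inner_join(), which takes a list of match functions, one per column in by. The sketch below is not the maintainer's suggestion, just an illustration under stated assumptions: the helper names name_match and zip_match are made up for this example, and the name distance uses stringdist's default method.

```r
library(fuzzyjoin)

official <- data.frame(stringsAsFactors = FALSE,
                       name = c("School One", "School Two", "School Three"),
                       ZIP = c("91427", "91428", "01427"))
response <- data.frame(stringsAsFactors = FALSE,
                       name = c("School Oune", "School Two", "School Thee"),
                       ZIP = rep("91427", 3))

# Hypothetical helpers: each takes two equal-length vectors and
# returns a logical vector saying which pairs count as a match.
name_match <- function(x, y) stringdist::stringdist(x, y) <= 2  # fuzzy on name
zip_match  <- function(x, y) x == y                             # exact on ZIP

# The functions are applied in the same order as the columns in `by`;
# a pair of rows is kept only if every column matches.
matched <- fuzzy_inner_join(official, response,
                            by = c("name", "ZIP"),
                            match_fun = list(name_match, zip_match))
matched
```

With this data only "School One" / "School Oune" should survive, because the other official rows have a different ZIP and the other response names are more than two edits away from "School One". This gives per-column thresholds but no distance column; adding one would need extra work.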
Hello, I have been using your very useful package in my script and came across the duplicated data bug too. Do you have any intention of pushing the fixed package to CRAN? Kind regards Chris
@hally166 Good news; the bug fix is finally on CRAN.
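For anyone who wants to confirm that their installed version includes the fix, a quick sanity check along these lines may help. It reuses the data from the original report and only assumes that, once the bug is gone, the join returns the same number of rows with and without distance_col and contains no duplicated rows.

```r
library(fuzzyjoin)

official <- data.frame(stringsAsFactors = FALSE,
                       name = c("School One", "School Two", "School Three"),
                       ZIP = c("91427", "91428", "01427"))
response <- data.frame(stringsAsFactors = FALSE,
                       name = c("School Oune", "School Two", "School Thee"),
                       ZIP = rep("91427", 3))

# Same join, once without and once with a distance column
without_dist <- stringdist_inner_join(official, response,
                                      by = c("name", "ZIP"), max_dist = 2)
with_dist <- stringdist_inner_join(official, response,
                                   by = c("name", "ZIP"), max_dist = 2,
                                   distance_col = "dist")

# Both checks should be TRUE on a version that includes the fix
same_rows <- nrow(without_dist) == nrow(with_dist)
no_duplicates <- anyDuplicated(without_dist) == 0
c(same_rows = same_rows, no_duplicates = no_duplicates)
```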
Was looking to use multiple max distances today. So I'm going to +1 this.
And here the same ;) +1 from my side.
Same idea. A vector of max_dist. +1
I agree with the previous comments, a vector of max_dist would be ideal. +1