fuzzyjoin
max_dist for multiple columns in stringdist_join
Is it possible to allow max_dist to be a vector so that it can be different for each column passed to stringdist_join's by argument?
Something like this: stringdist_inner_join(df1, df2, by=c("col1", "col2"), max_dist=c(1, 2))
I was looking for this functionality as well today. Having distance thresholds tailored to each column would be a much-used feature. For example, the string distance threshold when joining by "state" should be lower than the threshold when joining by "person name", because "person name" has higher typo potential.
I was looking for this too! I would love this functionality.
This would be very useful!
Same here! Such a functionality would be great.
This functionality would be fantastic. I wish I had skills to contribute to get it.
This is exactly what I need! Does anyone have a solution for this? My variables are numeric. I basically need something like this: distance_join(df1, df2, by = c('v1', 'v2'), max_dist = c(0.01, 0.001), mode = "inner", distance_col = NULL)
but obviously I get an error because I pass a vector to max_dist.
Adding support for this feature in all joins, not just stringdist! Would be great for spatiotemporal data where the units of lat/lon and time are very different internally.
Same as everyone else! This would be fantastic to have
In the meantime, it can be achieved using match_fun. A bit clunky but it works.
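library(fuzzyjoin)
# Build lower and upper bounds around each df2 value (modify the tolerances to your distance needs)
df2$var1_min <- df2$var1 - 0.01
df2$var1_max <- df2$var1 + 0.01
df2$var2_min <- df2$var2 - 0.001
df2$var2_max <- df2$var2 + 0.001
# Each df1 value must fall between the matching _min and _max columns of df2
joined_df <- fuzzy_join(df1, df2,
                        by = c("var1" = "var1_min",
                               "var1" = "var1_max",
                               "var2" = "var2_min",
                               "var2" = "var2_max"),
                        match_fun = list(`>=`, `<=`, `>=`, `<=`),  # operators need backticks to be passed as functions
                        mode = "inner")  # change the mode to your needs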
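For the original string-distance question, the same match_fun idea seems to work without helper columns. The sketch below passes one small closure per column, each carrying its own threshold. The helpers within_dist and within_tol are hypothetical names, not part of fuzzyjoin, and the data frames are toy data invented for illustration. It assumes the stringdist package is installed and uses its default "osa" method.

```r
library(fuzzyjoin)   # fuzzy_join()
library(stringdist)  # stringdist()

# Hypothetical helpers (not part of fuzzyjoin): each returns a vectorized
# predicate that carries its own threshold.
within_dist <- function(max_dist, method = "osa") {
  function(x, y) stringdist(x, y, method = method) <= max_dist
}
within_tol <- function(tol) {
  function(x, y) abs(x - y) <= tol
}

# Toy data, invented for illustration.
df1 <- data.frame(state = c("Ohio", "Texas"),
                  name  = c("Jon Smith", "Maria Garcia"),
                  v1    = c(1.000, 2.000),
                  stringsAsFactors = FALSE)
df2 <- data.frame(state = c("Ohoi", "Texas"),
                  name  = c("John Smyth", "Maria Garcia"),
                  v1    = c(1.005, 2.500),
                  stringsAsFactors = FALSE)

# One threshold per column: at most 1 edit for state, at most 2 for name.
joined_str <- fuzzy_join(df1, df2,
                         by = c("state", "name"),
                         match_fun = list(within_dist(1), within_dist(2)),
                         mode = "inner")

# The same pattern mixes string and numeric columns.
joined_mix <- fuzzy_join(df1, df2,
                         by = c("state", "v1"),
                         match_fun = list(within_dist(1), within_tol(0.01)),
                         mode = "inner")
```

A row pair is kept only when every column's predicate accepts it, so tightening one threshold does not affect the others. Note that this compares every unique value in one table against every unique value in the other, so it may be slow on large inputs.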