fuzzyjoin

max_dist for multiple columns in stringdist_join

Open mparada opened this issue 6 years ago • 9 comments

Is it possible to allow max_dist to be a vector so that it can be different for each column passed to stringdist_join's by argument?

Something like this: stringdist_inner_join(df1, df2, by=c("col1", "col2"), max_dist=c(1, 2))

mparada avatar Sep 25 '17 09:09 mparada

I was looking for this functionality as well today.

Having distance thresholds tailored to each column would be a much-used feature.

For example, the string distance threshold for joining by "state" should be lower than the threshold for joining by "person name", because "person name" has a higher typo potential.

statsccpr avatar Jun 18 '18 18:06 statsccpr
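A possible way to get per-column string distance thresholds today is to pass fuzzy_join a list of match functions, one per column in by. The following is a minimal sketch, not a built-in feature: the data frames and the helper within_dist are hypothetical, and it assumes the stringdist package is installed alongside fuzzyjoin.

```r
library(fuzzyjoin)
library(stringdist)

# Hypothetical example data
df1 <- data.frame(
  state = c("Texas", "Ohio"),
  name  = c("Jon Smith", "Mary Jones"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  state = c("Texes", "Ohio"),
  name  = c("John Smyth", "Marie Jonas"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a match function with its own threshold.
# The returned function compares two equal-length character vectors
# element by element and returns TRUE where the distance is small enough.
within_dist <- function(max_dist, method = "osa") {
  function(x, y) stringdist(x, y, method = method) <= max_dist
}

# One match function per column in `by`:
# "state" tolerates a distance of 1, "name" a distance of 2.
joined <- fuzzy_inner_join(
  df1, df2,
  by = c("state", "name"),
  match_fun = list(within_dist(1), within_dist(2))
)
```

A row pair is kept only when every column passes its own threshold, so tightening the "state" threshold does not affect how forgiving the "name" comparison is.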

I was looking for this too! I would love this functionality.

nutnetadmin avatar Sep 19 '19 17:09 nutnetadmin

This would be very useful!

ssmil avatar Apr 05 '21 12:04 ssmil

Same here! Such a functionality would be great.

elehna avatar Oct 19 '21 18:10 elehna

This functionality would be fantastic. I wish I had skills to contribute to get it.

bluejayblues avatar Nov 26 '21 00:11 bluejayblues

This is exactly what I need! Does anyone have a solution for this? My variables are numeric, and I basically need something like this: distance_join(df1, df2, by = c('v1', 'v2'), max_dist = c(0.01, 0.001), mode = "inner", distance_col = NULL)

But obviously I get an error because I pass a vector to max_dist.

possumhound avatar Apr 12 '22 22:04 possumhound

Requesting support for this feature in all joins, not just stringdist! It would be great for spatiotemporal data, where the units of lat/lon and time are very different internally.

wkumler avatar May 05 '22 18:05 wkumler

Same as everyone else! This would be fantastic to have

gitoro1 avatar Nov 28 '22 21:11 gitoro1

In the meantime, it can be achieved using fuzzy_join with a list of functions in match_fun. A bit clunky, but it works.

library(fuzzyjoin)

# Lower and upper bounds around each df2 value; modify to your distance needs
df2$var1_min <- df2$var1 - 0.01
df2$var1_max <- df2$var1 + 0.01
df2$var2_min <- df2$var2 - 0.001
df2$var2_max <- df2$var2 + 0.001

# Keep rows where each df1 value falls within the matching df2 bounds
joined_df <- fuzzy_join(
  df1, df2,
  by = c("var1" = "var1_min", "var1" = "var1_max",
         "var2" = "var2_min", "var2" = "var2_max"),
  match_fun = list(`>=`, `<=`, `>=`, `<=`),
  mode = "inner"  # change the mode to your needs
)

possumhound avatar Nov 28 '22 22:11 possumhound
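The same idea can probably be written without the extra min/max columns by comparing absolute differences directly. This is only a sketch under assumptions: the example data and the helper within_tol are hypothetical, and the tolerances are the ones from the comment above.

```r
library(fuzzyjoin)

# Hypothetical example data
df1 <- data.frame(var1 = c(1.000, 2.000), var2 = c(0.5000, 0.7000))
df2 <- data.frame(var1 = c(1.005, 2.050), var2 = c(0.5005, 0.7001))

# Hypothetical helper: returns a match function for one tolerance.
# The returned function is TRUE where two numeric values differ by at most tol.
within_tol <- function(tol) {
  function(x, y) abs(x - y) <= tol
}

# One tolerance per column: 0.01 for var1, 0.001 for var2
joined_df <- fuzzy_join(
  df1, df2,
  by = c("var1", "var2"),
  match_fun = list(within_tol(0.01), within_tol(0.001)),
  mode = "inner"  # change the mode to your needs
)
```

Because both data frames share the column names, the result carries them with .x and .y suffixes (var1.x, var1.y, and so on). The trade-off against the min/max version is mostly readability: the tolerance for each column sits next to the column it applies to.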