fuzzyjoin
fuzzy join based on similarity instead of distance
Hi! Thanks for this wonderful package.
I am interested in matching two columns by similarity score, and I read in the README that only the stringdist_* family of functions is provided. Is there a way to use join functions based on stringsim?
Thanks a lot!
It seems that, in the method = 'jw' case, setting max_dist = 0.1 is equivalent to setting a similarity threshold of 0.9. Is such a shortcut/workaround available for other distance functions as well?
(BTW, the default max_dist = 2 under method = 'jw' seems to always match.)
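To make the distance/similarity relationship above concrete, here is a minimal sketch using the released stringdist and fuzzyjoin packages. It assumes the example strings used later in this thread; the variable names (d_jw, jw_joined, and so on) are only illustrative. Jaro-Winkler distance is bounded by 1, which is why max_dist = 2 always matches, while edit-count methods such as "osa" are rescaled by string length inside stringsim(), so a fixed max_dist does not translate into one similarity cutoff.

```r
library(dplyr)
library(fuzzyjoin)
library(stringdist)

a <- tibble(id = 1, text = "Lorem ipsum dolor sit")
b <- tibble(id = 2, text = "Lorem ipsum dolor sit amet")

# Jaro-Winkler distance lies in [0, 1], so similarity is simply 1 - distance.
d_jw <- stringdist(a$text, b$text, method = "jw")
s_jw <- stringsim(a$text, b$text, method = "jw")
jw_equivalent <- isTRUE(all.equal(s_jw, 1 - d_jw))

# So max_dist = 0.1 behaves like a similarity cutoff of 0.9 for method = "jw".
jw_joined <- a %>%
  stringdist_left_join(b, by = "text", method = "jw",
                       max_dist = 0.1, distance_col = "dist") %>%
  mutate(sim = 1 - dist)

# OSA counts edits; stringsim() divides by the longer string's length,
# so the same max_dist means a different similarity for each pair of strings.
d_osa <- stringdist(a$text, b$text, method = "osa")
s_osa <- stringsim(a$text, b$text, method = "osa")
longest <- max(nchar(a$text), nchar(b$text))
osa_rescaled <- isTRUE(all.equal(s_osa, 1 - d_osa / longest))
```

For length-dependent methods, one option is to join with a generous max_dist and then filter on a similarity column computed afterwards.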
I thought this was a pretty good idea and implemented the function(s). Not sure what @dgrtwo will think of it, but it was good practice. This is how it works:
library(dplyr)
library(fuzzyjoin)
a <- tibble(id = 1, text = "Lorem ipsum dolor sit")
b <- tibble(id = 2, text = "Lorem ipsum dolor sit amet")
stringdist::stringsim(a$text[1], b$text[1], method = "soundex")
#> [1] 1
a %>%
  stringsim_left_join(b, by = "text", similarity_col = "sim", min_sim = 0.8)
#> # A tibble: 1 x 5
#>    id.x text.x                 id.y text.y                       sim
#>   <dbl> <chr>                 <dbl> <chr>                      <dbl>
#> 1     1 Lorem ipsum dolor sit     2 Lorem ipsum dolor sit amet 0.808
You can test it from my repo (remotes::install_github("JBGruber/fuzzyjoin")).
@JBGruber Thanks a lot! I tried it out a bit and it seems that your implementation works fine. I am not sure what @dgrtwo would think but I personally like it!
Maybe you can try to send a PR and see whether they would like to merge it into the main branch?
I already created the PR but haven't got a reply yet: https://github.com/dgrtwo/fuzzyjoin/pull/74
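Until something like this is merged, a similarity-based join can be approximated with the generic fuzzy_left_join() from the released fuzzyjoin package. The sketch below is only one possible workaround, not the implementation from the PR: sim_at_least() is a hypothetical helper written for this example, and the similarity column is computed after the join rather than by the join itself.

```r
library(dplyr)
library(fuzzyjoin)

a <- tibble(id = 1, text = "Lorem ipsum dolor sit")
b <- tibble(id = 2, text = "Lorem ipsum dolor sit amet")

# Hypothetical helper (not part of fuzzyjoin): builds a match function that
# returns TRUE where stringdist::stringsim() is at least min_sim.
sim_at_least <- function(min_sim, method = "osa") {
  function(x, y) stringdist::stringsim(x, y, method = method) >= min_sim
}

joined <- a %>%
  fuzzy_left_join(b, by = "text", match_fun = sim_at_least(0.8)) %>%
  # Recompute the similarity for the matched pairs so it can be inspected.
  mutate(sim = stringdist::stringsim(text.x, text.y, method = "osa"))
```

With the two example rows this keeps the single match, and sim agrees with the 0.808 shown in the output above. Rows of a without a match are kept by the left join with NA in the columns from b.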