fuzzyjoin
max_dist=2 is inappropriate for the normalized Jaccard and cosine metrics
The docs state:
If method = "soundex", the max_dist is automatically set to 0.5, since soundex returns either a 0 (match) or a 1 (no match).
That is sensible, but the same default should apply to the other normalized metrics, including 'jaccard' and 'cosine'. Both return distances between 0 and 1, so the current default max_dist = 2 is never exceeded and every possible pair is returned as a match when either metric is used.
library(ggplot2)
library(fuzzyjoin)
library(dplyr)

data(diamonds)
d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)
print(dim(diamonds)) # 53940x10

# no matches when they are inner-joined:
match1 <- diamonds %>%
  inner_join(d, by = c(cut = "approximate_name"))
print(dim(match1)) # 0x11

# but we can match when they're fuzzy joined
match2 <- diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"), method = 'jaccard')
print(dim(match2)) # 323640x12, i.e. all pairs 53940 * 6 = 323640
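To illustrate why every pair passes the threshold, here is a minimal base-R sketch of a Jaccard distance on single-character sets. The helper `jaccard_dist` is hypothetical and is not part of fuzzyjoin or stringdist; it assumes 1-gram character sets, which is my understanding of what method = "jaccard" uses by default, so exact values may differ from the library's.

```r
# Hypothetical helper, for illustration only (not the stringdist implementation):
# Jaccard distance between the sets of unique characters of two strings.
jaccard_dist <- function(a, b) {
  x <- unique(strsplit(a, "")[[1]])
  y <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(x, y)) / length(union(x, y))
}

jaccard_dist("Premium", "Premioom")  # mostly shared characters: small distance
jaccard_dist("Ideal", "VeryGood")    # few shared characters: larger distance
jaccard_dist("abc", "xyz")           # nothing shared: 1, the largest possible value
```

Since the distance can never exceed 1, a cutoff of 2 filters nothing.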
The default value of max_dist should be set to 0.5 when method = "jaccard" or method = "cosine".
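Until the default changes, a possible workaround is to pass max_dist explicitly. The sketch below reuses the data from the example above; the name `match3` and the `dist` distance column are only illustrative choices, and 0.5 is the threshold proposed in this issue rather than a tuned value.

```r
library(ggplot2)   # provides the diamonds dataset
library(fuzzyjoin)
library(dplyr)

data(diamonds)
d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)

# Same join as before, but with an explicit threshold suited to a 0..1 metric.
# distance_col keeps the computed distance so the cutoff can be inspected.
match3 <- diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"),
                        method = "jaccard", max_dist = 0.5,
                        distance_col = "dist")
print(dim(match3))          # should be fewer rows than the 323640 of the full cross product
print(summary(match3$dist)) # all retained distances should be at most 0.5
```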