fuzzyjoin

max_dist=2 is inappropriate for the normalized Jaccard and cosine metrics

Open Yorko opened this issue 3 years ago • 0 comments

The docs state that

If method = "soundex", the max_dist is automatically set to 0.5, since soundex returns either a 0 (match) or a 1 (no match).

And that's good. But the same default should apply to the other normalized metrics, including 'jaccard' and 'cosine', whose distances never exceed 1.

Right now, the default value max_dist = 2 means that every possible pair is returned as a match when method is 'jaccard' or 'cosine'.

library(ggplot2)
library(fuzzyjoin)
library(dplyr)

data(diamonds)

d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)

print(dim(diamonds))  # 53940x10

# no matches when they are inner-joined:
match1 <- diamonds %>%
  inner_join(d, by = c(cut = "approximate_name"))

print(dim(match1))  # 0x11 

# with a fuzzy join on Jaccard distance and the default max_dist = 2,
# every row of diamonds is paired with every row of d
match2 <- diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"), method = "jaccard")

print(dim(match2))  # 323640x12, i.e. all pairs 53940 * 6 = 323640
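
The reason every pair survives can be checked directly with the stringdist package, which fuzzyjoin calls for these metrics. A minimal sketch, assuming stringdist is installed and using q = 1 (the stringdist default); the example strings are only illustrative:

library(stringdist)

a <- c("Ideal", "Premium", "Fair")
b <- c("Idea", "Premioom", "zzz")

# Jaccard and cosine distances on q-gram profiles are normalized to [0, 1]
d_jaccard <- stringdist(a, b, method = "jaccard", q = 1)
d_cosine  <- stringdist(a, b, method = "cosine", q = 1)

print(d_jaccard)
print(d_cosine)

# Even a pair with no characters in common ("Fair" vs "zzz") stays at or
# below 1, so a threshold of 2 can never exclude anything
stopifnot(all(d_jaccard <= 1 + 1e-12), all(d_cosine <= 1 + 1e-12))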

The default value of max_dist should be set to 0.5 when method = "jaccard" or method = "cosine".
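
Until the default changes, a possible workaround is to pass max_dist explicitly. The sketch below is only an illustration: it uses a small hypothetical table of the five cut labels instead of the full diamonds data so that it runs quickly, and the column name dist is an arbitrary choice for distance_col.

library(fuzzyjoin)
library(dplyr)

# hypothetical lookup table holding the distinct cut labels
cuts <- tibble(cut = c("Fair", "Good", "Very Good", "Premium", "Ideal"))

d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)

# default max_dist = 2: nothing is filtered out for a normalized metric
all_pairs <- cuts %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"),
                        method = "jaccard")

# explicit threshold: only pairs with Jaccard distance <= 0.5 are kept,
# and the distance itself is stored in the "dist" column
filtered <- cuts %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"),
                        method = "jaccard", max_dist = 0.5,
                        distance_col = "dist")

print(nrow(all_pairs))
print(filtered)

The same call works with method = "cosine"; the 0.5 threshold is the value proposed in this issue, not a tuned recommendation.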

Yorko, May 06 '21 12:05