stringdist
alternative Jaccard
Tom Magerman suggested by e-mail adding an alternative Jaccard measure to the q-gram distances.
On top of that it would be nice to be able to use the Jaccard similarity with whole words instead of q-grams. What do you think about it?
You can tokenize using one of the many tokenizers available in R, then hash the tokens (words) to integers using the hashr package, and then use stringdist::seq_dist.
It's basically why I wrote the hashr package, but it has gone unnoticed. I should blog about this at some point.
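
A minimal sketch of the tokenize, hash, and compare approach described above. The example sentences and the whitespace tokenizer (`strsplit`) are assumptions chosen for illustration; any tokenizer that returns a list of character vectors should work the same way.

```r
library(hashr)
library(stringdist)

# Two example sentences (made up for illustration).
a <- "the quick brown fox"
b <- "the quick red fox"

# Tokenize on whitespace; strsplit returns a list with one
# character vector of words per input string.
a_words <- strsplit(a, "[[:blank:]]+")
b_words <- strsplit(b, "[[:blank:]]+")

# Hash every word to an integer, so each sentence becomes
# an integer sequence.
a_int <- hash(a_words)
b_int <- hash(b_words)

# With q = 1 each hashed word acts as a single "q-gram", so this is
# the Jaccard distance between the two sets of words.
d <- seq_dist(a_int, b_int, method = "jaccard", q = 1)

# Jaccard similarity is one minus the distance.
jaccard_similarity <- 1 - d
```

Setting `q = 2` would instead compare sets of word bigrams rather than single words.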