
alternative Jaccard

Open markvanderloo opened this issue 6 years ago • 2 comments

Suggested by Tom Magerman by e-mail to add

to the q-gram distances

markvanderloo, Jun 14 '19 07:06

On top of that it would be nice to be able to use the Jaccard similarity with whole words instead of q-grams. What do you think about it?

wynksaiddestroy, Aug 04 '20 12:08

You can tokenize using one of the many tokenizers available in R, then hash the tokens (words) to integers using the hashr package, and then use stringdist::seq_dist.

It's basically why I wrote the hashr package, but it has gone unnoticed. I should blog about this at some point.
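A minimal sketch of the tokenize, hash, compare workflow described above. It is an illustration under stated assumptions, not code from this thread: it assumes whitespace splitting with base R's `strsplit` is an acceptable tokenizer, and it assumes that `method = "jaccard"` with `q = 1` in `seq_dist` gives the Jaccard distance over the sets of hashed words. The example strings are made up.

```r
library(hashr)
library(stringdist)

# Hypothetical example strings
a <- "the quick brown fox"
b <- "the lazy brown dog"

# Tokenize into words on whitespace (assumption: any R tokenizer could be used here)
tokens_a <- strsplit(tolower(a), "\\s+")
tokens_b <- strsplit(tolower(b), "\\s+")

# Hash each word to an integer; for a list input, hash() works element-wise
ha <- hash(tokens_a)
hb <- hash(tokens_b)

# Jaccard distance between the integer sequences, using single tokens (q = 1)
d <- seq_dist(ha, hb, method = "jaccard", q = 1)

# Corresponding word-level Jaccard similarity
sim <- 1 - d
print(sim)
```

Because the words are reduced to integers before comparison, the other `seq_dist` methods can be applied to the same hashed sequences in the same way.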

markvanderloo, Aug 07 '20 21:08