string-similarity

This is not the Dice coefficient

Open • vibl opened this issue 4 years ago • 1 comment

Your algorithm is not the Dice coefficient. It counts every occurrence of a repeated bigram, whereas the Dice coefficient only counts distinct bigrams (as defined in Wikipedia).
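To make the distinction concrete, here is a minimal Python sketch of the two counting conventions. The function names (`bigrams`, `dice_distinct`, `dice_multiset`) are made up for illustration, and `dice_multiset` is only an approximation of the behavior described in this issue, not the library's actual code (for instance, it does no whitespace handling).

```python
from collections import Counter


def bigrams(s):
    """Return every adjacent character pair in s, duplicates included."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice_distinct(a, b):
    """Dice coefficient over sets: each distinct bigram counts once."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 1.0  # assumption: two strings without bigrams count as identical
    return 2 * len(x & y) / (len(x) + len(y))


def dice_multiset(a, b):
    """Multiset variant: repeated bigrams count once per occurrence."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 1.0  # same assumption as above
    overlap = sum((x & y).values())  # Counter & Counter keeps the minimum counts
    return 2 * overlap / total


# "aaaa" contains the bigram "aa" three times, "aa" contains it once.
print(dice_distinct("aaaa", "aa"))   # 1.0: both sets are {"aa"}
print(dice_multiset("aaaa", "aa"))   # 0.5: 2 * 1 / (3 + 1)
```

The two variants agree whenever neither string repeats a bigram, so the gap only shows up on inputs with many repeated bigrams, such as long source files.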

As an example, let's compare two versions of the main file of this repo (https://github.com/aceakash/string-similarity/blob/2718c82bbbf5190ebb8e9c54d4cbae6d1259527a/compare-strings.js and the latest, https://github.com/aceakash/string-similarity/blob/eaeec5d74c98a6f6fcb1b06fad44ad7f3d8c2965/src/index.js). They have a Dice coefficient of 0.90, but string-similarity outputs 0.74 when comparing these two files.

Please have a look at the implementations in Talisman, NLTK or in many languages in https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient

vibl • Jun 26 '21 10:06

frr bruh like "dollar" and "money" return a match of 0 :((( like dawg I want semantic similarity who needs string similarity anyways 🤷

aimeeaidanu • Feb 24 '23 18:02