Checking for the null case should be done at the token bag level for the Jaccard similarity

Open • OlivierBinette opened this issue 2 years ago • 1 comment

The check for the null case should be done at the token bag level rather than the string level:

https://github.com/OlivierBinette/StringCompare/blob/be58f4c1c9c24bc2cef5d9bb81053fa7ea003792/stringcompare/distance/jaccard.py#L17

I would recommend refactoring jaccard.py as follows:

  1. Have the jaccard() function take two token sets as arguments and compute their Jaccard similarity (overlap percentage). Checking for empty token bags should be done here.
  2. Have the compare() function deal with the tokenization and anything else (e.g. transforming the similarity to a distance).
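
A minimal sketch of what this split could look like, not the actual StringCompare code. The `whitespace_tokens` helper, the `tokenizer` and `similarity` parameters, and the choice to return 1.0 for two empty token sets are all assumptions made for illustration. It also shows why the check belongs at the token level: a whitespace-only string is non-empty as a string but produces an empty token set.

```python
from typing import Callable, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    if not a and not b:
        # Null case handled on the token sets themselves.
        # Assumption: two empty token sets count as identical (1.0).
        return 1.0
    return len(a & b) / len(a | b)


def whitespace_tokens(s: str) -> Set[str]:
    """Hypothetical tokenizer: split on whitespace, keep unique tokens."""
    return set(s.split())


def compare(
    s: str,
    t: str,
    tokenizer: Callable[[str], Set[str]] = whitespace_tokens,
    similarity: bool = True,
) -> float:
    """Tokenize both strings, then delegate to jaccard()."""
    sim = jaccard(tokenizer(s), tokenizer(t))
    # Optionally convert the similarity into a distance.
    return sim if similarity else 1.0 - sim
```

With this layout, `compare("   ", "")` reaches the empty-set branch in `jaccard()` even though the first string is not empty, which a string-level check would miss.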

OlivierBinette, Apr 13 '22 19:04