StringCompare
Checking for the null case should be done at the token bag level for the Jaccard similarity
The check for the null case should be done at the token bag level rather than the string level:
https://github.com/OlivierBinette/StringCompare/blob/be58f4c1c9c24bc2cef5d9bb81053fa7ea003792/stringcompare/distance/jaccard.py#L17
I would recommend refactoring jaccard.py as follows:
- Have the `jacard()` function take two token sets as arguments and compute their Jaccard similarity (overlap percentage). Checking for empty token bags should be done here.
- Have the `compare()` function deal with the tokenization and anything else (e.g. transforming the distance to a similarity).
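
A rough sketch of what I have in mind is below. It is not the current code in jaccard.py: the standalone function layout, the `whitespace_tokenizer` helper, and the `similarity` flag are placeholders I made up for illustration, and returning 1.0 for two empty token bags is an assumed convention that may need to change.

```python
from typing import Callable, List, Set


def jaccard(tokens_a: Set[str], tokens_b: Set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B| of two token sets."""
    union = tokens_a | tokens_b
    if not union:
        # Null case handled at the token bag level.
        # Assumption: two empty bags are treated as identical.
        return 1.0
    return len(tokens_a & tokens_b) / len(union)


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: split on runs of whitespace."""
    return text.split()


def compare(
    s: str,
    t: str,
    tokenizer: Callable[[str], List[str]] = whitespace_tokenizer,
    similarity: bool = True,
) -> float:
    """Tokenize both strings, then delegate the set computation to jaccard()."""
    sim = jaccard(set(tokenizer(s)), set(tokenizer(t)))
    # Converting between similarity and distance stays out of jaccard().
    return sim if similarity else 1.0 - sim
```

With this split, a whitespace-only string such as `"   "` is non-empty at the string level but still produces an empty token bag, so the null case is only caught reliably inside `jaccard()`.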