java-string-similarity

Jaro Winkler similarity on short strings

Open fabriziofortino opened this issue 7 years ago • 2 comments

I am trying to use Jaro-Winkler similarity to check color strings coming from a user-submitted form against a fixed palette of colors.

Using Jaro-Winkler similarity, I get these kinds of results for very short strings:

  • s1 = "ed" - s2 = "red" -> similarity = 0
  • s1 = "nude" - s2 = "red" -> similarity = 0.5833333134651184

Is it correct to get similarity = 0 in the first case?

fabriziofortino avatar Nov 01 '18 15:11 fabriziofortino

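The Jaro similarity of ed and red is 0, since the number of matching characters (parameter m) is 0. Two characters only count as matching if their positions differ by at most floor(max(|s1|, |s2|) / 2) - 1, which is 0 here, so they must sit at the same index, and e/r and d/e differ. Furthermore, the length of the common prefix of s1 and s2 (parameter l) is 0. This results in a Jaro-Winkler similarity of 0, as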

sim_jw = sim_j + l * 0.1 * (1 - sim_j) = 0 + 0 * 0.1 * 1 = 0

Jaro-Winkler gives more favorable ratings to strings that match from the beginning.
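
To make the calculation above easy to reproduce, here is a minimal sketch of the textbook formula, not the library's actual code. The class and method names (JaroWinklerSketch, jaro, jaroWinkler) are made up for illustration, and the sketch assumes the usual match window of floor(max(|s1|, |s2|) / 2) - 1, a scaling factor of 0.1, and a common prefix capped at 4 characters.

```java
final class JaroWinklerSketch {

    // Textbook Jaro similarity: m matches inside a sliding window, t transpositions.
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;

        // Characters match only if their positions differ by at most `window`.
        int window = Math.max(Math.max(len1, len2) / 2 - 1, 0);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];

        int m = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) return 0.0;

        // Matched characters that line up in a different order; t is half that count.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[j]) j++;
            if (s1.charAt(i) != s2.charAt(j)) outOfOrder++;
            j++;
        }
        double t = outOfOrder / 2.0;
        return ((double) m / len1 + (double) m / len2 + (m - t) / m) / 3.0;
    }

    // sim_jw = sim_j + l * 0.1 * (1 - sim_j), with l = common prefix length (assumed max 4).
    static double jaroWinkler(String s1, String s2) {
        double simJ = jaro(s1, s2);
        int maxPrefix = Math.min(4, Math.min(s1.length(), s2.length()));
        int l = 0;
        while (l < maxPrefix && s1.charAt(l) == s2.charAt(l)) l++;
        return simJ + l * 0.1 * (1 - simJ);
    }

    public static void main(String[] args) {
        // Window is 0 for "ed" vs "red", so nothing matches and the prefix is empty.
        System.out.println(jaroWinkler("ed", "red")); // 0.0
    }
}
```

With such short strings the window shrinks to 0, so a single missing leading character shifts every position and removes all matches. If inputs like "ed" should still score close to "red", an edit-distance-based measure may fit that use case better.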

saschaszott avatar Dec 26 '18 11:12 saschaszott

When I compare the two strings "abcdefghij" and "aaaaaaaaa" with JaroWinkler, my output comes to around 0.4023.....

When I check the same on https://asecuritysite.com/forensics/simstring, it gives me 0.46. Kindly help in this regard.

manshulgoel avatar Apr 25 '19 10:04 manshulgoel