java-string-similarity
java-string-similarity copied to clipboard
N-Gram Example Comment Incorrect?
This issue was reported to my team's .NET port of your library but I confirmed that it is an issue here as well.
The example on the README shows that the N-Gram code expects values of 0.416666 and 0.97222. However, different results are given when the code is actually ran. I am not sure if this is a bug in the code, or that the README comment is outdated/incorrect.
I created a unit test for the README example, and sure enough it fails:
@Test
public void exampleFromReadme() {
// produces 0.416666
NGram twogram = new NGram(2);
assertEquals(0.416666, twogram.distance("ABCD", "ABTUIO"), 0.001);
// produces 0.97222
String s1 = "Adobe CreativeSuite 5 Master Collection from cheap 4zp";
String s2 = "Adobe CreativeSuite 5 Master Collection from cheap d1x";
NGram ngram = new NGram(4);
assertEquals(0.97222, ngram.distance(s1, s2), 0.001);
}
Results:
java.lang.AssertionError:
Expected :0.416666
Actual :0.5833333134651184
This result (0.583) is the same that we get on the .NET side of things. As I am not an expert in these algorithms, I am unsure if this is a code bug or a need to update the README.
... and I just noticed after posting this that 0.583333 ~= (1 - 0.416666). So this is an issue of difference vs similarity I guess. But I suppose we still need to confirm if the returned value from the library is incorrect, or if the comment is incorrect.
I think this might be a case of the README being incorrect, as it looks like the logic was changed as a result of #22