java-string-similarity icon indicating copy to clipboard operation
java-string-similarity copied to clipboard

N-Gram Example Comment Incorrect?

Open paulirwin opened this issue 6 years ago • 2 comments
trafficstars

This issue was reported to my team's .NET port of your library but I confirmed that it is an issue here as well.

The example on the README shows that the N-Gram code expects values of 0.416666 and 0.97222. However, different results are given when the code is actually ran. I am not sure if this is a bug in the code, or that the README comment is outdated/incorrect.

I created a unit test for the README example, and sure enough it fails:

    @Test
    public void exampleFromReadme() {
        // produces 0.416666
        NGram twogram = new NGram(2);
        assertEquals(0.416666, twogram.distance("ABCD", "ABTUIO"), 0.001);

        // produces 0.97222
        String s1 = "Adobe CreativeSuite 5 Master Collection from cheap 4zp";
        String s2 = "Adobe CreativeSuite 5 Master Collection from cheap d1x";
        NGram ngram = new NGram(4);
        assertEquals(0.97222, ngram.distance(s1, s2), 0.001);
    }

Results:

java.lang.AssertionError: 
Expected :0.416666
Actual   :0.5833333134651184

This result (0.583) is the same that we get on the .NET side of things. As I am not an expert in these algorithms, I am unsure if this is a code bug or a need to update the README.

paulirwin avatar Oct 02 '19 14:10 paulirwin

... and I just noticed after posting this that 0.583333 ~= (1 - 0.416666). So this is an issue of difference vs similarity I guess. But I suppose we still need to confirm if the returned value from the library is incorrect, or if the comment is incorrect.

paulirwin avatar Oct 02 '19 15:10 paulirwin

I think this might be a case of the README being incorrect, as it looks like the logic was changed as a result of #22

paulirwin avatar Oct 02 '19 16:10 paulirwin