java-string-similarity Implement Kondrak n-gram similarity

Open tdebatty opened this issue 8 years ago • 4 comments

Aug 11 '16 07:08 tdebatty

How about using the Lucene implementation?

Sep 08 '16 09:09 MpoMp

@MpoMp Sounds like a good idea. Based on my interpretation of the Apache license, I assume we would just need to make sure to include the Apache license header and indicate that we changed the code (to fit this library's constraints)? Despite being an ASF member and my familiarity with Lucene and its respective projects, I'm not an expert in licenses and copying the actual code.

(To introduce myself, I'm a member of the team that ported this library to .NET, and a member of the Lucene.NET PMC.)

cc @jamesmblair

Sep 22 '16 15:09 paulirwin

Disregard my deleted comment. What I meant was, isn't this the same as src/main/java/info/debatty/java/stringsimilarity/NGram.java?

Sep 23 '16 19:09 paulirwin

Looks like I misinterpreted the issue in the first place.

As @tdebatty mentions in the README:

The algorithm uses affixing with special character '\n' to increase the weight of first characters. The normalization is achieved by dividing the total similarity score by the original length of the longest word.

In the paper, Kondrak also defines a similarity measure, which is not implemented (yet).

Which probably refers to the algorithms described on page 8 here.
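
To make the README description above concrete, here is a minimal sketch of a normalized positional n-gram distance with '\n' prefix padding. It is not the code from NGram.java or from Lucene. The class name NGramDistanceSketch and its methods are hypothetical, and the sketch treats padding characters like ordinary characters, which a fuller implementation may handle differently. It covers the distance only, not the similarity measure from the paper.

```java
/** Simplified, illustrative sketch; names here are hypothetical, not the library's API. */
final class NGramDistanceSketch {

    // Assumed padding character, following the '\n' mentioned in the README quote.
    private static final char PAD = '\n';

    /** Returns a distance in [0, 1]; 0 means the strings are identical. */
    static double distance(String a, String b, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        if (a.equals(b)) {
            return 0.0;
        }
        if (a.isEmpty() || b.isEmpty()) {
            return 1.0;
        }

        // Prefix each string with n-1 padding characters so that every
        // original character is the last character of some n-gram.
        String pa = pad(a, n);
        String pb = pad(b, n);
        int la = a.length();
        int lb = b.length();

        // d[i][j] = cost of aligning the first i n-grams of a with the first j of b.
        double[][] d = new double[la + 1][lb + 1];
        for (int i = 0; i <= la; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= lb; j++) {
            d[0][j] = j;
        }

        for (int i = 1; i <= la; i++) {
            for (int j = 1; j <= lb; j++) {
                // Substitution cost: fraction of positions at which the two n-grams differ.
                int mismatches = 0;
                for (int k = 0; k < n; k++) {
                    if (pa.charAt(i - 1 + k) != pb.charAt(j - 1 + k)) {
                        mismatches++;
                    }
                }
                double cost = (double) mismatches / n;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }

        // Normalize by the original (unpadded) length of the longer string.
        return d[la][lb] / Math.max(la, lb);
    }

    private static String pad(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < n - 1; k++) {
            sb.append(PAD);
        }
        return sb.append(s).toString();
    }
}
```

With n = 1 there is no padding and the sketch reduces to Levenshtein distance divided by the longer length. With n = 2, NGramDistanceSketch.distance("a", "b", 2) evaluates to 0.5, because the two bigrams share the padding character and differ only in the second position.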

Sep 26 '16 13:09 MpoMp
