java-string-similarity

Implement Kondrak n-gram similarity

Open tdebatty opened this issue 8 years ago • 4 comments

tdebatty avatar Aug 11 '16 07:08 tdebatty

How about using the Lucene implementation?

MpoMp avatar Sep 08 '16 09:09 MpoMp

@MpoMp Sounds like a good idea. Based on my interpretation of the Apache license, I assume we would just need to make sure to include the Apache license header and indicate that we changed the code (to fit this library's constraints)? Despite being an ASF member and my familiarity with Lucene and its respective projects, I'm not an expert in licenses and copying the actual code.

(To introduce myself, I'm a member of the team that ported this library to .NET, and a member of the Lucene.NET PMC.)

cc @jamesmblair

paulirwin avatar Sep 22 '16 15:09 paulirwin

Disregard my deleted comment. What I meant was, isn't this the same as src/main/java/info/debatty/java/stringsimilarity/NGram.java?

paulirwin avatar Sep 23 '16 19:09 paulirwin

Looks like I misinterpreted the issue in the first place.

As @tdebatty mentions in the README:

The algorithm uses affixing with special character '\n' to increase the weight of first characters. The normalization is achieved by dividing the total similarity score by the original length of the longest word.

In the paper, Kondrak also defines a similarity measure, which is not implemented (yet).

This probably refers to the algorithms described on page 8 here.
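To make the quoted README description concrete, here is a minimal sketch of an n-gram distance with '\n' prefix affixing and normalization by the length of the longer word. It is not the code in NGram.java or in Lucene: the class and method names are hypothetical, and the substitution cost (fraction of mismatched positions between two n-grams) is my assumption about the positional variant. The real implementations also treat matches on the padding character specially, which this sketch does not.

```java
// Hypothetical sketch, not part of java-string-similarity or Lucene.
class NGramDistanceSketch {

    // Returns a value in [0, 1]; 0 means the strings are identical.
    static double distance(String s, String t, int n) {
        if (s.isEmpty() || t.isEmpty()) {
            return s.equals(t) ? 0.0 : 1.0;
        }

        // Affixing: prepend n - 1 copies of '\n' so that the first
        // characters take part in as many n-grams as the later ones.
        StringBuilder pad = new StringBuilder();
        for (int k = 0; k < n - 1; k++) {
            pad.append('\n');
        }
        String sp = pad + s;
        String tp = pad + t;

        int sl = s.length();
        int tl = t.length();

        // d[i][j] = cost of aligning the first i n-grams of s
        // with the first j n-grams of t (edit-distance style table).
        double[][] d = new double[sl + 1][tl + 1];
        for (int i = 0; i <= sl; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= tl; j++) {
            d[0][j] = j;
        }

        for (int i = 1; i <= sl; i++) {
            for (int j = 1; j <= tl; j++) {
                // Assumed substitution cost: share of positions at which
                // the i-th n-gram of s and the j-th n-gram of t differ.
                int mismatches = 0;
                for (int k = 0; k < n; k++) {
                    if (sp.charAt(i - 1 + k) != tp.charAt(j - 1 + k)) {
                        mismatches++;
                    }
                }
                double substitute = d[i - 1][j - 1] + (double) mismatches / n;
                double insertOrDelete = Math.min(d[i - 1][j], d[i][j - 1]) + 1.0;
                d[i][j] = Math.min(substitute, insertOrDelete);
            }
        }

        // Normalization: divide the total by the original length of the longer word.
        return d[sl][tl] / Math.max(sl, tl);
    }
}
```

Calling `NGramDistanceSketch.distance("abc", "abd", 2)` compares the bigrams "\na", "ab", "bc" against "\na", "ab", "bd"; only the last pair differs, in one of its two positions. A similarity measure as requested in this issue would need its own recurrence from the paper, which this sketch does not attempt.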

MpoMp avatar Sep 26 '16 13:09 MpoMp