java-string-similarity
Implement Kondrak n-gram similarity
How about using the Lucene implementation?
@MpoMp Sounds like a good idea. Based on my reading of the Apache license, I assume we would just need to include the Apache license header and indicate that we changed the code (to fit this library's constraints)? Although I'm an ASF member and familiar with Lucene and its related projects, I'm not an expert on licensing or on what copying the actual code requires.
(To introduce myself, I'm a member of the team that ported this library to .NET, and a member of the Lucene.NET PMC.)
cc @jamesmblair
Disregard my deleted comment. What I meant was, isn't this the same as src/main/java/info/debatty/java/stringsimilarity/NGram.java?
Looks like I misinterpreted the issue in the first place.
As @tdebatty mentions in the README:
The algorithm uses affixing with special character '\n' to increase the weight of first characters. The normalization is achieved by dividing the total similarity score by the original length of the longest word.
In the paper, Kondrak also defines a similarity measure, which is not implemented (yet).
This probably refers to the algorithms described on page 8 here.
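To make the discussion concrete, here is a rough sketch of what an n-gram similarity (as opposed to the distance in NGram.java) might look like. It is only my reading of the idea: a longest-common-subsequence style recurrence over n-grams, with positional matching, '\n' prefix affixing, and normalization by the length of the longer word, as the README describes for the distance. The class name `NGramSimilaritySketch` and its methods are hypothetical; this is not the library's code, not Lucene's code, and not verified against the paper's tables. One simplification to be aware of: matches between padding characters count as ordinary matches, so two unrelated words still get a small non-zero score.

```java
public final class NGramSimilaritySketch {

    // Hypothetical helper: prefix the word with n-1 copies of '\n' so that
    // a word of length L yields exactly L n-grams and first characters weigh more.
    private static String pad(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < n - 1; k++) {
            sb.append('\n');
        }
        return sb.append(s).toString();
    }

    // Positional n-gram match: fraction of positions at which the two
    // n-grams starting at offsets i and j agree (a value between 0 and 1).
    private static double positional(String a, int i, String b, int j, int n) {
        int same = 0;
        for (int k = 0; k < n; k++) {
            if (a.charAt(i + k) == b.charAt(j + k)) {
                same++;
            }
        }
        return same / (double) n;
    }

    // Assumed recurrence (LCS generalized to n-grams):
    //   S[i][j] = max(S[i-1][j], S[i][j-1], S[i-1][j-1] + match(i, j))
    // normalized by the length of the longer input word.
    public static double similarity(String a, String b, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        if (a.isEmpty() || b.isEmpty()) {
            return a.equals(b) ? 1.0 : 0.0;
        }
        String pa = pad(a, n);
        String pb = pad(b, n);
        int la = a.length();
        int lb = b.length();
        double[][] s = new double[la + 1][lb + 1];
        for (int i = 1; i <= la; i++) {
            for (int j = 1; j <= lb; j++) {
                double diag = s[i - 1][j - 1] + positional(pa, i - 1, pb, j - 1, n);
                s[i][j] = Math.max(diag, Math.max(s[i - 1][j], s[i][j - 1]));
            }
        }
        return s[la][lb] / Math.max(la, lb);
    }

    public static void main(String[] args) {
        System.out.println(similarity("night", "nacht", 2));
        System.out.println(similarity("abc", "abc", 2));
    }
}
```

Identical words score 1.0 under this sketch, and the result is symmetric in its arguments. Whether padding matches should be discounted is a design choice that a real implementation would need to settle against the paper.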