
TEXT-126: Adding Sorensen-Dice similarity algorithm

Open ameyjadiye opened this issue 6 years ago • 29 comments

This pull request adds the Sorensen-Dice Similarity Algorithm to Apache Commons Text.

The Sørensen–Dice coefficient is a statistic used for comparing the similarity of two samples. It was developed independently by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively.

ameyjadiye avatar Feb 26 '19 18:02 ameyjadiye

Coverage Status

Coverage increased (+0.009%) to 97.926% when pulling 8f0a97cfa6f61a1aaa0f55ffdf4ff2ca48ec9a04 on ameyjadiye:master into a6f2b44e17a8fa11066ec3519411adf5acc2abaa on apache:master.

coveralls avatar Feb 26 '19 18:02 coveralls

Hi @ameyjadiye ! A new metric! 🎉 Added a few comments, but only spent a couple of minutes looking at the code. I will take a better look later, and read about this similarity (never heard about it).

Thanks !!!

kinow avatar Feb 26 '19 20:02 kinow

Hi @ameyjadiye. Thanks for the contribution. I've had a look at this similarity as it is essentially a binary scoring metric:

Sørensen–Dice coefficient

Thus you must compute the matches (true positives) between the two sets X and Y (the intersection) and then include a penalty for items in either set that are not matched (false positives and false negatives).

(2 * TP) / (2 * TP + FP + FN)

Or simply:

2 * |X ∩ Y| / (|X| + |Y|)

with |N| the cardinality, or count.

The score is related to the Jaccard score and can be computed directly from it (S = 2J / (1 + J)). However the JaccardSimilarity class in the same package uses single characters from each CharSequence and not character pairs (bigrams). So this algorithm using pairs is new.
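
As a quick check of that mapping: with i = |X ∩ Y| and u = |X ∪ Y|, J = i / u and |X| + |Y| = i + u, so S = 2i / (i + u) = (2i / u) / (1 + i / u) = 2J / (1 + J).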

The way you have defined the score is to take each unique character pair (bigram) as the set from each string. Thus:

sim.apply("aa", "aa")      = [aa] vs [aa] = 1.0
sim.apply("aaa", "aa")     = [aa] vs [aa] = 1.0
sim.apply("aaaa", "aa")    = [aa] vs [aa] = 1.0
sim.apply("aaaa", "aaa")   = [aa] vs [aa] = 1.0

The original article you referenced includes duplicate observations of each unique character pair (bigram). Thus:

sim.apply("aa", "aa")      = [aa]         vs [aa]     = (2 * 1) / (1 + 1) = 1.0
sim.apply("aaa", "aa")     = [aa][aa]     vs [aa]     = (2 * 1) / (2 + 1) = 0.667
sim.apply("aaaa", "aa")    = [aa][aa][aa] vs [aa]     = (2 * 1) / (3 + 1) = 0.5
sim.apply("aaaa", "aaa")   = [aa][aa][aa] vs [aa][aa] = (2 * 2) / (3 + 2) = 0.8

This requires a different algorithm without using Set. The author provides one which could be used for testing although it could be improved for efficiency.
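
For illustration, here is a minimal sketch of a duplicate-counting (multiset) variant that reproduces the values in the table above. The method name and structure are hypothetical, not part of the PR:

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: Sorensen-Dice score counting duplicate bigrams (multiset
 * intersection) rather than unique bigrams. If both inputs are shorter
 * than two characters the score is 0 / 0, left here as NaN.
 */
private static double duplicateBigramDice(CharSequence cs1, CharSequence cs2) {
    // Count every bigram occurrence in the first sequence.
    final Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i < cs1.length() - 1; i++) {
        counts.merge(cs1.subSequence(i, i + 2).toString(), 1, Integer::sum);
    }
    // Each bigram in the second sequence matches at most as many times
    // as it occurred in the first (multiset intersection).
    int intersection = 0;
    for (int i = 0; i < cs2.length() - 1; i++) {
        final String bigram = cs2.subSequence(i, i + 2).toString();
        final Integer count = counts.get(bigram);
        if (count != null && count > 0) {
            counts.put(bigram, count - 1);
            intersection++;
        }
    }
    final int nx = Math.max(cs1.length() - 1, 0);
    final int ny = Math.max(cs2.length() - 1, 0);
    // 2 * |X ∩ Y| / (|X| + |Y|) with multiset cardinalities.
    return 2.0 * intersection / (nx + ny);
}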

Q. Do you know if there is a preference in the field for parsing unique bigrams or including duplicates? Either way it should be clear in the Javadoc what the intention is.

I also note that splitting the CharSequence into bigrams can be done efficiently using a 32-bit integer to store each pair of 16-bit chars:

/**
 * Creates the bigrams packed as two 16-bit chars into a 32-bit int. This does not
 * support extended character sets.
 *
 * @param cs the char sequence
 * @return the bigrams
 */
private int[] createBigrams(CharSequence cs) {
    if (cs.length() < 2) {
        return new int[0];
    }
    final int[] bigrams = new int[cs.length() - 1];
    int ch2 = cs.charAt(0);
    for (int i = 0; i < bigrams.length; i++) {
        final int ch1 = ch2;
        ch2 = cs.charAt(i + 1);
        bigrams[i] = (ch1 << 16) | ch2;
    }
    return bigrams;
}

You can then do what you want with these including writing an algorithm that sorts the bigrams for efficient comparison, or use them in your current algorithm with Set<Integer>.
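
For example, a sketch of that sorted-array approach (counting duplicate matches) built on createBigrams; the method name is hypothetical and the inputs are sorted in place:

/**
 * Sketch: counts matching bigrams, including duplicates, by sorting both
 * packed bigram arrays and walking them in tandem.
 */
private static int countMatches(int[] bigrams1, int[] bigrams2) {
    java.util.Arrays.sort(bigrams1);
    java.util.Arrays.sort(bigrams2);
    int matches = 0;
    int i = 0;
    int j = 0;
    while (i < bigrams1.length && j < bigrams2.length) {
        if (bigrams1[i] == bigrams2[j]) {
            matches++;
            i++;
            j++;
        } else if (bigrams1[i] < bigrams2[j]) {
            i++;
        } else {
            j++;
        }
    }
    return matches;
}

The duplicate-counting score is then 2.0 * matches / (bigrams1.length + bigrams2.length).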

Also be careful of overflow in the result computation. I acknowledge that with the Set method you would have to input two CharSequences containing close to all 2^16 * 2^16 possible bigrams to get two set sizes above half Integer.MAX_VALUE. With the duplicates method you just have to input two CharSequences with a combined length above Integer.MAX_VALUE and then adding the sizes will overflow the int. For example if you were to compare two books!
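
A sketch of one way to guard against that: promote one operand so the addition is done in long arithmetic before the floating-point division:

/** Sketch: overflow-safe score from the intersection count and the two sizes. */
private static double score(int intersection, int nx, int ny) {
    // (long) nx + ny cannot overflow, even for sizes near Integer.MAX_VALUE.
    return 2.0 * intersection / ((long) nx + ny);
}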

aherbert avatar Feb 27 '19 11:02 aherbert

Hi @kinow, it looks like all the suggestions have been accommodated, with readable code and reasonable performance. The code also mimics the algorithm given on Wikipedia, and the test cases pass as well. Do you want me to improve anything more here?

ameyjadiye avatar Mar 06 '19 15:03 ameyjadiye

Hi @ameyjadiye,

The code is now a good working implementation. However it should be noted that the Sorensen-Dice similarity is a binary scoring metric. It can be applied to anything where you have two sets to be matched. It is closely related to the Jaccard score.

This PR brings in a new algorithm to compute matches using pairs of characters (bigrams). This conflicts with the existing similarity measures in the package that implement SimilarityScore<R>:

  • JaccardSimilarity - uses Set<Character> for matching single characters to produce Double
  • JaroWinklerSimilarity - uses single character matching to produce an edit distance which is then converted to a similarity Double
  • LongestCommonSubsequence - uses single character matching for sub-sequences to produce Integer

The rest of the package is for an extension of SimilarityScore<R> which is EditDistance<R>:

  • CosineDistance - uses a RegexTokenizer on whitespace to match words to produce Double
  • HammingDistance - uses single character matching to produce Integer
  • JaccardDistance - just inverts the JaccardSimilarity to produce Double
  • JaroWinklerDistance - the complement of the Jaro-Winkler similarity to produce Double
  • LevenshteinDetailedDistance - single character changes to produce LevenshteinResults
  • LevenshteinDistance - single character changes to produce Integer
  • LongestCommonSubsequenceDistance - single character based sub-sequences to produce Integer

So all the other measures use single characters or words (CosineDistance).

If you want a Sorensen-Dice score using single characters then just use the JaccardSimilarity and map it: S = 2J / (1 + J)

This class is effectively a JaccardDistance with bigrams but with a mapped output score.
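
For the single-character case, a minimal sketch of that mapping using the existing JaccardSimilarity; the wrapper method name is hypothetical:

import org.apache.commons.text.similarity.JaccardSimilarity;

/** Sketch: single-character Sorensen-Dice obtained from the Jaccard score. */
private static double singleCharSorensenDice(CharSequence cs1, CharSequence cs2) {
    final double j = new JaccardSimilarity().apply(cs1, cs2);
    // S = 2J / (1 + J); note this returns 0 for two empty sequences
    // because JaccardSimilarity currently scores them as 0.
    return 2 * j / (1 + j);
}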

One suggestion would be to rename this to something like BigramJaccardSimilarity and compute the Jaccard using bigrams. Then add new classes for SorensenDiceSimilarity and BiSorensenDiceSimilarity.

However the current JaccardDistance uses Set<Character> which has space and efficiency advantages over Set<String>. Once you are using strings then there is no reason to have only bigrams.

My preference would be to add a new class TokenizerJaccardSimilarity (or something nicer) that accepts a Tokenizer<CharSequence> in the constructor. This then tokenises the input into two sets and computes the union of the two sets.

Your bigram case can be fulfilled by writing a new Tokenizer<CharSequence> that returns the bigrams. This new class would be flexible for computing the Jaccard for words, bigrams, trigrams, n-grams, or whatever else you can tokenise a CharSequence into.

Then you add a TokenSorensonDiceSimilarity that just maps the Jaccard score. You can see this for example in the JaccardDistance class.
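
A rough sketch of what the bigram tokenizer could look like, assuming the package-private Tokenizer<CharSequence> contract used by RegexTokenizer (so it would have to live in the same package); the class name is hypothetical:

/** Sketch: splits a CharSequence into its bigrams, keeping duplicates. */
class BigramTokenizer implements Tokenizer<CharSequence> {
    @Override
    public CharSequence[] tokenize(final CharSequence text) {
        if (text == null || text.length() < 2) {
            return new CharSequence[0];
        }
        final CharSequence[] bigrams = new CharSequence[text.length() - 1];
        for (int i = 0; i < bigrams.length; i++) {
            bigrams[i] = text.subSequence(i, i + 2);
        }
        return bigrams;
    }
}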

@kinow Opinions on a more flexible approach?

aherbert avatar Mar 06 '19 22:03 aherbert

I've had more of a think and have mocked an implementation for a generic IntersectionSimilarity.

See TEXT-155 and the code in the linked PR.

This class can be used to compute the Sorensen-Dice similarity for bigrams or any other type of structure that you would like to analyse from the CharSequence, e.g. characters, bigrams, trigrams, n-grams, words, etc.

Note: Sorensen-Dice similarity is the same as the F1-score

aherbert avatar Mar 07 '19 13:03 aherbert

Been busy with a few other bugs these past two days, but hopefully I will be able to review this and TEXT-155 this weekend 🎉 thanks a lot @aherbert ! @ameyjadiye sorry for the delay, just a bit longer and we should be done with the review for your contribution (:

kinow avatar Mar 07 '19 21:03 kinow

I have just noticed that the JaccardSimilarity computes 0 for the similarity of zero-length input sequences. This class computes 1, so it should be changed.

aherbert avatar Mar 07 '19 23:03 aherbert

@ameyjadiye see last comment from @aherbert about empty strings and 0 vs. 1.

@aherbert while we are discussing #109, do you think that is a blocker for this pull request? So far I think at least the API proposed here would be kept, right?

If so, this could be merged once the last comment is resolved, and then we can discuss how to organize the classes and where the sorensen-dice coefficient is calculated.

I think the only thing missing is deciding on the names of the classes: whether it should use Bigram in the name or be just SorensenDiceSimilarity.

I like the idea of having a descriptive name such as BigramSorensenDiceSimilarity (or Bigram in another place/order). However, I think we should also consider what users would expect, i.e. in other libraries, is the Sorensen-Dice similarity always computed with bigrams? If other implementations (Python/JS/Java) use bigrams, then we could leave it as SorensenDiceSimilarity and either add another method/constructor/etc. to customize the similarity, or else have another class...

What do you think? (@ameyjadiye if you have any suggestion, feel free to chime in too :+1: [or any other person reading this :slightly_smiling_face: ])

kinow avatar Mar 09 '19 04:03 kinow

@kinow @aherbert I have written the code keeping Wikipedia as the standard and researched some other libraries from other languages for reference; all of them use bigrams for calculating the similarity. I personally think that if we go further and use trigrams, 4-grams ... n-grams the resulting percentage would be incorrect. We can't use a character-by-character match, i.e. unigrams, as that would also degrade the result since ab != ba.

With n-grams, are we tampering with the existing proven algorithm? If it gives better results than the existing algorithm I'm OK with that, but does someone really need it for real-world examples?

Wikipedia says

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y using bigrams as follows:[9]

s = 2·nt / (nx + ny), where nt is the number of character bigrams found in both strings, nx is the number of bigrams in string x and ny is the number of bigrams in string y. For example, to calculate the similarity between:

night
nacht

We would find the set of bigrams in each word:

{ni,ig,gh,ht}
{na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula, we calculate, s = (2 · 1) / (4 + 4) = 0.25.

Not sure, but are we over-engineering similarities with #109? Let me know if there is a practical use for n-grams in the real world; I would like to study it more.

ameyjadiye avatar Mar 09 '19 06:03 ameyjadiye

@aherbert @kinow, over and above the existing algorithm, I liked the idea of changing the code as in #109, so the results would be "aa" vs "aaaaaaaa" == 2 / 8 = 0.25.

You can change the existing class later, or do you want me to change the code here to pass this case?

ameyjadiye avatar Mar 09 '19 06:03 ameyjadiye

I'll try and summarise where we are:

  1. It is not clear what the result should be of 0 / 0

This is the "" vs "" case.

Possibles:

0 / 0 = n / n = 1
0 / 0 = n / 0 = NaN
0 / 0 = 0 / n = 0

The library I found (python distance library) dealt with this by returning NaN. Others have been found that return 1 (because they are the same, even if they are the same of nothing).

The current JaccardSimilarity has decided to return 0.

The new SorensenDiceSimilarity returns 1.

So this discrepancy in our own library should be resolved.

  2. Characters or bigrams

How do you split up a CharSequence to build a Set? There are so many ways. Or do you build a List?

There are library examples for splitting a string into a set of characters (the python distance library) and for using bigrams. I note that @kinow's example, the python textdistance library, supports using a Set or a List, and supports variable-length n-grams:

>>> textdistance.Sorensen(qval=1, as_set=1)("aaaba","aab")
1.0
>>> textdistance.Sorensen(qval=1, as_set=0)("aaaba","aab")
0.75
>>> textdistance.Sorensen(qval=2, as_set=1)("aaaba","aab")
0.8
>>> textdistance.Sorensen(qval=2, as_set=0)("aaaba","aab")
0.6666666666666666

This problem is solved by #109 which allows the user to define how to split the sequence with a function.

The current JaccardSimilarity has decided to split into single characters.

The new SorensenDiceSimilarity splits into bigrams.

Both use sets. So this discrepancy should be resolved.

I think the solution is to support an n-gram size argument for the measure with default of 1. Then support a useSet flag with default of true. This is all possible using #109 to do the intersect computation. The key is to not break compatibility for anyone already using the JaccardSimilarity.

  3. How to score a CharSequence using bigrams with only 1 character
"a" vs "a"  = 0   or   1?
"a" vs "b"  = 0   or   NaN?

This is similar to point 1 where you are actually scoring 0/0 again since you have no bigrams.

  4. Handling null

Currently this just throws an exception. However null is similar to null.

So if the library decides not to support null because it does not exist, where does it stand on a zero-length CharSequence? This also does not exist, but it does have an established meaning in the context of strings.

Currently the stance in the Jaccard is that it is a programming error to pass around null and expect comparisons, but it is perfectly reasonable to pass around empty sequences.

Discussion

To consolidate points 1, 3 and 4, perhaps we can change the library to state that:

  • Similarity is undefined for null and throw an exception
  • Similarity between two empty sequences is either: (a) exception; (b) perfect; or (c) nothing
  • Similarity between equal sequences that do not satisfy the criteria for the comparison algorithm is either: (a) perfect; or (b) nothing

Then support n-gram size and both the Set or List approach to defining the two collections of n-grams to compare.
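
A hypothetical shape for that API; the class name, parameters and defaults are illustrative only, and the scoring body is omitted:

import org.apache.commons.text.similarity.SimilarityScore;

/** Sketch of the API shape only, not the existing class. */
public class NGramJaccardSimilarity implements SimilarityScore<Double> {

    /** Length of the n-grams to compare: 1 = characters, 2 = bigrams, ... */
    private final int ngramSize;
    /** True for unique n-grams (Set), false to count duplicates (List). */
    private final boolean useSet;

    /** Default mirrors the current JaccardSimilarity: unique single characters. */
    public NGramJaccardSimilarity() {
        this(1, true);
    }

    public NGramJaccardSimilarity(final int ngramSize, final boolean useSet) {
        this.ngramSize = ngramSize;
        this.useSet = useSet;
    }

    @Override
    public Double apply(final CharSequence left, final CharSequence right) {
        // Split each input into n-grams of ngramSize, keep or drop duplicates
        // based on useSet, then return |intersection| / |union|.
        throw new UnsupportedOperationException("sketch only");
    }
}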

This involves some decisions.

My vote is:

  • Add n-gram and useSet parameters to the JaccardSimilarity
  • Add n-gram and useSet parameters to the SorensenDiceSimilarity
  • Make similarity 1 if the sequence does not satisfy the criteria for the algorithm but is identical. This is less destructive than throwing an exception or returning NaN

As long as the Javadoc explains the chosen functionality then I would be happy with any of the options. But my preference would be for equal inputs to be scored as similar:

"" vs "" == 1 (using a character comparison algorithm)
"a" vs "a" == 1 (using a bigram algorithm)

This is in-line with other similarity measures in the library which state equal sequences are perfectly similar.

Implementing a more generic approach can be done using methods in #109. So perhaps hold this PR until that is resolved.

Alex

aherbert avatar Mar 09 '19 13:03 aherbert

I'm okay with either of those :) — I had just overlooked #109; it looks nice @aherbert. Let me know when you merge it into trunk and I will make the changes here as well. Thanks.

ameyjadiye avatar Mar 09 '19 15:03 ameyjadiye

Good summary @aherbert , thanks a lot for that.

My brain is still digesting it, and I won't risk commenting here on Sunday ~evening. Will sleep on it, and probably work on a few other things next week (might be a bit busier at $work too).

Let's see if we can find one or two possible alternatives, and if we can't find a solution that looks really good, we can take it to the dev mailing list to see what others think too.

@ameyjadiye thanks for your patience so far. The work on #109 is really nice indeed! I was going to merge it now but there are two pending checkstyle issues. Once fixed, I believe they will be merged. Then if you could update this branch with that code, that would make three of us who have looked over the API (so some good code review). Any feedback welcome, as we have time to fix anything before 1.7.

Cheers Bruno

kinow avatar Mar 10 '19 04:03 kinow

@kinow @aherbert I have changed my code and it looks good to me now. Please review.

ameyjadiye avatar Mar 10 '19 08:03 ameyjadiye

Will have time throughout this week to review with more calm (Sunday evening here now). But I suspect it would be easier to wait for the other pull request to be merged first, then simply rebase your code and update it?

kinow avatar Mar 10 '19 08:03 kinow

@kinow @aherbert, this looks good to me now, can you please take a look?

ameyjadiye avatar Mar 18 '19 17:03 ameyjadiye

Overall this class now works nicely. But I think that my initial reservations have not been addressed.

There is no reason that this should use bigrams, or allow duplicate bigrams.

If this goes through then you end up with a library that can compute the Jaccard with unique single characters (by using a Set) and the Sorensen with bigrams including duplicates (by using a List converted to a Bag). Since the two scores are related by the equations:

J = S / (2 - S)
S = 2J / (1 + J)

It really does not make sense to compute one with a different method from the other.

I would actually state we have:

  • J1U and S2D

Where the number is the number of characters (n-gram), U is for unique (non-duplicates), and D is for duplicates.

We actually have possibilities for the following in the library:

  • J1U
  • J1D
  • J2U
  • J2D
  • S1U
  • S1D
  • S2U
  • S2D

Or generically: [JS]N[UD]

Putting in an S2D metric to complement a J1U metric makes a confusing and inconsistent library.

BTW. I believe bigrams is better and duplicates is better. So this is the best algorithm. It just totally contradicts the current Jaccard implementation.

A cleaner design is to have both metrics support single or bigrams (or even n-grams) and also allow or prevent duplicates via constructor arguments.

The current Jaccard can be modified to do so by allowing the default constructor to default to single, non-duplicate mode and overloaded constructors added to it.

This idea may be best achieved by pushing shared functionality into a NGramIntersectionResult class that accepts the size of the n-gram and the duplicates option and computes the IntersectionResult. This can internally use an IntersectionSimilarity class to do the computation.

class NGramIntersectionResult {

    NGramIntersectionResult(int ngram, boolean duplicates) {
         // ...
    }

    IntersectionResult compute(CharSequence cs1, CharSequence cs2) {
         // Use IntersectionSimilarity to compute the IntersectionResult:
         // Pass to the IntersectionSimilarity the function to split the CharSequence into 
         // the appropriate collection (Set or List) using n-grams
         //
         // All the handling of the generic type (Character, Integer, String) used for the n-gram
         // should be done in here and hidden from the API.
    }
}
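
A rough sketch of the splitting function and the Dice mapping, assuming the IntersectionSimilarity / IntersectionResult API from #109; the class and method names are hypothetical:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.commons.text.similarity.IntersectionResult;
import org.apache.commons.text.similarity.IntersectionSimilarity;

/** Sketch: duplicate-preserving bigram converter plus the Dice mapping. */
class BigramDiceSketch {

    /** Splits the sequence into a List of bigrams, keeping duplicates. */
    private static Collection<String> toBigrams(final CharSequence cs) {
        final List<String> bigrams = new ArrayList<>();
        for (int i = 0; i < cs.length() - 1; i++) {
            bigrams.add(cs.subSequence(i, i + 2).toString());
        }
        return bigrams;
    }

    static double dice(final CharSequence cs1, final CharSequence cs2) {
        final IntersectionResult result =
                new IntersectionSimilarity<>(BigramDiceSketch::toBigrams).apply(cs1, cs2);
        // 2 * |A ∩ B| / (|A| + |B|), with the addition promoted to long.
        return 2.0 * result.getIntersection()
                / ((long) result.getSizeA() + result.getSizeB());
    }
}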

This can be made more generic if necessary. But this then leads to the library that @kinow identified (Simetrics) which can support tokenisation of input, case handling, splitting to n-grams, handling duplicates, extended unicode character handling, etc.

This level of generalisation is out of scope for the current similarity package. But n-grams and duplicates should be in scope because this is the difference between this algorithm for S2D and the current library J1U algorithm.

@ameyjadiye Would you like to add support for n-grams and unique/duplicate matching to this or attempt to add a package scope class that can be shared with the JaccardSimilarity?

aherbert avatar Mar 24 '19 15:03 aherbert

@aherbert sounds like a good idea; let me create another PR for that and I will come back to this.

ameyjadiye avatar Mar 25 '19 10:03 ameyjadiye

@aherbert apologies, I was busy with a lot of other stuff and will have to go through this again. This PR is a real mess.

ameyjadiye avatar Sep 28 '19 18:09 ameyjadiye

@ameyjadiye whenever you have time to update this PR, I will have another look and try to review/merge it :+1: thanks!

kinow avatar Jun 02 '20 04:06 kinow

Hey, thanks for the reminder @kinow, I will finish this ASAP. This PR became a real mess though; squashing will be really hard. I will try to incorporate @aherbert's suggestions here and push the changes.

ameyjadiye avatar Jun 02 '20 07:06 ameyjadiye

Brilliant @ameyjadiye , but if you are busy having fun with commons-graph, we can leave this one for later. Whenever you have time :+1:

kinow avatar Jun 02 '20 08:06 kinow

Hi All. Where are we here?

garydgregory avatar Mar 31 '22 11:03 garydgregory

@garydgregory - A long-pending issue; its development got lost in the discussions. I'll go through it once again and try to finish it. Thanks for the reminder. :)

ameyjadiye avatar Mar 31 '22 11:03 ameyjadiye