simstring icon indicating copy to clipboard operation
simstring copied to clipboard

Fix regression in dice::min_size

Open AndreyBozhko opened this issue 3 years ago • 5 comments

Justification

The SimString paper (link) defines the conditions for each similarity measure as follows:

image

The denominator in the min_size formula for Dice measure reads (2-alpha), not (2-|X|).

AndreyBozhko avatar May 16 '21 21:05 AndreyBozhko

This regression was introduced in 69a6a6fe0c25adc76e1c064e21276124a6923957.

AndreyBozhko avatar May 16 '21 21:05 AndreyBozhko

@chokkan @cynthia Please take a look.

AndreyBozhko avatar May 19 '21 23:05 AndreyBozhko

@AndreyBozhko do you have a test case that fails due to this? While it's faithful to the paper, the regression being around for almost 11 years could mean that certain code can depend on the bug.

cynthia avatar May 20 '21 00:05 cynthia

@cynthia Thank you for the feedback. From what I understand, this regression only affects the performance of the algorithm, but does not affect its correctness. Here's my reasoning.


Suppose, one wants to retrieve matches using the simstring algorithm such that formula (1), where formula is the Dice coefficient, X and Y represent the n-gram features of the query string and the candidate string, respectively, formula denotes the cardinality of a set, and formula is the chosen threshold (assuming formula).


Since it is always true that formula for any sets X and Y, then together with (1) this yields formula, or formula (2).

Relation (2) is a necessary condition to satisfy the original query threshold, meaning it is guaranteed upfront that if a string Y does not satisfy (2), then similarity between X and Y must be below the chosen threshold.


So how does the regression in question affect the algorithm?

One can show that the min_size value produced by the current implementation never exceeds the min_size value defined by (2), namely: formula,

and

formula.

In the expressions above, formula denotes the ceil function.

This means that during the overlapjoin step, the algorithm should in theory only check the strings whose number of features is between min_size and max_size (these are the min and max values from the paper). But because of the regression, the algorithm additionally checks the strings whose number of features is between 1 and min_size (and which are guaranteed to not match X by virtue of condition (2)).

Since the condition for the number formula of required matching features between the strings X and Y remains unchanged, the algorithm is still guaranteed to only return valid matches.


To summarize, the error in dice::min_size makes the algorithm increase its search space and additionally check strings that are guaranteed to not be a match. While doing so is not wrong, and in the end the algorithm still produces the correct results, - it's better when the algorithm doesn't spend time doing needless checks.

With that, I believe the fix does not introduce any user-facing changes - the code would still work as expected, just a little more efficient.

AndreyBozhko avatar May 20 '21 06:05 AndreyBozhko

And here's the simple test that illustrates the potential problem - https://github.com/chokkan/simstring/compare/master...AndreyBozhko:failing-test.

The test checks that the returned value of dice::min_size (although incorrect) at least makes sense - namely, that it will always be not greater than 1, in which case the simstring algorithm would still behave correctly overall.

Below are the outcome of running the test:

  • using g++ compiler
$ g++ --version
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

// with no optimizations
$ g++ sample/tests.cpp -o executable -std=c++17 -I./include -O0 && ./executable

// with sanitizer and no optimizations
$ g++ sample/tests.cpp -o executable -std=c++17 -I./include -O0 -fsanitize=float-divide-by-zero,float-cast-overflow && ./executable
include/simstring/measure.h:69:45: runtime error: division by zero
include/simstring/measure.h:69:30: runtime error: inf is outside the range of representable values of type 'int'

// with optimizations
$ g++ sample/tests.cpp -o executable -std=c++17 -I./include -O3 && ./executable

$
  • using clang compiler
$ clang++ --version
clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

// with no optimizations
$ clang++ sample/tests.cpp -o executable -std=c++17 -I./include -O0 && ./executable

// with sanitizer and no optimizations
$ clang++ sample/tests.cpp -o executable -std=c++17 -I./include -O0 -fsanitize=float-divide-by-zero,float-cast-overflow && ./executable
include/simstring/measure.h:69:45: runtime error: division by zero
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior include/simstring/measure.h:69:45 in 
include/simstring/measure.h:69:21: runtime error: inf is outside the range of representable values of type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior include/simstring/measure.h:69:21 in 

// with optimizations
$ clang++ sample/tests.cpp -o executable -std=c++17 -I./include -O3 && ./executable
Assertion failed in sample/tests.cpp, line 14: 8250 not equal to 1

$

Note that the regression in dice::min_size leads to undefined behavior when the parameter qsize=2. While in practice, the compiler may handle such behavior in a way that makes sense for the simstring algorithm (e.g., produce infinity after dividing positive float by zero, and the return INT_MIN as a result of casting infinity as int), the behavior is still undefined, and no particular outcome is guaranteed.

As a result, this may lead to some sneaky production bugs like in the test above that used clang, where the test passes with -O0 optimization, but fails with -O3 optimization (since dice::min_size produced a value that is clearly wrong).

AndreyBozhko avatar May 23 '21 05:05 AndreyBozhko