js-levenshtein icon indicating copy to clipboard operation
js-levenshtein copied to clipboard

Early stop when maximum distance reached [feature request]

Open pimente opened this issue 6 years ago • 8 comments

Hello,

Thank you for this implementation. It is indeed very fast!

I am trying to optimize the algorithm for the specific (but rather common) case where we look for only close items. Stopping early when we know that the distance will certainly be over a given maximum distance can make the algorithm much faster.

I get already a good speed improvement (x2) by stopping early at the very beginning here and/or here with something like:

if (typeof max === 'number' && lb - la >= max) return max;

But when I try to stop early in the for loops below, by keeping the minimum value of vector and stopping if min + lb - la > max, it works but gets slower ;(

Do you have any idea of how to implement this feature in an optimized way? The best would be of course to support this option directly in your code ;) but any advice is welcome!

Thank you very much for your help.

Cheers

pimente avatar Jul 25 '17 09:07 pimente

Hi,

Sounds like a good feature, I have been thinking about this too :)

It will most likely have a negative impact on the general performance, but if we can find a solution that is not too much slower I think we can include this in the default function. Otherwise we can make a separate function depending on if a max argument is provided.

I will look into this as soon as I can find some time.

Do you have some sample input I can use for testing/benchmarking?

gustf avatar Jul 26 '17 17:07 gustf

Great that you like it!

Just to be clear, the formal definition of the option I had in mind is: levenshtein(a, b, max) = min(max, levenshtein(a, b))

Here are some unit tests in your current format (for testing but not benchmarking):

test(t =>
     {
       t.is(levenshtein('a', 'b'), 1);
       t.is(levenshtein('ab', 'ac'), 1);
       t.is(levenshtein('ac', 'bc'), 1);
       t.is(levenshtein('abc', 'axc'), 1);
       t.is(levenshtein('kitten', 'sitting'), 3);
       t.is(levenshtein('xabxcdxxefxgx', '1ab2cd34ef5g6'), 6);
       t.is(levenshtein('cat', 'cow'), 2);
       t.is(levenshtein('xabxcdxxefxgx', 'abcdefg'), 6);
       t.is(levenshtein('javawasneat', 'scalaisgreat'), 7);
       t.is(levenshtein('example', 'samples'), 3);
       t.is(levenshtein('sturgeon', 'urgently'), 6);
       t.is(levenshtein('levenshtein', 'frankenstein'), 6);
       t.is(levenshtein('distance', 'difference'), 5);
       t.is(levenshtein('因為我是中國人所以我會說中文', '因為我是英國人所以我會說英文'), 2);
       // Tests with max option
        // max === 1
       t.is(levenshtein('因為我是中國人所以我會說中文', '因為我是英國人所以我會說英文', 1), 1);
       t.is(levenshtein('distance', 'difference', 1), 1);
       t.is(levenshtein('distance', 'distance', 1), 0);
       t.is(levenshtein('zéphïr', 'zéphïr', 1), 0);
       t.is(levenshtein('zéphir', 'zéphïr', 1), 1);
        // max === 2
       t.is(levenshtein('kitten', 'sitting', 2), 2);
       t.is(levenshtein('xabxcdxxefxgx', '1ab2cd34ef5g6', 2), 2);
       t.is(levenshtein('cat', 'cow', 2), 2);
       t.is(levenshtein('xabxcdxxefxgx', 'abcdefg', 2), 2);
       t.is(levenshtein('javawasneat', 'scalaisgreat', 2), 2);
       t.is(levenshtein('example', 'samples', 2), 2);
       t.is(levenshtein('sturgeon', 'urgently', 2), 2);
       t.is(levenshtein('ubiquitous', 'universal', 2), 2);
       t.is(levenshtein('abcdefghij', 'axbcdefghij', 2), 1);
       t.is(levenshtein('abcdefghij', 'abcdefghij', 2), 0);
        // max === 3
       t.is(levenshtein('abcdefghij', 'abcdefghij', 3), 0);
       t.is(levenshtein('abcdefghij', 'axbcdefghij', 3), 1);
       t.is(levenshtein('abcdefghij', 'acefghij', 3), 2);
       t.is(levenshtein('abcdefghij', 'acefhj', 3), 3);
       t.is(levenshtein('abcdefghij', 'accdffghii', 3), 3);
       t.is(levenshtein('abcdefghij', 'accdffhhii', 3), 3);
});

Best,

pimente avatar Jul 27 '17 12:07 pimente

Ok, thanks for the example input. I see your use case is for smaller strings.

I'm not sure that the formal definition would be the best for a library function as it will make it impossible for someone who would like to know if a string is more than max.

I think it would be more useful for more cases to return Infinity or something like that instead. You can always wrap the call in a function which returns max if the call returns Infinity.

gustf avatar Jul 28 '17 18:07 gustf

Yes, I agree with your suggestion. It will cover more use cases!

pimente avatar Jul 29 '17 17:07 pimente

Would love (a version with) this. Still, looks like I have a new default distance calc for my fuzzball package either way! :D

(changing the 1's on lines 12 and 99 to 2's will correctly set the substitution cost to 2 ya?)

nol13 avatar Aug 03 '17 02:08 nol13

@nol13 Sorry for late reply, vacation time here in sweden.

Nice to hear you use this lib in your fuzzball package! And you are right about the changes, that will do the trick 👍

gustf avatar Aug 14 '17 20:08 gustf

I would love such feature too. I'm doing word-by-word comparison, with a tolerance of 1 distance, therefore it is pretty ridiculous to continue until distance e.g. 8. please please please please? :pray:

OoDeLally avatar Nov 24 '17 15:11 OoDeLally

Hey all, I needed this feature so I went ahead and tried an implementation. It might not be 100% optimized, but if you're not using the max argument, then it shouldn't impact the existing performance of the algorithm. If it's not good enough feel free to close though. :P

fonograph avatar Jan 22 '19 15:01 fonograph