crystal icon indicating copy to clipboard operation
crystal copied to clipboard

Implement Myers algorithm for Levenshtein distance calculation

Open darkstego opened this issue 3 years ago • 34 comments

This implements a faster algorithm for calculating Levenshtein distance. Fixes #11335.

The algorithm is 3x faster than the original one for short strings and 10x for long ones.

darkstego avatar Oct 26 '21 23:10 darkstego

Could you please run crystal tool format?

BTW, this looks incredible!

caspiano avatar Oct 26 '21 23:10 caspiano

I got these results against the benchmark from #8324 Note: the benchmark does not test the algorithm against longer strings.

gist here.

old 404.67  (  2.47ms) (± 3.54%)   195kB/op   1.72× slower
new 696.13  (  1.44ms) (± 6.50%)  1.72MB/op        fastest

caspiano avatar Oct 27 '21 00:10 caspiano

The implementation could be faster still with some changes. A separate ASCII method makes this submitted implementation 1.5x slower. But that isn't a very DRY approach and contemplated back and forth about implementing it. The only difference being the character bit-vector store is not conditionally defined and no case statement to clear it.

The real value though is for long strings because this algorithm is O([m/w]n) whereas the existing one is O(mn). I am sure there are further optimizations that can be done. w is word size and set to 32, but should be 64 on 64 bit systems, that should make it even faster on long strings.

darkstego avatar Oct 27 '21 00:10 darkstego

I've tried benchmarking a new implementation using this code https://github.com/crystal-lang/crystal/issues/11335#issuecomment-947410609

  • MacOS Catalina 10.15.7
  • 2.3 GHz Quad-Core Intel Core i7
  • NodeJS 16.13.0
  • Crystal 1.2.1 (LLVM 11.1.0)

Now it's just a tiny bit slower than presumably the fastest Levenshtein distance algorithm for JavaScript from fastest-levenshtein npm package:

Crystal 1.2.1

% hyperfine ./lev_myers
Benchmark #1: ./lev_builtin_myers
  Time (mean ± σ):      7.616 s ±  0.146 s    [User: 7.585 s, System: 0.031 s]
  Range (min … max):    7.355 s …  7.805 s    10 runs

NodeJS 16.13.0

% hyperfine 'node lev_node.js'
Benchmark #1: node lev_node.js
  Time (mean ± σ):      7.194 s ±  0.066 s    [User: 7.053 s, System: 0.189 s]
  Range (min … max):    7.093 s …  7.331 s    10 runs

vlazar avatar Oct 27 '21 09:10 vlazar

So I decided to really optimize for performance. This latest commit should get 1.8x speedup over my old commit in long ascii texts.

One potential avenue for further speed up is finding out is making an Int64 version of this that would be compiled on 64bit systems.

darkstego avatar Oct 27 '21 13:10 darkstego

Sorry about the constant formatting fixes. I don't know why my pre-commit hooks aren't working.

darkstego avatar Oct 27 '21 13:10 darkstego

So I finally have what I feel is a complete implementation. I implemented a 64 bit version of the algorithm when compiled on 64 bit systems. I implemented the ASCII and Unicode versions through Macros mainly because I kept forgetting to implement changes on both codebases. They diverge in a few spots that are commented mainly to do with the dictionary implementation and are commented, so it isn't bad at all.

The performance is really good now. Currently clocking in at 30x over the original at string length of 1500 on 64 bit systems

Default 210.39  (  4.75ms) (± 0.49%)  5.83kB/op  30.73× slower
Orig PR   1.86k (538.91µs) (± 3.07%)  13.9kB/op   3.48× slower
Latest   6.46k (154.69µs) (± 1.57%)  23.3kB/op        fastest

Now at Over 3x speed from the original PR.

I did run a test to compare against the Node JS implementation and it was over twice as fast on my System. I was using the method described in this #11335 comment.

  • Linux 5.14.7-2-MANJARO
  • AMD Ryzen Threadripper 2950X 16-Core Processor
  • Node 16.11.0

Node JS

real    0m4.905s
user    0m4.868s
sys     0m0.020s

Crystal

real    0m1.993s
user    0m2.011s
sys     0m0.071s

darkstego avatar Oct 27 '21 23:10 darkstego

🚀 Wow, with the latest changes this implementation is now 2.4x faster than the NodeJS on this benchmark for me https://github.com/crystal-lang/crystal/issues/11335#issuecomment-947410609

Great job @darkstego !

vlazar avatar Oct 28 '21 06:10 vlazar

One thing to note is memory consumption has grown in new implementation.

I've run the benchmark already mentioned here https://github.com/crystal-lang/crystal/pull/11370#issuecomment-952428372

My numbers:

old 305.94  (  3.27ms) (± 3.08%)   195kB/op   1.59× slower
new 486.96  (  2.05ms) (± 1.34%)  3.24MB/op        fastest

It uses a case with 100000 characters long strings though. Not sure how real world this use case is.

If I remove these 2 cases

    Levenshtein.new_distance("a" * 100000, "hello")
    Levenshtein.new_distance("hello", "a" * 100000)

The memory consumption is still much higher, but it's not in MBs anymore.

But notice the results, for some reason it's now slower than original? 😕

old   1.23M (812.94ns) (± 0.82%)    208B/op        fastest
new 511.08k (  1.96µs) (± 0.90%)  2.37kB/op   2.41× slower

Code:

# From https://github.com/crystal-lang/crystal/pull/11370
Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic")
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    # Levenshtein.distance("a" * 100000, "hello")
    # Levenshtein.distance("hello", "a" * 100000)
  end

  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo")
    LevenshteinNew.distance("こんにちは", "こんちは")
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz")
    # LevenshteinNew.distance("a" * 100000, "hello")
    # LevenshteinNew.distance("hello", "a" * 100000)
  end
end

vlazar avatar Oct 28 '21 06:10 vlazar

I've run cases 1 by 1 and added results near each one that's slower. The most slowdown seems to be for short strings with small distance, especially for Unicode string.

Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic") # 1.15× slower
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    Levenshtein.distance("a" * 100000, "hello") # 1.75× slower
    Levenshtein.distance("hello", "a" * 100000) # 1.80× slower
  end

  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo") # 2.06× slower
    LevenshteinNew.distance("こんにちは", "こんちは") # 4.65× slower
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz") # 1.47× slower
    LevenshteinNew.distance("a" * 100000, "hello")
    LevenshteinNew.distance("hello", "a" * 100000)
  end
end

I wonder what cases would be used most frequently in the real world 🙂 It would be nice to optimize for most those.

vlazar avatar Oct 28 '21 07:10 vlazar

I'm pretty sure that short strings are a very common use case for Levenshtein distance (thing search query strings for example). So we should make sure that performance does at least not deteriorate for short strings. There should be individual benchmarks for different string lengths.

straight-shoota avatar Oct 28 '21 07:10 straight-shoota

I'm pretty sure that short strings are a very common use case for Levenshtein distance (thing search query strings for example). So we should make sure that performance does at least not deteriorate for short strings. There should be individual benchmarks for different string lengths.

I know why this is happening. For short Unicode strings the dictionary used is a hash. That is created and populated before traversing the columns of the matrix, but that step takes times and if the string is short the improved big O times of the algorithm doesn't get a chance to offset that. In my early testing I realized that the cutoff point for unicode was about 30 chars or so.

The fix is pretty easy, I would just call the old algorithm if the strings were below the cutoff length.

The StaticArray (ascii) method was always faster for me, so there is a regression somewhere, and I have a good idea where that might be.

If anyone has any suggestion on an efficient way to store and retrieve bit information in Crystal that would help a lot. The algorithm needs 2 things:

  • A dictionary that maps Char to an int of size (word width). Dictionary only needs to store (word width) number of entries).
  • An array of bits of size equal to the longer string.

I just realized there was a bitarray in Crystal and might look at that, but if I am not mistaken it stores every bit as a U32.

One thing to note is memory consumption has grown in new implementation.

I know the reason for this. The algorithm stores a bit array of length = longest string, and I encoded each bit as a U64 on 64 bit systems. There is potential to reduce the memory consumption by quite a bit.

darkstego avatar Oct 28 '21 11:10 darkstego

Out of curiosity...

Looking at the JS library, it seems that the Hash they use has a fixed size of 0x10000:

https://github.com/ka-weihe/fastest-levenshtein/blob/6c5236b00d04a30428e4fac23b2840222d931f05/mod.ts#L1

Does that mean that the library will fail for unicode codepoints that exceed 0x10000?

asterite avatar Oct 28 '21 11:10 asterite

I just realized there was a bitarray in Crystal and might look at that, but if I am not mistaken it stores every bit as a U32.

UInt32 is used for internal representation, but the size can be bigger than 32. It just allocates as many UInt32 as necessary to represent the size.

straight-shoota avatar Oct 28 '21 11:10 straight-shoota

Yes, it probably will fail for strings containing characters outside the BMP. JavaScript's String.prototype.charCodeAt says:

If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point.

So any two characters that differ in both their high and low surrogate parts will be doubly counted there, e.g. U+10000 𐀀 v.s. U+1F602 😂. (The index there is in terms of UTF-16 code units, not characters.) We don't have an equivalent for that method so we're fine.

HertzDevil avatar Oct 28 '21 11:10 HertzDevil

Does that mean that the library will fail for unicode codepoints that exceed 0x10000?

Correct. That library will fail if you use emojis.

darkstego avatar Oct 28 '21 12:10 darkstego

So I hope this is really close to the final version. I ran the following benchmark test.

These are the results.

Original: ASCII-Size:3  13.04M ( 76.68ns) (± 4.41%)  32.0B/op   3.34× slower
     New: ASCII-Size:3  43.51M ( 22.98ns) (± 1.35%)   0.0B/op        fastest
Original: ASCII-Size:6   8.34M (119.85ns) (± 3.06%)  32.0B/op   3.28× slower
     New: ASCII-Size:6  27.40M ( 36.49ns) (± 1.48%)   0.0B/op        fastest
Original: ASCII-Size:12   3.38M (296.29ns) (± 3.73%)  64.0B/op   4.81× slower
     New: ASCII-Size:12  16.23M ( 61.62ns) (± 1.18%)   0.0B/op        fastest
Original: ASCII-Size:24 896.63k (  1.12µs) (± 1.16%)  112B/op   9.46× slower
     New: ASCII-Size:24   8.48M (117.88ns) (± 2.97%)  0.0B/op        fastest
Original: ASCII-Size:50 218.48k (  4.58µs) (± 1.25%)   208B/op  10.97× slower
     New: ASCII-Size:50   2.40M (417.16ns) (± 4.15%)  33.0B/op        fastest
Original: ASCII-Size:100  55.26k ( 18.09µs) (± 3.09%)   448B/op  14.41× slower
     New: ASCII-Size:100 796.32k (  1.26µs) (± 4.16%)  64.0B/op        fastest
Original: ASCII-Size:500   2.20k (455.57µs) (± 2.84%)  2.0kB/op  23.42× slower
     New: ASCII-Size:500  51.42k ( 19.45µs) (± 3.39%)   161B/op        fastest
Original: ASCII-Size:1000 557.76  (  1.79ms) (± 1.55%)  3.93kB/op  23.42× slower
     New: ASCII-Size:1000  13.06k ( 76.55µs) (± 3.02%)    289B/op        fastest
Original: Unicode-Size:3   4.83M (206.97ns) (± 4.33%)  80.0B/op        fastest
     New: Unicode-Size:3   4.77M (209.59ns) (± 4.67%)  80.0B/op   1.01× slower
Original: Unicode-Size:6   3.25M (307.88ns) (± 2.89%)  96.0B/op        fastest
     New: Unicode-Size:6   3.14M (318.42ns) (± 3.10%)  96.0B/op   1.03× slower
Original: Unicode-Size:12   1.61M (622.76ns) (± 5.19%)  161B/op   1.02× slower
     New: Unicode-Size:12   1.63M (613.43ns) (± 2.51%)  161B/op        fastest
Original: Unicode-Size:24 549.65k (  1.82µs) (± 3.14%)  257B/op        fastest
     New: Unicode-Size:24 547.07k (  1.83µs) (± 5.18%)  257B/op   1.00× slower
Original: Unicode-Size:50 161.32k (  6.20µs) (± 3.51%)  449B/op   1.02× slower
     New: Unicode-Size:50 164.46k (  6.08µs) (± 3.95%)  449B/op        fastest
Original: Unicode-Size:100  50.01k ( 20.00µs) (± 2.05%)    928B/op   2.14× slower
     New: Unicode-Size:100 107.00k (  9.35µs) (± 3.26%)  2.06kB/op        fastest
Original: Unicode-Size:500   2.40k (417.22µs) (± 1.59%)  4.03kB/op   4.09× slower
     New: Unicode-Size:500   9.79k (102.10µs) (± 1.85%)  3.72kB/op        fastest
Original: Unicode-Size:1000 623.88  (  1.60ms) (± 1.09%)  7.87kB/op   4.33× slower
     New: Unicode-Size:1000   2.70k (370.55µs) (± 2.02%)  5.76kB/op        fastest

This new implementation is 3x faster for short ASCII strings and over 20x for long strings. For Unicode the speed is the same for short strings (same algorithm) but once you get over 64 chars long, the Myers algorithm is used and gets much faster as the strings get longer.

Also, the memory usage is overall better than the old implementation.

The unexpected regression I had was that when moving to 64 bit implementation for 64 bit systems the performance took a nosedive for short strings, even with the same code. The 64 bit was faster than the 32 bit one on long strings. So I added a simplified method for short strings that works really well.

darkstego avatar Oct 28 '21 17:10 darkstego

Now on this benchmark it's 2.5x faster for me https://github.com/crystal-lang/crystal/pull/11370#issuecomment-952428372

For the one comparing to NodeJS it's 1.9x faster (a bit lower than 2.4x at this point https://github.com/crystal-lang/crystal/pull/11370#issuecomment-953536196)

And memory consumption is now much better!

Thank you @darkstego !

vlazar avatar Oct 28 '21 18:10 vlazar

For the one comparing to NodeJS it's 1.9x faster (a bit lower than 2.4x at this point https://github.com/crystal-lang/crystal/pull/11370#issuecomment-953536196)

And memory consumption is now much better!

It is indeed a bit slower than the previous commits in the extremely long strings while being faster in strings that under 200 or so in length. I don't remember the exact number. This seems to be due to using BitArrays. The reading and writing to those arrays is slower since I am receiving and sending back a boolean. The setup is much faster though, but given a long enough string the array method is slightly faster at the cost of much higher memory usage.

darkstego avatar Oct 28 '21 19:10 darkstego

@darkstego No worries at all! I was just sharing my numbers for the record (for prospective). Current code is much more balanced and optimized for more realistic use cases.

vlazar avatar Oct 28 '21 19:10 vlazar

Another improvement just added. I realized that if a cutoff is given to the algorithm it can abort as soon as it knows the distance will be larger than the cutoff. This makes for a huge improvement when searching for the best match among long strings. In the NodeJS example you can now find the best match in 5x speedup over the NodeJS example by using a cutoff. The distance does end up with an optional cutoff parameter.

Sorry the commits weren't squashed. I thought I did that before uploading.

darkstego avatar Oct 29 '21 04:10 darkstego

@darkstego I've tried running benchmark you mentioned here https://github.com/crystal-lang/crystal/pull/11370#issuecomment-954044475 and ran into this error:

% ./levenshtein_benchmark
Original: ASCII-Size:3  14.23M ( 70.29ns) (± 1.19%)  32.0B/op   1.59× slower
     New: ASCII-Size:3  22.69M ( 44.08ns) (± 4.48%)   0.0B/op        fastest
Original: ASCII-Size:6   5.73M (174.59ns) (± 4.42%)  32.0B/op   2.68× slower
     New: ASCII-Size:6  15.33M ( 65.21ns) (± 4.50%)   0.0B/op        fastest
Original: ASCII-Size:12   1.83M (545.93ns) (± 3.55%)  64.0B/op   4.99× slower
     New: ASCII-Size:12   9.13M (109.50ns) (± 3.84%)   0.0B/op        fastest
Original: ASCII-Size:24 527.97k (  1.89µs) (± 3.47%)  112B/op   9.59× slower
     New: ASCII-Size:24   5.06M (197.50ns) (± 3.36%)  0.0B/op        fastest
Original: ASCII-Size:50 128.05k (  7.81µs) (± 3.68%)   208B/op  13.12× slower
     New: ASCII-Size:50   1.68M (595.30ns) (± 3.34%)  32.0B/op        fastest
Original: ASCII-Size:100  32.28k ( 30.97µs) (± 3.48%)   448B/op  16.99× slower
     New: ASCII-Size:100 548.43k (  1.82µs) (± 3.61%)  64.0B/op        fastest
Original: ASCII-Size:500   1.29k (773.83µs) (± 2.95%)  2.0kB/op  25.50× slower
     New: ASCII-Size:500  32.95k ( 30.35µs) (± 3.55%)   160B/op        fastest
Original: ASCII-Size:1000 322.31  (  3.10ms) (± 2.83%)  3.93kB/op  26.05× slower
     New: ASCII-Size:1000   8.40k (119.12µs) (± 3.26%)    288B/op        fastest
Original: Unicode-Size:3   5.64M (177.19ns) (± 1.32%)  80.0B/op        fastest
     New: Unicode-Size:3   5.35M (186.76ns) (± 0.94%)  80.0B/op   1.05× slower
Original: Unicode-Size:6   3.27M (305.84ns) (± 1.02%)  96.0B/op        fastest
     New: Unicode-Size:6   3.17M (315.20ns) (± 0.89%)  96.0B/op   1.03× slower
Unhandled exception: 0xd9dc out of char range (ArgumentError)
  from /usr/local/Cellar/crystal/1.2.1/src/pointer.cr:437:13 in '__crystal_main'
  from /usr/local/Cellar/crystal/1.2.1/src/crystal/main.cr:110:5 in 'main'

I've updated benchmark to the latest code here and it still gives me an error:

% ./levenshtein_benchmark
Original: ASCII-Size:3  14.13M ( 70.78ns) (± 1.45%)  32.0B/op   1.59× slower
     New: ASCII-Size:3  22.49M ( 44.47ns) (± 4.27%)   0.0B/op        fastest
Original: ASCII-Size:6   5.57M (179.47ns) (± 3.63%)  32.0B/op   2.78× slower
     New: ASCII-Size:6  15.52M ( 64.45ns) (± 4.01%)   0.0B/op        fastest
Original: ASCII-Size:12   1.90M (525.09ns) (± 4.73%)  64.0B/op   4.72× slower
     New: ASCII-Size:12   8.99M (111.22ns) (± 4.00%)   0.0B/op        fastest
Original: ASCII-Size:24 527.29k (  1.90µs) (± 3.69%)  112B/op   9.17× slower
     New: ASCII-Size:24   4.83M (206.84ns) (± 3.96%)  0.0B/op        fastest
Original: ASCII-Size:50 128.70k (  7.77µs) (± 3.14%)   208B/op  13.52× slower
     New: ASCII-Size:50   1.74M (574.90ns) (± 4.11%)  32.0B/op        fastest
Original: ASCII-Size:100  31.98k ( 31.27µs) (± 3.06%)   448B/op  17.40× slower
     New: ASCII-Size:100 556.38k (  1.80µs) (± 4.01%)  64.0B/op        fastest
Original: ASCII-Size:500   1.26k (795.30µs) (± 5.53%)  2.0kB/op  23.94× slower
     New: ASCII-Size:500  30.10k ( 33.22µs) (± 4.20%)   160B/op        fastest
Original: ASCII-Size:1000 320.76  (  3.12ms) (± 3.79%)  3.93kB/op  21.52× slower
     New: ASCII-Size:1000   6.90k (144.87µs) (± 4.14%)    288B/op        fastest
Original: Unicode-Size:3   6.05M (165.34ns) (± 1.33%)  80.0B/op        fastest
     New: Unicode-Size:3   5.78M (173.00ns) (± 1.21%)  80.0B/op   1.05× slower
Original: Unicode-Size:6   3.33M (300.39ns) (± 1.97%)  96.0B/op        fastest
     New: Unicode-Size:6   3.25M (307.39ns) (± 1.24%)  96.0B/op   1.02× slower
Original: Unicode-Size:12   1.21M (829.06ns) (± 4.56%)  160B/op        fastest
     New: Unicode-Size:12   1.17M (857.51ns) (± 4.81%)  160B/op   1.03× slower
Original: Unicode-Size:24 436.66k (  2.29µs) (± 4.00%)  256B/op   1.01× slower
     New: Unicode-Size:24 440.09k (  2.27µs) (± 3.40%)  256B/op        fastest
Original: Unicode-Size:50 120.21k (  8.32µs) (± 3.58%)  448B/op   1.00× slower
     New: Unicode-Size:50 120.39k (  8.31µs) (± 3.31%)  448B/op        fastest
Original: Unicode-Size:100  33.75k ( 29.63µs) (± 3.37%)    928B/op   3.55× slower
     New: Unicode-Size:100 119.78k (  8.35µs) (± 0.80%)  2.06kB/op        fastest
Unhandled exception: 0xdd53 out of char range (ArgumentError)
  from /usr/local/Cellar/crystal/1.2.1/src/pointer.cr:437:13 in '__crystal_main'
  from /usr/local/Cellar/crystal/1.2.1/src/crystal/main.cr:110:5 in 'main'
% crystal -v
Crystal 1.2.1 (2021-10-21)

LLVM: 11.1.0
Default target: x86_64-apple-macosx

% sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

vlazar avatar Oct 29 '21 06:10 vlazar

@vlazar Seems the testing code was generating illegal chars. I unfortunately didn't realize there was a gap in the unicode space that needed to be accounted for. Try revision 3 of the gist.

darkstego avatar Oct 29 '21 10:10 darkstego

@darkstego I've tried 3rd revision it and results looks suspicious now 😆

Original: ASCII-Size:3  84.00M ( 11.90ns) (± 5.95%)  0.0B/op        fastest
     New: ASCII-Size:3  83.62M ( 11.96ns) (± 6.11%)  0.0B/op   1.00× slower
Original: ASCII-Size:6  56.30M ( 17.76ns) (± 5.75%)  0.0B/op        fastest
     New: ASCII-Size:6  54.48M ( 18.36ns) (± 6.83%)  0.0B/op   1.03× slower
Original: ASCII-Size:12  81.12M ( 12.33ns) (± 6.54%)  0.0B/op   1.00× slower
     New: ASCII-Size:12  81.46M ( 12.28ns) (± 6.00%)  0.0B/op        fastest
Original: ASCII-Size:24  66.20M ( 15.11ns) (± 5.56%)  0.0B/op        fastest
     New: ASCII-Size:24  65.12M ( 15.36ns) (± 6.52%)  0.0B/op   1.02× slower
Original: ASCII-Size:50  48.46M ( 20.64ns) (± 6.71%)  0.0B/op        fastest
     New: ASCII-Size:50  47.85M ( 20.90ns) (± 5.60%)  0.0B/op   1.01× slower
Original: ASCII-Size:100  34.63M ( 28.88ns) (± 4.48%)  0.0B/op        fastest
     New: ASCII-Size:100  34.11M ( 29.32ns) (± 4.90%)  0.0B/op   1.02× slower
Original: ASCII-Size:500  10.51M ( 95.18ns) (± 3.72%)  0.0B/op        fastest
     New: ASCII-Size:500  10.46M ( 95.60ns) (± 3.80%)  0.0B/op   1.00× slower
Original: ASCII-Size:1000   6.31M (158.47ns) (± 4.93%)  0.0B/op        fastest
     New: ASCII-Size:1000   6.31M (158.59ns) (± 5.67%)  0.0B/op   1.00× slower
Original: Unicode-Size:3  82.80M ( 12.08ns) (± 6.39%)  0.0B/op        fastest
     New: Unicode-Size:3  81.51M ( 12.27ns) (± 6.77%)  0.0B/op   1.02× slower
Original: Unicode-Size:6  55.49M ( 18.02ns) (± 6.11%)  0.0B/op        fastest
     New: Unicode-Size:6  54.78M ( 18.25ns) (± 6.07%)  0.0B/op   1.01× slower
...

vlazar avatar Oct 29 '21 19:10 vlazar

@darkstego I've tried 3rd revision it and results looks suspicious now laughing

@vlazar Sorry about that. It was benchmarking two identical strings all the time. Fixed and tested. I also have a test to compare the distance results between the old and new methods to make sure the data is correct.

Is there a way to import a module from a file by a different name? There is a lot of copying and pasting that needs to be done to test and benchmark and I would ideally like to test the module that is in my git repo without having to move it around?

darkstego avatar Oct 29 '21 20:10 darkstego

@darkstego Looks great!

New results from me. All cases are now faster, except negligible slowdown on short Unicode strings fro your benchmarks for different string lengths.

  1. Comparing to NodeJS on this benchmark from original report #11335

Crystal (now 1.6x faster than NodeJS from previously being 8x slower)

Time (mean ± σ):      4.404 s ±  0.084 s    [User: 4.364 s, System: 0.019 s]
Range (min … max):    4.227 s …  4.505 s    10 runs

Crystal with cutoff from https://gist.github.com/darkstego/c41ea58505186542e4f9fdc5079fb1f4 is 6.6x faster than NodeJS

Time (mean ± σ):      1.103 s ±  0.020 s    [User: 1.093 s, System: 0.008 s]
Range (min … max):    1.073 s …  1.139 s    10 runs

NodeJS

Time (mean ± σ):      7.269 s ±  0.062 s    [User: 7.133 s, System: 0.183 s]
Range (min … max):    7.163 s …  7.335 s    10 runs
  1. Benchmark from https://github.com/crystal-lang/crystal/pull/11370#issuecomment-952428372 (short and long strings, small and big distances)
old 296.80  (  3.37ms) (± 2.34%)  195kB/op   2.63× slower
new 781.55  (  1.28ms) (± 3.19%)  244kB/op        fastest
  1. Your benchmark from https://gist.github.com/darkstego/b7f512780454c9088cc8b26a8a6af888
Original: ASCII-Size:3  13.98M ( 71.55ns) (± 1.22%)  32.0B/op   1.56× slower
     New: ASCII-Size:3  21.76M ( 45.95ns) (± 4.73%)   0.0B/op        fastest
Original: ASCII-Size:6   5.46M (183.25ns) (± 3.98%)  32.0B/op   2.66× slower
     New: ASCII-Size:6  14.53M ( 68.80ns) (± 4.53%)   0.0B/op        fastest
Original: ASCII-Size:12   1.81M (552.07ns) (± 3.60%)  64.0B/op   4.80× slower
     New: ASCII-Size:12   8.69M (115.12ns) (± 4.06%)   0.0B/op        fastest
Original: ASCII-Size:32 309.26k (  3.23µs) (± 2.91%)   144B/op   7.97× slower
     New: ASCII-Size:32   2.47M (405.65ns) (± 3.26%)  32.0B/op        fastest
Original: ASCII-Size:64  83.17k ( 12.02µs) (± 3.13%)   272B/op  16.74× slower
     New: ASCII-Size:64   1.39M (718.04ns) (± 3.50%)  32.0B/op        fastest
Original: ASCII-Size:100  36.12k ( 27.69µs) (± 3.39%)   448B/op  15.08× slower
     New: ASCII-Size:100 544.75k (  1.84µs) (± 2.76%)  64.0B/op        fastest
Original: ASCII-Size:500   1.60k (623.85µs) (± 5.44%)  2.0kB/op  20.85× slower
     New: ASCII-Size:500  33.43k ( 29.91µs) (± 5.14%)   160B/op        fastest
Original: ASCII-Size:1000 413.47  (  2.42ms) (± 3.95%)  3.93kB/op  21.37× slower
     New: ASCII-Size:1000   8.84k (113.18µs) (± 3.17%)    288B/op        fastest
Original: Unicode-Size:3   6.04M (165.61ns) (± 1.35%)  80.0B/op        fastest
     New: Unicode-Size:3   5.67M (176.31ns) (± 0.95%)  80.0B/op   1.06× slower
Original: Unicode-Size:6   3.39M (294.95ns) (± 1.40%)  96.0B/op        fastest
     New: Unicode-Size:6   3.25M (307.54ns) (± 0.93%)  96.0B/op   1.04× slower
Original: Unicode-Size:12   1.20M (830.25ns) (± 4.47%)  160B/op        fastest
     New: Unicode-Size:12   1.19M (837.91ns) (± 5.32%)  160B/op   1.01× slower
Original: Unicode-Size:32 269.84k (  3.71µs) (± 3.04%)  320B/op        fastest
     New: Unicode-Size:32 269.68k (  3.71µs) (± 3.21%)  320B/op   1.00× slower
Original: Unicode-Size:64  77.03k ( 12.98µs) (± 2.94%)    576B/op   3.31× slower
     New: Unicode-Size:64 254.94k (  3.92µs) (± 0.70%)  1.86kB/op        fastest
Original: Unicode-Size:100  33.37k ( 29.97µs) (± 3.51%)    928B/op   3.96× slower
     New: Unicode-Size:100 132.29k (  7.56µs) (± 0.67%)  2.06kB/op        fastest
Original: Unicode-Size:500   1.52k (655.94µs) (± 3.33%)  4.03kB/op   7.06× slower
     New: Unicode-Size:500  10.76k ( 92.96µs) (± 3.17%)  3.72kB/op        fastest
Original: Unicode-Size:1000 384.41  (  2.60ms) (± 5.13%)  7.87kB/op   7.94× slower
     New: Unicode-Size:1000   3.05k (327.82µs) (± 3.60%)  5.76kB/op        fastest
Total Errors = 0

vlazar avatar Oct 30 '21 15:10 vlazar

Anything holding this PR? It's so optimized and balanced. Excellent job from @darkstego

paulocoghi avatar Mar 24 '22 00:03 paulocoghi

Sorry for the very long delay @darkstego and thanks for the awesome work! It would be great to add a few test cases for longer strings, to make sure we are covering all the different algorithms that are implemented. I can take care of that if you prefer. 🙇

beta-ziliani avatar Mar 29 '22 17:03 beta-ziliani

I was actually thinking about moving this into a shard a while back, since I didn't know if it will be mainlined. The algorithm is much more efficient at the cost of more complex code.

@beta-ziliani I will see if I can track down some of the test code I had. Since the algorithm had a window size (32 or 64 depending on the architecture), it was important to test strings that were multiple window size in length as well as edge cases when the string is exactly equal to the window size (64 in my case). I believe I had a program that generated strings of random lengths and compared the results from my code to the base implementation just to be sure.

darkstego avatar Mar 29 '22 23:03 darkstego

I was actually thinking about moving this into a shard a while back, since I didn't know if it will be mainlined. The algorithm is much more efficient at the cost of more complex code.

I vote to mainline the code

paulocoghi avatar Mar 30 '22 11:03 paulocoghi