Implement Myers algorithm for Levenshtein distance calculation
This implements a faster algorithm for calculating Levenshtein distance. Fixes #11335.
The algorithm is 3x faster than the original one for short strings and 10x for long ones.
Could you please run crystal tool format?
BTW, this looks incredible!
I got these results against the benchmark from #8324. Note: the benchmark does not test the algorithm against longer strings.
old 404.67 ( 2.47ms) (± 3.54%) 195kB/op 1.72× slower
new 696.13 ( 1.44ms) (± 6.50%) 1.72MB/op fastest
The implementation could be faster still with some changes. A separate ASCII-only method would make it about 1.5x faster than this submitted implementation, but that isn't a very DRY approach and I went back and forth about implementing it. The only differences are that the character bit-vector store would not be conditionally defined and there would be no case statement to clear it.
The real value, though, is for long strings, because this algorithm is O(⌈m/w⌉·n) whereas the existing one is O(m·n). I am sure there are further optimizations that can be done. w is the word size and is currently set to 32, but it should be 64 on 64-bit systems, which would make it even faster on long strings.
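For readers unfamiliar with the algorithm, the core bit-parallel recurrence can be sketched in a few lines. This is an illustrative Ruby port, not the PR's code: Ruby's arbitrary-precision integers act as one wide word, so no blocking is needed, while the PR's actual Crystal code splits the pattern into w-bit blocks, which is where the O(⌈m/w⌉·n) bound comes from. The name myers_distance is made up for the sketch.

```ruby
# Bit-parallel Levenshtein distance (Myers/Hyyrö formulation), sketched
# with one arbitrarily wide machine word. pv/mv hold the +1/-1 vertical
# deltas of the DP column; score tracks the bottom cell of that column.
def myers_distance(a, b)
  a, b = b, a if a.length > b.length # keep the bit vectors over the shorter string
  return b.length if a.empty?

  m    = a.length
  mask = (1 << m) - 1
  last = 1 << (m - 1)

  # Per-character bitmask: bit i is set when a[i] == c
  peq = Hash.new(0)
  a.each_char.with_index { |c, i| peq[c] |= 1 << i }

  pv = mask # vertical deltas of +1 (column starts as 0,1,2,...,m)
  mv = 0    # vertical deltas of -1
  score = m

  b.each_char do |c|
    eq = peq[c]
    xv = eq | mv
    xh = (((eq & pv) + pv) ^ pv) | eq  # carry chain finds matching diagonals
    ph = mv | (~(xh | pv) & mask)      # horizontal +1 deltas
    mh = pv & xh                       # horizontal -1 deltas
    score += 1 unless (ph & last).zero?
    score -= 1 unless (mh & last).zero?
    ph = ((ph << 1) | 1) & mask        # the |1 encodes the 0,1,2,... top row
    mh = (mh << 1) & mask
    pv = mh | (~(xv | ph) & mask)
    mv = ph & xv
  end
  score
end
```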
I've tried benchmarking a new implementation using this code https://github.com/crystal-lang/crystal/issues/11335#issuecomment-947410609
- MacOS Catalina 10.15.7
- 2.3 GHz Quad-Core Intel Core i7
- NodeJS 16.13.0
- Crystal 1.2.1 (LLVM 11.1.0)
Now it's just a tiny bit slower than what is presumably the fastest Levenshtein distance algorithm for JavaScript, from the fastest-levenshtein npm package:
Crystal 1.2.1
% hyperfine ./lev_myers
Benchmark #1: ./lev_builtin_myers
Time (mean ± σ): 7.616 s ± 0.146 s [User: 7.585 s, System: 0.031 s]
Range (min … max): 7.355 s … 7.805 s 10 runs
NodeJS 16.13.0
% hyperfine 'node lev_node.js'
Benchmark #1: node lev_node.js
Time (mean ± σ): 7.194 s ± 0.066 s [User: 7.053 s, System: 0.189 s]
Range (min … max): 7.093 s … 7.331 s 10 runs
So I decided to really optimize for performance. This latest commit should get a 1.8x speedup over my old commit on long ASCII texts.
One potential avenue for further speedup is making an Int64 version that would be compiled on 64-bit systems.
Sorry about the constant formatting fixes. I don't know why my pre-commit hooks aren't working.
So I finally have what I feel is a complete implementation. I implemented a 64-bit version of the algorithm that is used when compiling on 64-bit systems. I implemented the ASCII and Unicode versions through macros, mainly because I kept forgetting to implement changes in both codebases. They diverge in a few spots, mainly to do with the dictionary implementation, and those spots are commented, so it isn't bad at all.
The performance is really good now, currently clocking in at 30x over the original at a string length of 1500 on 64-bit systems:
Default 210.39 ( 4.75ms) (± 0.49%) 5.83kB/op 30.73× slower
Orig PR 1.86k (538.91µs) (± 3.07%) 13.9kB/op 3.48× slower
Latest 6.46k (154.69µs) (± 1.57%) 23.3kB/op fastest
That is over 3x the speed of the original PR.
I did run a test to compare against the NodeJS implementation and it was over twice as fast on my system. I was using the method described in this #11335 comment.
- Linux 5.14.7-2-MANJARO
- AMD Ryzen Threadripper 2950X 16-Core Processor
- Node 16.11.0
Node JS
real 0m4.905s
user 0m4.868s
sys 0m0.020s
Crystal
real 0m1.993s
user 0m2.011s
sys 0m0.071s
🚀 Wow, with the latest changes this implementation is now 2.4x faster than NodeJS on this benchmark for me https://github.com/crystal-lang/crystal/issues/11335#issuecomment-947410609
Great job @darkstego !
One thing to note is that memory consumption has grown in the new implementation.
I've run the benchmark already mentioned here https://github.com/crystal-lang/crystal/pull/11370#issuecomment-952428372
My numbers:
old 305.94 ( 3.27ms) (± 3.08%) 195kB/op 1.59× slower
new 486.96 ( 2.05ms) (± 1.34%) 3.24MB/op fastest
It uses a case with 100,000-character-long strings though. Not sure how real-world this use case is.
If I remove these 2 cases:
LevenshteinNew.distance("a" * 100000, "hello")
LevenshteinNew.distance("hello", "a" * 100000)
The memory consumption is still much higher, but it's not in MBs anymore.
But notice the results: for some reason the new one is now slower than the original? 😕
old 1.23M (812.94ns) (± 0.82%) 208B/op fastest
new 511.08k ( 1.96µs) (± 0.90%) 2.37kB/op 2.41× slower
Code:
# From https://github.com/crystal-lang/crystal/pull/11370
Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic")
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    # Levenshtein.distance("a" * 100000, "hello")
    # Levenshtein.distance("hello", "a" * 100000)
  end

  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo")
    LevenshteinNew.distance("こんにちは", "こんちは")
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz")
    # LevenshteinNew.distance("a" * 100000, "hello")
    # LevenshteinNew.distance("hello", "a" * 100000)
  end
end
I've run the cases one by one and added results near each one that's slower. The biggest slowdown seems to be for short strings with a small distance, especially for Unicode strings.
Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic") # 1.15× slower
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    Levenshtein.distance("a" * 100000, "hello")  # 1.75× slower
    Levenshtein.distance("hello", "a" * 100000)  # 1.80× slower
  end

  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo")      # 2.06× slower
    LevenshteinNew.distance("こんにちは", "こんちは") # 4.65× slower
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz")   # 1.47× slower
    LevenshteinNew.distance("a" * 100000, "hello")
    LevenshteinNew.distance("hello", "a" * 100000)
  end
end
I wonder which cases would be used most frequently in the real world 🙂 It would be nice to optimize for those the most.
I'm pretty sure that short strings are a very common use case for Levenshtein distance (think search query strings, for example). So we should make sure that performance at least does not deteriorate for short strings. There should be individual benchmarks for different string lengths.
I'm pretty sure that short strings are a very common use case for Levenshtein distance (think search query strings, for example). So we should make sure that performance at least does not deteriorate for short strings. There should be individual benchmarks for different string lengths.
I know why this is happening. For short Unicode strings the dictionary used is a hash. It is created and populated before traversing the columns of the matrix, but that step takes time, and if the string is short the improved big-O complexity of the algorithm doesn't get a chance to offset it. In my early testing I realized that the cutoff point for Unicode was about 30 chars or so.
The fix is pretty easy: I would just call the old algorithm if the strings were below the cutoff length.
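That dispatch-by-length fix is straightforward to picture. Below is an illustrative Ruby sketch of the fallback side only; classic_distance and UNICODE_CUTOFF are made-up names, with 30 taken from the crossover estimate above. The long-string branch would call the Myers bit-parallel path instead.

```ruby
# Rough crossover point below which the hash/bit-vector setup costs more
# than it saves (estimated at ~30 chars in the comment above).
UNICODE_CUTOFF = 30

# Classic Wagner-Fischer distance with a single rolling row; this is the
# kind of "old algorithm" the short-string path would fall back to.
def classic_distance(a, b)
  return b.length if a.empty?
  return a.length if b.empty?
  row = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    prev   = row[0] # D[i-1][0]
    row[0] = i
    b.each_char.with_index(1) do |cb, j|
      cur = row[j]  # D[i-1][j], needed as the diagonal for j+1
      row[j] = [row[j] + 1, row[j - 1] + 1, prev + (ca == cb ? 0 : 1)].min
      prev = cur
    end
  end
  row[-1]
end
```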
The StaticArray (ascii) method was always faster for me, so there is a regression somewhere, and I have a good idea where that might be.
If anyone has any suggestions on an efficient way to store and retrieve bit information in Crystal, that would help a lot. The algorithm needs 2 things:
- A dictionary that maps Char to an int of size (word width). The dictionary only needs to store (word width) entries.
- An array of bits of size equal to the longer string.
I just realized there is a BitArray in Crystal and I might look at that, but if I am not mistaken it stores every bit as a UInt32.
One thing to note is memory consumption has grown in new implementation.
I know the reason for this. The algorithm stores a bit array of length equal to the longest string, and I encoded each bit as a UInt64 on 64-bit systems. There is potential to reduce the memory consumption by quite a bit.
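The reduction being discussed is the standard packing trick: 64 flags per word instead of one whole word per flag, trading a shift and mask per access for ~64x less memory. A minimal Ruby sketch with made-up names:

```ruby
# Packed bit set: bit i lives in word i/64, at position i%64.
class PackedBits
  def initialize(size)
    @words = Array.new((size + 63) / 64, 0) # one word per 64 flags
  end

  def set(i)
    @words[i >> 6] |= 1 << (i & 63)
  end

  def clear(i)
    @words[i >> 6] &= ~(1 << (i & 63))
  end

  def get(i)
    (@words[i >> 6] >> (i & 63)) & 1 == 1
  end
end
```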
Out of curiosity...
Looking at the JS library, it seems that the Hash they use has a fixed size of 0x10000:
https://github.com/ka-weihe/fastest-levenshtein/blob/6c5236b00d04a30428e4fac23b2840222d931f05/mod.ts#L1
Does that mean that the library will fail for unicode codepoints that exceed 0x10000?
I just realized there is a BitArray in Crystal and I might look at that, but if I am not mistaken it stores every bit as a UInt32.
UInt32 is used for internal representation, but the size can be bigger than 32. It just allocates as many UInt32 as necessary to represent the size.
Yes, it probably will fail for strings containing characters outside the BMP. JavaScript's String.prototype.charCodeAt says:
If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point.
So any two characters that differ in both their high and low surrogate parts will be doubly counted there, e.g. U+10000 𐀀 vs. U+1F602 😂. (The index there is in terms of UTF-16 code units, not characters.) We don't have an equivalent for that method, so we're fine.
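A quick way to see the double-counting: characters outside the Basic Multilingual Plane occupy two UTF-16 code units (a surrogate pair), so an index over UTF-16 code units disagrees with a character count. Illustrative Ruby; utf16_units is a made-up helper:

```ruby
# Count UTF-16 code units, the unit charCodeAt indexes by in JavaScript.
def utf16_units(str)
  str.encode("UTF-16LE").bytesize / 2
end

utf16_units("hello") # 5 code units, 5 characters
utf16_units("😂")    # 2 code units (a surrogate pair) for 1 character
"😂".length           # Ruby counts characters: 1
```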
Does that mean that the library will fail for unicode codepoints that exceed 0x10000?
Correct. That library will fail if you use emojis.
So I hope this is really close to the final version. I ran the following benchmark test.
These are the results.
Original: ASCII-Size:3 13.04M ( 76.68ns) (± 4.41%) 32.0B/op 3.34× slower
New: ASCII-Size:3 43.51M ( 22.98ns) (± 1.35%) 0.0B/op fastest
Original: ASCII-Size:6 8.34M (119.85ns) (± 3.06%) 32.0B/op 3.28× slower
New: ASCII-Size:6 27.40M ( 36.49ns) (± 1.48%) 0.0B/op fastest
Original: ASCII-Size:12 3.38M (296.29ns) (± 3.73%) 64.0B/op 4.81× slower
New: ASCII-Size:12 16.23M ( 61.62ns) (± 1.18%) 0.0B/op fastest
Original: ASCII-Size:24 896.63k ( 1.12µs) (± 1.16%) 112B/op 9.46× slower
New: ASCII-Size:24 8.48M (117.88ns) (± 2.97%) 0.0B/op fastest
Original: ASCII-Size:50 218.48k ( 4.58µs) (± 1.25%) 208B/op 10.97× slower
New: ASCII-Size:50 2.40M (417.16ns) (± 4.15%) 33.0B/op fastest
Original: ASCII-Size:100 55.26k ( 18.09µs) (± 3.09%) 448B/op 14.41× slower
New: ASCII-Size:100 796.32k ( 1.26µs) (± 4.16%) 64.0B/op fastest
Original: ASCII-Size:500 2.20k (455.57µs) (± 2.84%) 2.0kB/op 23.42× slower
New: ASCII-Size:500 51.42k ( 19.45µs) (± 3.39%) 161B/op fastest
Original: ASCII-Size:1000 557.76 ( 1.79ms) (± 1.55%) 3.93kB/op 23.42× slower
New: ASCII-Size:1000 13.06k ( 76.55µs) (± 3.02%) 289B/op fastest
Original: Unicode-Size:3 4.83M (206.97ns) (± 4.33%) 80.0B/op fastest
New: Unicode-Size:3 4.77M (209.59ns) (± 4.67%) 80.0B/op 1.01× slower
Original: Unicode-Size:6 3.25M (307.88ns) (± 2.89%) 96.0B/op fastest
New: Unicode-Size:6 3.14M (318.42ns) (± 3.10%) 96.0B/op 1.03× slower
Original: Unicode-Size:12 1.61M (622.76ns) (± 5.19%) 161B/op 1.02× slower
New: Unicode-Size:12 1.63M (613.43ns) (± 2.51%) 161B/op fastest
Original: Unicode-Size:24 549.65k ( 1.82µs) (± 3.14%) 257B/op fastest
New: Unicode-Size:24 547.07k ( 1.83µs) (± 5.18%) 257B/op 1.00× slower
Original: Unicode-Size:50 161.32k ( 6.20µs) (± 3.51%) 449B/op 1.02× slower
New: Unicode-Size:50 164.46k ( 6.08µs) (± 3.95%) 449B/op fastest
Original: Unicode-Size:100 50.01k ( 20.00µs) (± 2.05%) 928B/op 2.14× slower
New: Unicode-Size:100 107.00k ( 9.35µs) (± 3.26%) 2.06kB/op fastest
Original: Unicode-Size:500 2.40k (417.22µs) (± 1.59%) 4.03kB/op 4.09× slower
New: Unicode-Size:500 9.79k (102.10µs) (± 1.85%) 3.72kB/op fastest
Original: Unicode-Size:1000 623.88 ( 1.60ms) (± 1.09%) 7.87kB/op 4.33× slower
New: Unicode-Size:1000 2.70k (370.55µs) (± 2.02%) 5.76kB/op fastest
This new implementation is 3x faster for short ASCII strings and over 20x for long strings. For Unicode the speed is the same for short strings (same algorithm) but once you get over 64 chars long, the Myers algorithm is used and gets much faster as the strings get longer.
Also, the memory usage is overall better than the old implementation.
The unexpected regression I had was that when moving to the 64-bit implementation for 64-bit systems, performance took a nosedive for short strings, even with the same code. The 64-bit version was faster than the 32-bit one on long strings. So I added a simplified method for short strings that works really well.
Now on this benchmark it's 2.5x faster for me https://github.com/crystal-lang/crystal/pull/11370#issuecomment-952428372
For the one comparing to NodeJS it's 1.9x faster (a bit lower than 2.4x at this point https://github.com/crystal-lang/crystal/pull/11370#issuecomment-953536196)
And memory consumption is now much better!
Thank you @darkstego !
For the one comparing to NodeJS it's 1.9x faster (a bit lower than 2.4x at this point https://github.com/crystal-lang/crystal/pull/11370#issuecomment-953536196)
And memory consumption is now much better!
It is indeed a bit slower than the previous commits on extremely long strings, while being faster on strings that are under 200 or so in length; I don't remember the exact number. This seems to be due to using BitArrays: reading and writing those arrays is slower since I am receiving and sending back a Boolean. The setup is much faster, though, but given a long enough string the plain array method is slightly faster, at the cost of much higher memory usage.
@darkstego No worries at all! I was just sharing my numbers for the record (for perspective). The current code is much more balanced and optimized for more realistic use cases.
Another improvement just added: I realized that if a cutoff is given to the algorithm, it can abort as soon as it knows the distance will be larger than the cutoff. This makes for a huge improvement when searching for the best match among long strings. In the NodeJS example you can now find the best match with a 5x speedup by using a cutoff. The distance method now takes an optional cutoff parameter.
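The early-abort idea is easiest to see on the plain DP: the minimum value in a row never decreases from one row to the next, so once every cell in the current row exceeds the cutoff, the final distance must too, and the scan can stop. Illustrative Ruby with a made-up name and a nil return for "over cutoff"; the PR's actual API may differ:

```ruby
# Cutoff-aware distance: returns the distance if it is <= cutoff,
# otherwise nil as soon as that becomes certain.
def distance_with_cutoff(a, b, cutoff)
  return (b.length <= cutoff ? b.length : nil) if a.empty?
  row = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    prev   = row[0]
    row[0] = i
    min    = i
    b.each_char.with_index(1) do |cb, j|
      cur = row[j]
      row[j] = [row[j] + 1, row[j - 1] + 1, prev + (ca == cb ? 0 : 1)].min
      prev = cur
      min  = row[j] if row[j] < min
    end
    return nil if min > cutoff # row minima only grow, so we can abort here
  end
  row[-1] <= cutoff ? row[-1] : nil
end
```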
Sorry the commits weren't squashed. I thought I did that before uploading.
@darkstego I've tried running the benchmark you mentioned here https://github.com/crystal-lang/crystal/pull/11370#issuecomment-954044475 and ran into this error:
% ./levenshtein_benchmark
Original: ASCII-Size:3 14.23M ( 70.29ns) (± 1.19%) 32.0B/op 1.59× slower
New: ASCII-Size:3 22.69M ( 44.08ns) (± 4.48%) 0.0B/op fastest
Original: ASCII-Size:6 5.73M (174.59ns) (± 4.42%) 32.0B/op 2.68× slower
New: ASCII-Size:6 15.33M ( 65.21ns) (± 4.50%) 0.0B/op fastest
Original: ASCII-Size:12 1.83M (545.93ns) (± 3.55%) 64.0B/op 4.99× slower
New: ASCII-Size:12 9.13M (109.50ns) (± 3.84%) 0.0B/op fastest
Original: ASCII-Size:24 527.97k ( 1.89µs) (± 3.47%) 112B/op 9.59× slower
New: ASCII-Size:24 5.06M (197.50ns) (± 3.36%) 0.0B/op fastest
Original: ASCII-Size:50 128.05k ( 7.81µs) (± 3.68%) 208B/op 13.12× slower
New: ASCII-Size:50 1.68M (595.30ns) (± 3.34%) 32.0B/op fastest
Original: ASCII-Size:100 32.28k ( 30.97µs) (± 3.48%) 448B/op 16.99× slower
New: ASCII-Size:100 548.43k ( 1.82µs) (± 3.61%) 64.0B/op fastest
Original: ASCII-Size:500 1.29k (773.83µs) (± 2.95%) 2.0kB/op 25.50× slower
New: ASCII-Size:500 32.95k ( 30.35µs) (± 3.55%) 160B/op fastest
Original: ASCII-Size:1000 322.31 ( 3.10ms) (± 2.83%) 3.93kB/op 26.05× slower
New: ASCII-Size:1000 8.40k (119.12µs) (± 3.26%) 288B/op fastest
Original: Unicode-Size:3 5.64M (177.19ns) (± 1.32%) 80.0B/op fastest
New: Unicode-Size:3 5.35M (186.76ns) (± 0.94%) 80.0B/op 1.05× slower
Original: Unicode-Size:6 3.27M (305.84ns) (± 1.02%) 96.0B/op fastest
New: Unicode-Size:6 3.17M (315.20ns) (± 0.89%) 96.0B/op 1.03× slower
Unhandled exception: 0xd9dc out of char range (ArgumentError)
from /usr/local/Cellar/crystal/1.2.1/src/pointer.cr:437:13 in '__crystal_main'
from /usr/local/Cellar/crystal/1.2.1/src/crystal/main.cr:110:5 in 'main'
I've updated the benchmark to the latest code here and it still gives me an error:
% ./levenshtein_benchmark
Original: ASCII-Size:3 14.13M ( 70.78ns) (± 1.45%) 32.0B/op 1.59× slower
New: ASCII-Size:3 22.49M ( 44.47ns) (± 4.27%) 0.0B/op fastest
Original: ASCII-Size:6 5.57M (179.47ns) (± 3.63%) 32.0B/op 2.78× slower
New: ASCII-Size:6 15.52M ( 64.45ns) (± 4.01%) 0.0B/op fastest
Original: ASCII-Size:12 1.90M (525.09ns) (± 4.73%) 64.0B/op 4.72× slower
New: ASCII-Size:12 8.99M (111.22ns) (± 4.00%) 0.0B/op fastest
Original: ASCII-Size:24 527.29k ( 1.90µs) (± 3.69%) 112B/op 9.17× slower
New: ASCII-Size:24 4.83M (206.84ns) (± 3.96%) 0.0B/op fastest
Original: ASCII-Size:50 128.70k ( 7.77µs) (± 3.14%) 208B/op 13.52× slower
New: ASCII-Size:50 1.74M (574.90ns) (± 4.11%) 32.0B/op fastest
Original: ASCII-Size:100 31.98k ( 31.27µs) (± 3.06%) 448B/op 17.40× slower
New: ASCII-Size:100 556.38k ( 1.80µs) (± 4.01%) 64.0B/op fastest
Original: ASCII-Size:500 1.26k (795.30µs) (± 5.53%) 2.0kB/op 23.94× slower
New: ASCII-Size:500 30.10k ( 33.22µs) (± 4.20%) 160B/op fastest
Original: ASCII-Size:1000 320.76 ( 3.12ms) (± 3.79%) 3.93kB/op 21.52× slower
New: ASCII-Size:1000 6.90k (144.87µs) (± 4.14%) 288B/op fastest
Original: Unicode-Size:3 6.05M (165.34ns) (± 1.33%) 80.0B/op fastest
New: Unicode-Size:3 5.78M (173.00ns) (± 1.21%) 80.0B/op 1.05× slower
Original: Unicode-Size:6 3.33M (300.39ns) (± 1.97%) 96.0B/op fastest
New: Unicode-Size:6 3.25M (307.39ns) (± 1.24%) 96.0B/op 1.02× slower
Original: Unicode-Size:12 1.21M (829.06ns) (± 4.56%) 160B/op fastest
New: Unicode-Size:12 1.17M (857.51ns) (± 4.81%) 160B/op 1.03× slower
Original: Unicode-Size:24 436.66k ( 2.29µs) (± 4.00%) 256B/op 1.01× slower
New: Unicode-Size:24 440.09k ( 2.27µs) (± 3.40%) 256B/op fastest
Original: Unicode-Size:50 120.21k ( 8.32µs) (± 3.58%) 448B/op 1.00× slower
New: Unicode-Size:50 120.39k ( 8.31µs) (± 3.31%) 448B/op fastest
Original: Unicode-Size:100 33.75k ( 29.63µs) (± 3.37%) 928B/op 3.55× slower
New: Unicode-Size:100 119.78k ( 8.35µs) (± 0.80%) 2.06kB/op fastest
Unhandled exception: 0xdd53 out of char range (ArgumentError)
from /usr/local/Cellar/crystal/1.2.1/src/pointer.cr:437:13 in '__crystal_main'
from /usr/local/Cellar/crystal/1.2.1/src/crystal/main.cr:110:5 in 'main'
% crystal -v
Crystal 1.2.1 (2021-10-21)
LLVM: 11.1.0
Default target: x86_64-apple-macosx
% sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
@vlazar It seems the testing code was generating illegal chars. I unfortunately didn't realize there was a gap in the Unicode space that needed to be accounted for. Try revision 3 of the gist.
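For the record, the gap in question is the surrogate range U+D800..U+DFFF, which contains no valid characters; a random-string generator just has to skip it. An illustrative Ruby sketch (the bounds and names here are assumptions, not the gist's actual code):

```ruby
# Code points U+D800..U+DFFF are UTF-16 surrogates, not characters, and
# cannot be encoded on their own; sampling them is what raised the
# "out of char range" errors above.
SURROGATES = 0xD800..0xDFFF

def random_unicode_string(len, max_cp = 0xFFFF, rng = Random.new)
  chars = []
  while chars.size < len
    cp = rng.rand(0x20..max_cp)    # skip control chars (assumption)
    next if SURROGATES.cover?(cp)  # skip the surrogate gap
    chars << cp.chr(Encoding::UTF_8)
  end
  chars.join
end
```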
@darkstego I've tried the 3rd revision and the results look suspicious now 😆
Original: ASCII-Size:3 84.00M ( 11.90ns) (± 5.95%) 0.0B/op fastest
New: ASCII-Size:3 83.62M ( 11.96ns) (± 6.11%) 0.0B/op 1.00× slower
Original: ASCII-Size:6 56.30M ( 17.76ns) (± 5.75%) 0.0B/op fastest
New: ASCII-Size:6 54.48M ( 18.36ns) (± 6.83%) 0.0B/op 1.03× slower
Original: ASCII-Size:12 81.12M ( 12.33ns) (± 6.54%) 0.0B/op 1.00× slower
New: ASCII-Size:12 81.46M ( 12.28ns) (± 6.00%) 0.0B/op fastest
Original: ASCII-Size:24 66.20M ( 15.11ns) (± 5.56%) 0.0B/op fastest
New: ASCII-Size:24 65.12M ( 15.36ns) (± 6.52%) 0.0B/op 1.02× slower
Original: ASCII-Size:50 48.46M ( 20.64ns) (± 6.71%) 0.0B/op fastest
New: ASCII-Size:50 47.85M ( 20.90ns) (± 5.60%) 0.0B/op 1.01× slower
Original: ASCII-Size:100 34.63M ( 28.88ns) (± 4.48%) 0.0B/op fastest
New: ASCII-Size:100 34.11M ( 29.32ns) (± 4.90%) 0.0B/op 1.02× slower
Original: ASCII-Size:500 10.51M ( 95.18ns) (± 3.72%) 0.0B/op fastest
New: ASCII-Size:500 10.46M ( 95.60ns) (± 3.80%) 0.0B/op 1.00× slower
Original: ASCII-Size:1000 6.31M (158.47ns) (± 4.93%) 0.0B/op fastest
New: ASCII-Size:1000 6.31M (158.59ns) (± 5.67%) 0.0B/op 1.00× slower
Original: Unicode-Size:3 82.80M ( 12.08ns) (± 6.39%) 0.0B/op fastest
New: Unicode-Size:3 81.51M ( 12.27ns) (± 6.77%) 0.0B/op 1.02× slower
Original: Unicode-Size:6 55.49M ( 18.02ns) (± 6.11%) 0.0B/op fastest
New: Unicode-Size:6 54.78M ( 18.25ns) (± 6.07%) 0.0B/op 1.01× slower
...
@darkstego I've tried the 3rd revision and the results look suspicious now 😆
@vlazar Sorry about that. It was benchmarking two identical strings all the time. Fixed and tested. I also have a test to compare the distance results between the old and new methods to make sure the data is correct.
Is there a way to import a module from a file under a different name? There is a lot of copying and pasting that needs to be done to test and benchmark, and I would ideally like to test the module that is in my git repo without having to move it around.
@darkstego Looks great!
New results from me. All cases are now faster, except a negligible slowdown on short Unicode strings from your benchmarks for different string lengths.
- Comparing to NodeJS on this benchmark from original report #11335
Crystal (now 1.6x faster than NodeJS from previously being 8x slower)
Time (mean ± σ): 4.404 s ± 0.084 s [User: 4.364 s, System: 0.019 s]
Range (min … max): 4.227 s … 4.505 s 10 runs
Crystal with cutoff from https://gist.github.com/darkstego/c41ea58505186542e4f9fdc5079fb1f4 is 6.6x faster than NodeJS
Time (mean ± σ): 1.103 s ± 0.020 s [User: 1.093 s, System: 0.008 s]
Range (min … max): 1.073 s … 1.139 s 10 runs
NodeJS
Time (mean ± σ): 7.269 s ± 0.062 s [User: 7.133 s, System: 0.183 s]
Range (min … max): 7.163 s … 7.335 s 10 runs
- Benchmark from https://github.com/crystal-lang/crystal/pull/11370#issuecomment-952428372 (short and long strings, small and big distances)
old 296.80 ( 3.37ms) (± 2.34%) 195kB/op 2.63× slower
new 781.55 ( 1.28ms) (± 3.19%) 244kB/op fastest
- Your benchmark from https://gist.github.com/darkstego/b7f512780454c9088cc8b26a8a6af888
Original: ASCII-Size:3 13.98M ( 71.55ns) (± 1.22%) 32.0B/op 1.56× slower
New: ASCII-Size:3 21.76M ( 45.95ns) (± 4.73%) 0.0B/op fastest
Original: ASCII-Size:6 5.46M (183.25ns) (± 3.98%) 32.0B/op 2.66× slower
New: ASCII-Size:6 14.53M ( 68.80ns) (± 4.53%) 0.0B/op fastest
Original: ASCII-Size:12 1.81M (552.07ns) (± 3.60%) 64.0B/op 4.80× slower
New: ASCII-Size:12 8.69M (115.12ns) (± 4.06%) 0.0B/op fastest
Original: ASCII-Size:32 309.26k ( 3.23µs) (± 2.91%) 144B/op 7.97× slower
New: ASCII-Size:32 2.47M (405.65ns) (± 3.26%) 32.0B/op fastest
Original: ASCII-Size:64 83.17k ( 12.02µs) (± 3.13%) 272B/op 16.74× slower
New: ASCII-Size:64 1.39M (718.04ns) (± 3.50%) 32.0B/op fastest
Original: ASCII-Size:100 36.12k ( 27.69µs) (± 3.39%) 448B/op 15.08× slower
New: ASCII-Size:100 544.75k ( 1.84µs) (± 2.76%) 64.0B/op fastest
Original: ASCII-Size:500 1.60k (623.85µs) (± 5.44%) 2.0kB/op 20.85× slower
New: ASCII-Size:500 33.43k ( 29.91µs) (± 5.14%) 160B/op fastest
Original: ASCII-Size:1000 413.47 ( 2.42ms) (± 3.95%) 3.93kB/op 21.37× slower
New: ASCII-Size:1000 8.84k (113.18µs) (± 3.17%) 288B/op fastest
Original: Unicode-Size:3 6.04M (165.61ns) (± 1.35%) 80.0B/op fastest
New: Unicode-Size:3 5.67M (176.31ns) (± 0.95%) 80.0B/op 1.06× slower
Original: Unicode-Size:6 3.39M (294.95ns) (± 1.40%) 96.0B/op fastest
New: Unicode-Size:6 3.25M (307.54ns) (± 0.93%) 96.0B/op 1.04× slower
Original: Unicode-Size:12 1.20M (830.25ns) (± 4.47%) 160B/op fastest
New: Unicode-Size:12 1.19M (837.91ns) (± 5.32%) 160B/op 1.01× slower
Original: Unicode-Size:32 269.84k ( 3.71µs) (± 3.04%) 320B/op fastest
New: Unicode-Size:32 269.68k ( 3.71µs) (± 3.21%) 320B/op 1.00× slower
Original: Unicode-Size:64 77.03k ( 12.98µs) (± 2.94%) 576B/op 3.31× slower
New: Unicode-Size:64 254.94k ( 3.92µs) (± 0.70%) 1.86kB/op fastest
Original: Unicode-Size:100 33.37k ( 29.97µs) (± 3.51%) 928B/op 3.96× slower
New: Unicode-Size:100 132.29k ( 7.56µs) (± 0.67%) 2.06kB/op fastest
Original: Unicode-Size:500 1.52k (655.94µs) (± 3.33%) 4.03kB/op 7.06× slower
New: Unicode-Size:500 10.76k ( 92.96µs) (± 3.17%) 3.72kB/op fastest
Original: Unicode-Size:1000 384.41 ( 2.60ms) (± 5.13%) 7.87kB/op 7.94× slower
New: Unicode-Size:1000 3.05k (327.82µs) (± 3.60%) 5.76kB/op fastest
Total Errors = 0
Is anything holding this PR back? It's so optimized and balanced. Excellent job from @darkstego
Sorry for the very long delay @darkstego and thanks for the awesome work! It would be great to add a few test cases for longer strings, to make sure we are covering all the different algorithms that are implemented. I can take care of that if you prefer. 🙇
I was actually thinking about moving this into a shard a while back, since I didn't know if it would be mainlined. The algorithm is much more efficient, at the cost of more complex code.
@beta-ziliani I will see if I can track down some of the test code I had. Since the algorithm has a window size (32 or 64 depending on the architecture), it was important to test strings that are multiples of the window size in length, as well as edge cases where the string length is exactly equal to the window size (64 in my case). I believe I had a program that generated strings of random lengths and compared the results from my code to the base implementation, just to be sure.
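Such a cross-check harness might look like this. Illustrative Ruby: the single-word Myers routine here uses bignums, while the real Crystal code blocks the pattern into w-bit chunks, which is exactly why lengths at and around 32/64 deserve coverage; all names are made up for the sketch.

```ruby
# Ground truth: plain Wagner-Fischer with a rolling row.
def reference_distance(a, b)
  row = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    prev, row[0] = row[0], i
    b.each_char.with_index(1) do |cb, j|
      cur = row[j]
      row[j] = [row[j] + 1, row[j - 1] + 1, prev + (ca == cb ? 0 : 1)].min
      prev = cur
    end
  end
  row[-1]
end

# Implementation under test: single-word Myers bit-parallel distance.
def myers_distance(a, b)
  a, b = b, a if a.length > b.length
  return b.length if a.empty?
  m    = a.length
  mask = (1 << m) - 1
  last = 1 << (m - 1)
  peq  = Hash.new(0)
  a.each_char.with_index { |c, i| peq[c] |= 1 << i }
  pv, mv, score = mask, 0, m
  b.each_char do |c|
    eq = peq[c]
    xv = eq | mv
    xh = (((eq & pv) + pv) ^ pv) | eq
    ph = mv | (~(xh | pv) & mask)
    mh = pv & xh
    score += 1 unless (ph & last).zero?
    score -= 1 unless (mh & last).zero?
    ph = ((ph << 1) | 1) & mask
    mh = (mh << 1) & mask
    pv = mh | (~(xv | ph) & mask)
    mv = ph & xv
  end
  score
end

# Randomized cross-check at lengths straddling the 32/64-bit boundaries,
# where off-by-one errors in block handling are most likely to hide.
rng      = Random.new(1234)
alphabet = ("a".."e").to_a
[1, 31, 32, 33, 63, 64, 65, 127, 128].each do |len|
  10.times do
    a = Array.new(len) { alphabet.sample(random: rng) }.join
    b = Array.new(rng.rand(1..len + 5)) { alphabet.sample(random: rng) }.join
    r, m = reference_distance(a, b), myers_distance(a, b)
    raise "mismatch at len=#{len}: #{r} vs #{m}" unless r == m
  end
end
puts "all cross-checks passed"
```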
I was actually thinking about moving this into a shard a while back, since I didn't know if it would be mainlined. The algorithm is much more efficient, at the cost of more complex code.
I vote to mainline the code