Quickenshtein
Quickenshtein copied to clipboard
Added branchless Math.Min calculation
Based on this tweet: https://mobile.twitter.com/badamczewski01/status/1273759277707132928
Results
| Method | Runtime | NumberOfCharacters | Mean | Error | StdDev | Ratio | Speedup | Worthiness | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------------- |-------------- |------------------- |-------------------:|--------------------:|-------------------:|------:|--------:|-----------:|--------:|----------:|-----------:|-----------:|----------:|------------:|
| Baseline | .NET 4.7.2 | 0 | 220.189 ns | 24.1378 ns | 1.3231 ns | 1.000 | 1.00 | 1.00 | 0.00 | 2557 B | 0.1018 | - | - | 321 B |
| Quickenshtein | .NET 4.7.2 | 0 | 2.583 ns | 0.1834 ns | 0.0101 ns | 0.012 | 85.24 | 1,095.27 | 0.00 | 199 B | - | - | - | - |
| Quickenshtein_Threaded | .NET 4.7.2 | 0 | 1.790 ns | 0.5424 ns | 0.0297 ns | 0.008 | 123.05 | 1,757.70 | 0.00 | 179 B | - | - | - | - |
| Fastenshtein | .NET 4.7.2 | 0 | 2.489 ns | 1.4295 ns | 0.0784 ns | 0.011 | 88.51 | 689.97 | 0.00 | 328 B | - | - | - | - |
| Baseline | .NET Core 3.0 | 0 | 112.910 ns | 4.6300 ns | 0.2538 ns | 0.513 | 1.95 | 2.57 | 0.00 | 1938 B | 0.0764 | - | - | 240 B |
| Quickenshtein | .NET Core 3.0 | 0 | 4.067 ns | 0.1389 ns | 0.0076 ns | 0.018 | 54.14 | 538.69 | 0.00 | 257 B | - | - | - | - |
| Quickenshtein_Threaded | .NET Core 3.0 | 0 | 2.127 ns | 0.7267 ns | 0.0398 ns | 0.010 | 103.53 | 1,423.33 | 0.00 | 186 B | - | - | - | - |
| Fastenshtein | .NET Core 3.0 | 0 | 2.396 ns | 0.6282 ns | 0.0344 ns | 0.011 | 91.92 | 714.39 | 0.00 | 329 B | - | - | - | - |
| | | | | | | | | | | | | | | |
| Baseline | .NET 4.7.2 | 10 | 1,252.135 ns | 120.7927 ns | 6.6211 ns | 1.00 | 1.00 | 1.00 | 0.00 | 2557 B | 0.4463 | - | - | 1404 B |
| Quickenshtein | .NET 4.7.2 | 10 | 260.836 ns | 16.0352 ns | 0.8789 ns | 0.21 | 4.80 | 2.23 | 0.00 | 5501 B | - | - | - | - |
| Quickenshtein_Threaded | .NET 4.7.2 | 10 | 262.059 ns | 14.9552 ns | 0.8197 ns | 0.21 | 4.78 | 2.08 | 0.00 | 5878 B | - | - | - | - |
| Fastenshtein | .NET 4.7.2 | 10 | 258.354 ns | 38.7140 ns | 2.1220 ns | 0.21 | 4.85 | 37.79 | 0.00 | 328 B | 0.0200 | - | - | 64 B |
| Baseline | .NET Core 3.0 | 10 | 749.766 ns | 34.1184 ns | 1.8701 ns | 0.60 | 1.67 | 2.20 | 0.00 | 1938 B | 0.3443 | - | - | 1080 B |
| Quickenshtein | .NET Core 3.0 | 10 | 369.429 ns | 22.4171 ns | 1.2288 ns | 0.30 | 3.39 | 1.91 | 0.00 | 4536 B | - | - | - | - |
| Quickenshtein_Threaded | .NET Core 3.0 | 10 | 352.946 ns | 12.0771 ns | 0.6620 ns | 0.28 | 3.55 | 1.70 | 0.00 | 5335 B | - | - | - | - |
| Fastenshtein | .NET Core 3.0 | 10 | 237.126 ns | 23.6289 ns | 1.2952 ns | 0.19 | 5.28 | 41.04 | 0.00 | 329 B | 0.0200 | - | - | 64 B |
| | | | | | | | | | | | | | | |
| Baseline | .NET 4.7.2 | 400 | 889,130.599 ns | 66,393.8252 ns | 3,639.2696 ns | 1.00 | 1.00 | 1.00 | 0.00 | 2557 B | 142.5781 | 59.5703 | - | 668278 B |
| Quickenshtein | .NET 4.7.2 | 400 | 250,343.913 ns | 21,465.3674 ns | 1,176.5892 ns | 0.28 | 3.55 | 1.65 | 0.00 | 5501 B | - | - | - | - |
| Quickenshtein_Threaded | .NET 4.7.2 | 400 | 250,369.938 ns | 8,883.9636 ns | 486.9600 ns | 0.28 | 3.55 | 1.54 | 0.00 | 5878 B | - | - | - | - |
| Fastenshtein | .NET 4.7.2 | 400 | 438,383.122 ns | 39,211.2402 ns | 2,149.3004 ns | 0.49 | 2.03 | 15.81 | 0.00 | 328 B | 0.4883 | - | - | 1633 B |
| Baseline | .NET Core 3.0 | 400 | 865,893.945 ns | 72,297.3258 ns | 3,962.8604 ns | 0.97 | 1.03 | 1.35 | 0.00 | 1938 B | 121.0938 | 60.5469 | - | 657840 B |
| Quickenshtein | .NET Core 3.0 | 400 | 78,144.973 ns | 6,586.3546 ns | 361.0203 ns | 0.09 | 11.38 | 6.41 | 0.00 | 4536 B | - | - | - | - |
| Quickenshtein_Threaded | .NET Core 3.0 | 400 | 78,146.643 ns | 11,238.2866 ns | 616.0084 ns | 0.09 | 11.38 | 5.06 | 0.00 | 5754 B | - | - | - | - |
| Fastenshtein | .NET Core 3.0 | 400 | 357,707.454 ns | 18,138.3426 ns | 994.2238 ns | 0.40 | 2.49 | 19.32 | 0.00 | 329 B | 0.4883 | - | - | 1624 B |
| | | | | | | | | | | | | | | |
| Baseline | .NET 4.7.2 | 8000 | 393,884,266.667 ns | 49,006,820.6381 ns | 2,686,229.2053 ns | 1.00 | 1.00 | 1.00 | 0.00 | 2557 B | 44000.0000 | 23000.0000 | 4000.0000 | 256683568 B |
| Quickenshtein | .NET 4.7.2 | 8000 | 100,370,120.000 ns | 13,370,648.7190 ns | 732,890.3735 ns | 0.25 | 3.92 | 1.82 | 0.00 | 5501 B | - | - | - | - |
| Quickenshtein_Threaded | .NET 4.7.2 | 8000 | 26,717,654.167 ns | 36,141,920.5933 ns | 1,981,060.6232 ns | 0.07 | 14.80 | 6.83 | 0.01 | 5538 B | - | - | - | 1024 B |
| Fastenshtein | .NET 4.7.2 | 8000 | 159,648,433.333 ns | 28,111,703.1594 ns | 1,540,897.3089 ns | 0.41 | 2.47 | 19.23 | 0.00 | 328 B | - | - | - | 32048 B |
| Baseline | .NET Core 3.0 | 8000 | 433,869,533.333 ns | 248,847,936.6442 ns | 13,640,195.1072 ns | 1.10 | 0.91 | 1.28 | 0.04 | 1812 B | 44000.0000 | 23000.0000 | 4000.0000 | 256352240 B |
| Quickenshtein | .NET Core 3.0 | 8000 | 24,698,026.042 ns | 2,775,654.4912 ns | 152,142.9887 ns | 0.06 | 15.95 | 8.99 | 0.00 | 4536 B | - | - | - | 14 B |
| Quickenshtein_Threaded | .NET Core 3.0 | 8000 | 24,912,576.042 ns | 2,509,390.7247 ns | 137,548.1732 ns | 0.06 | 15.81 | 7.05 | 0.00 | 5731 B | - | - | - | - |
| Fastenshtein | .NET Core 3.0 | 8000 | 137,326,183.333 ns | 18,542,076.7650 ns | 1,016,353.7949 ns | 0.35 | 2.87 | 22.29 | 0.00 | 329 B | - | - | - | 32024 B |
Compared to Master .NET Framework - 10 Characters: 20% slower .NET Framework - 400 Characters: 2.7x faster .NET Framework - 8000 Characters: 1.27x faster .NET Framework (Threaded) - 8000 Characters: 1.53x faster
Note: When running the "Huge Text Benchmark", performance tanks - not sure why and may not even relate to the change:
| Method | Job | Runtime | NumberOfCharacters | Mean | Error | StdDev | Ratio | Speedup | Worthiness | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------------------- |----------- |-------------- |------------------- |------------:|----------:|----------:|------:|--------:|-----------:|--------:|----------:|------:|------:|------:|----------:|
| Quickenshtein_NoThreading | Job-VQPBMC | .NET 4.7.2 | 8192 | 132.62 ms | 0.713 ms | 0.596 ms | 1.00 | 1.00 | 1.00 | 0.00 | 5501 B | - | - | - | - |
| Quickenshtein_NoThreading | Job-LKMQCA | .NET Core 3.0 | 8192 | 22.46 ms | 0.168 ms | 0.157 ms | 0.17 | 5.90 | 7.15 | 0.00 | 4536 B | - | - | - | - |
| | | | | | | | | | | | | | | | |
| Quickenshtein_WithThreading | Job-VQPBMC | .NET 4.7.2 | 8192 | 31.17 ms | 0.621 ms | 1.428 ms | 1.00 | 1.00 | 1.00 | 0.00 | 5523 B | - | - | - | 1024 B |
| Quickenshtein_WithThreading | Job-LKMQCA | .NET Core 3.0 | 8192 | 17.73 ms | 0.208 ms | 0.194 ms | 0.58 | 1.73 | 0.90 | 0.02 | 10611 B | - | - | - | 800 B |
| | | | | | | | | | | | | | | | |
| Fastenshtein | Job-VQPBMC | .NET 4.7.2 | 8192 | 167.37 ms | 1.858 ms | 1.552 ms | 1.00 | 1.00 | 1.00 | 0.00 | 328 B | - | - | - | 32816 B |
| Fastenshtein | Job-LKMQCA | .NET Core 3.0 | 8192 | 166.87 ms | 0.518 ms | 0.433 ms | 1.00 | 1.00 | 1.00 | 0.01 | 329 B | - | - | - | 33126 B |
| | | | | | | | | | | | | | | | |
| Quickenshtein_NoThreading | Job-VQPBMC | .NET 4.7.2 | 32768 | 2,146.58 ms | 10.799 ms | 9.573 ms | 1.00 | 1.00 | 1.00 | 0.00 | 5501 B | - | - | - | - |
| Quickenshtein_NoThreading | Job-LKMQCA | .NET Core 3.0 | 32768 | 365.22 ms | 3.426 ms | 3.205 ms | 0.17 | 5.89 | 11.04 | 0.00 | 2933 B | - | - | - | 1344 B |
| | | | | | | | | | | | | | | | |
| Quickenshtein_WithThreading | Job-VQPBMC | .NET 4.7.2 | 32768 | 449.54 ms | 3.217 ms | 2.512 ms | 1.00 | 1.00 | 1.00 | 0.00 | 5523 B | - | - | - | 8192 B |
| Quickenshtein_WithThreading | Job-LKMQCA | .NET Core 3.0 | 32768 | 273.89 ms | 1.678 ms | 1.487 ms | 0.61 | 1.64 | 0.86 | 0.00 | 10577 B | - | - | - | 800 B |
| | | | | | | | | | | | | | | | |
| Fastenshtein | Job-VQPBMC | .NET 4.7.2 | 32768 | 2,703.89 ms | 27.841 ms | 24.680 ms | 1.00 | 1.00 | 1.00 | 0.00 | 328 B | - | - | - | 131096 B |
| Fastenshtein | Job-LKMQCA | .NET Core 3.0 | 32768 | 2,478.46 ms | 35.907 ms | 33.588 ms | 0.92 | 1.09 | 1.05 | 0.02 | 340 B | - | - | - | 131096 B |
Giving this another crack now...
.NET Framework Baseline
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1348 (21H1/May2021Update)
Intel Core i5-6600K CPU 3.50GHz (Skylake), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.100
[Host] : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
Framework : .NET Framework 4.8 (4.8.4420.0), X64 RyuJIT
Job=Framework Runtime=.NET Framework 4.7.2
| Method | NumberOfCharacters | Mean | Error | StdDev | Op/s | CacheMisses/Op | BranchMispredictions/Op | Code Size | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| Quickenshtein | 10 | 190.9 ns | 0.28 ns | 0.26 ns | 5,237,715.05 | 0 | 0 | 4,211 B | - |
| Quickenshtein | 400 | 473,919.8 ns | 743.11 ns | 695.10 ns | 2,110.06 | 235 | 70,856 | 4,211 B | - |
| Quickenshtein | 8000 | 73,353,386.7 ns | 872,388.31 ns | 816,032.56 ns | 13.63 | 39,673 | 4,212,053 | 4,211 B | - |
With Branchless Min
| Method | NumberOfCharacters | Mean | Error | StdDev | Op/s | CacheMisses/Op | BranchMispredictions/Op | Code Size | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| Quickenshtein | 10 | 219.2 ns | 0.23 ns | 0.20 ns | 4,562,704.02 | 0 | 0 | 6,007 B | - |
| Quickenshtein | 400 | 159,471.9 ns | 2,843.29 ns | 2,659.62 ns | 6,270.70 | 72 | 1,251 | 6,007 B | - |
| Quickenshtein | 8000 | 62,462,987.4 ns | 279,845.42 ns | 261,767.58 ns | 16.01 | 31,858 | 202,191 | 6,007 B | - |
❌ Branchless Min but without Outer Loop Unrolling
| Method | NumberOfCharacters | Mean | Error | StdDev | Op/s | BranchMispredictions/Op | CacheMisses/Op | Code Size | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| Quickenshtein | 10 | 212.6 ns | 0.28 ns | 0.26 ns | 4,704,467.28 | 0 | 0 | 2,432 B | - |
| Quickenshtein | 400 | 178,371.5 ns | 216.77 ns | 202.77 ns | 5,606.28 | 717 | 64 | 2,432 B | - |
| Quickenshtein | 8000 | 70,299,671.7 ns | 119,200.72 ns | 111,500.43 ns | 14.22 | 29,559 | 29,559 | 2,432 B | - |
❌ Using SIMD in .NET Framework via Vectors
In defense of the code - it doesn't have ShiftRightLogical128BitLane so my hacked implementation isn't likely optimal.
| Method | NumberOfCharacters | Mean | Error | StdDev | Op/s | CacheMisses/Op | BranchMispredictions/Op | Code Size | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| Quickenshtein | 10 | 591.7 ns | 0.63 ns | 0.59 ns | 1,689,997.326 | 0 | 0 | 2,846 B | - |
| Quickenshtein | 400 | 745,208.6 ns | 1,060.84 ns | 992.31 ns | 1,341.906 | 309 | 800 | 2,846 B | - |
| Quickenshtein | 8000 | 288,599,473.3 ns | 380,012.54 ns | 355,463.96 ns | 3.465 | 132,710 | 60,894 | 2,846 B | - |
I tested the branchless code again for huge text and there really isn't any performance benefit which I find kinda strange. While there are a lot less branch mispredictions, the time taken to process has stayed the same.
Before
| Method | NumberOfCharacters | Mean | Error | StdDev | Op/s | Code Size | CacheMisses/Op | BranchMispredictions/Op | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| Quickenshtein | 16000 | 252.2 ms | 2.83 ms | 2.65 ms | 3.965 | 4,211 B | 196,881 | 8,602,283 | - |
| Quickenshtein | 32000 | 934.9 ms | 11.87 ms | 11.10 ms | 1.070 | 4,211 B | 709,427 | 19,651,243 | - |
After
| Method | NumberOfCharacters | Mean | Error | StdDev | Op/s | Code Size | CacheMisses/Op | BranchMispredictions/Op | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| Quickenshtein | 16000 | 248.5 ms | 0.34 ms | 0.32 ms | 4.025 | 6,007 B | 177,948 | 484,147 | - |
| Quickenshtein | 32000 | 991.4 ms | 1.49 ms | 1.39 ms | 1.009 | 6,007 B | 776,602 | 1,241,361 | - |
For my own curiousity, I wanted to see the scale of how the performance degraded for larger strings:
| Method | NumberOfCharacters | Mean | Error | StdDev | Op/s | Code Size | BranchMispredictions/Op | CacheMisses/Op | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| Normal | 2000 | 7.459 ms | 0.0457 ms | 0.0428 ms | 134.073 | 4,211 B | 846,460 | 3,979 | - |
| Normal | 4000 | 22.967 ms | 0.1766 ms | 0.1652 ms | 43.541 | 4,211 B | 1,983,300 | 12,143 | - |
| Normal | 8000 | 72.782 ms | 0.9586 ms | 0.8967 ms | 13.740 | 4,211 B | 4,094,440 | 43,769 | - |
| Normal | 12000 | 149.794 ms | 2.5139 ms | 2.2285 ms | 6.676 | 4,211 B | 6,540,698 | 92,092 | - |
| Normal | 16000 | 251.899 ms | 3.3323 ms | 3.1170 ms | 3.970 | 4,211 B | 8,565,009 | 174,399 | - |
| Method | NumberOfCharacters | Mean | Error | StdDev | Op/s | Code Size | BranchMispredictions/Op | CacheMisses/Op | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| PR | 2000 | 4.018 ms | 0.0777 ms | 0.0983 ms | 248.874 | 6,007 B | 39,919 | 1,970 | - |
| PR | 4000 | 15.846 ms | 0.2252 ms | 0.2106 ms | 63.108 | 6,007 B | 103,287 | 7,415 | - |
| PR | 8000 | 62.540 ms | 0.3767 ms | 0.3524 ms | 15.990 | 6,007 B | 227,908 | 32,119 | - |
| PR | 12000 | 140.117 ms | 0.4266 ms | 0.3782 ms | 7.137 | 6,007 B | 362,223 | 80,896 | - |
| PR | 16000 | 248.454 ms | 0.4744 ms | 0.4438 ms | 4.025 | 6,007 B | 493,522 | 135,441 | - |
This translates to the PR being... 2000 = 1.86x faster 4000 = 1.45x faster 8000 = 1.16x faster 12000 = 1.07x faster 16000 = 1.01x faster
Not exactly ideal!