Quickenshtein icon indicating copy to clipboard operation
Quickenshtein copied to clipboard

Added branchless Math.Min calculation

Open Turnerj opened this issue 5 years ago • 6 comments
trafficstars

Based on this tweet: https://mobile.twitter.com/badamczewski01/status/1273759277707132928

Results

|                 Method |       Runtime | NumberOfCharacters |               Mean |               Error |             StdDev | Ratio | Speedup | Worthiness | RatioSD | Code Size |      Gen 0 |      Gen 1 |     Gen 2 |   Allocated |
|----------------------- |-------------- |------------------- |-------------------:|--------------------:|-------------------:|------:|--------:|-----------:|--------:|----------:|-----------:|-----------:|----------:|------------:|
|               Baseline |    .NET 4.7.2 |                  0 |         220.189 ns |          24.1378 ns |          1.3231 ns | 1.000 |    1.00 |       1.00 |    0.00 |    2557 B |     0.1018 |          - |         - |       321 B |
|          Quickenshtein |    .NET 4.7.2 |                  0 |           2.583 ns |           0.1834 ns |          0.0101 ns | 0.012 |   85.24 |   1,095.27 |    0.00 |     199 B |          - |          - |         - |           - |
| Quickenshtein_Threaded |    .NET 4.7.2 |                  0 |           1.790 ns |           0.5424 ns |          0.0297 ns | 0.008 |  123.05 |   1,757.70 |    0.00 |     179 B |          - |          - |         - |           - |
|           Fastenshtein |    .NET 4.7.2 |                  0 |           2.489 ns |           1.4295 ns |          0.0784 ns | 0.011 |   88.51 |     689.97 |    0.00 |     328 B |          - |          - |         - |           - |
|               Baseline | .NET Core 3.0 |                  0 |         112.910 ns |           4.6300 ns |          0.2538 ns | 0.513 |    1.95 |       2.57 |    0.00 |    1938 B |     0.0764 |          - |         - |       240 B |
|          Quickenshtein | .NET Core 3.0 |                  0 |           4.067 ns |           0.1389 ns |          0.0076 ns | 0.018 |   54.14 |     538.69 |    0.00 |     257 B |          - |          - |         - |           - |
| Quickenshtein_Threaded | .NET Core 3.0 |                  0 |           2.127 ns |           0.7267 ns |          0.0398 ns | 0.010 |  103.53 |   1,423.33 |    0.00 |     186 B |          - |          - |         - |           - |
|           Fastenshtein | .NET Core 3.0 |                  0 |           2.396 ns |           0.6282 ns |          0.0344 ns | 0.011 |   91.92 |     714.39 |    0.00 |     329 B |          - |          - |         - |           - |
|                        |               |                    |                    |                     |                    |       |         |            |         |           |            |            |           |             |
|               Baseline |    .NET 4.7.2 |                 10 |       1,252.135 ns |         120.7927 ns |          6.6211 ns |  1.00 |    1.00 |       1.00 |    0.00 |    2557 B |     0.4463 |          - |         - |      1404 B |
|          Quickenshtein |    .NET 4.7.2 |                 10 |         260.836 ns |          16.0352 ns |          0.8789 ns |  0.21 |    4.80 |       2.23 |    0.00 |    5501 B |          - |          - |         - |           - |
| Quickenshtein_Threaded |    .NET 4.7.2 |                 10 |         262.059 ns |          14.9552 ns |          0.8197 ns |  0.21 |    4.78 |       2.08 |    0.00 |    5878 B |          - |          - |         - |           - |
|           Fastenshtein |    .NET 4.7.2 |                 10 |         258.354 ns |          38.7140 ns |          2.1220 ns |  0.21 |    4.85 |      37.79 |    0.00 |     328 B |     0.0200 |          - |         - |        64 B |
|               Baseline | .NET Core 3.0 |                 10 |         749.766 ns |          34.1184 ns |          1.8701 ns |  0.60 |    1.67 |       2.20 |    0.00 |    1938 B |     0.3443 |          - |         - |      1080 B |
|          Quickenshtein | .NET Core 3.0 |                 10 |         369.429 ns |          22.4171 ns |          1.2288 ns |  0.30 |    3.39 |       1.91 |    0.00 |    4536 B |          - |          - |         - |           - |
| Quickenshtein_Threaded | .NET Core 3.0 |                 10 |         352.946 ns |          12.0771 ns |          0.6620 ns |  0.28 |    3.55 |       1.70 |    0.00 |    5335 B |          - |          - |         - |           - |
|           Fastenshtein | .NET Core 3.0 |                 10 |         237.126 ns |          23.6289 ns |          1.2952 ns |  0.19 |    5.28 |      41.04 |    0.00 |     329 B |     0.0200 |          - |         - |        64 B |
|                        |               |                    |                    |                     |                    |       |         |            |         |           |            |            |           |             |
|               Baseline |    .NET 4.7.2 |                400 |     889,130.599 ns |      66,393.8252 ns |      3,639.2696 ns |  1.00 |    1.00 |       1.00 |    0.00 |    2557 B |   142.5781 |    59.5703 |         - |    668278 B |
|          Quickenshtein |    .NET 4.7.2 |                400 |     250,343.913 ns |      21,465.3674 ns |      1,176.5892 ns |  0.28 |    3.55 |       1.65 |    0.00 |    5501 B |          - |          - |         - |           - |
| Quickenshtein_Threaded |    .NET 4.7.2 |                400 |     250,369.938 ns |       8,883.9636 ns |        486.9600 ns |  0.28 |    3.55 |       1.54 |    0.00 |    5878 B |          - |          - |         - |           - |
|           Fastenshtein |    .NET 4.7.2 |                400 |     438,383.122 ns |      39,211.2402 ns |      2,149.3004 ns |  0.49 |    2.03 |      15.81 |    0.00 |     328 B |     0.4883 |          - |         - |      1633 B |
|               Baseline | .NET Core 3.0 |                400 |     865,893.945 ns |      72,297.3258 ns |      3,962.8604 ns |  0.97 |    1.03 |       1.35 |    0.00 |    1938 B |   121.0938 |    60.5469 |         - |    657840 B |
|          Quickenshtein | .NET Core 3.0 |                400 |      78,144.973 ns |       6,586.3546 ns |        361.0203 ns |  0.09 |   11.38 |       6.41 |    0.00 |    4536 B |          - |          - |         - |           - |
| Quickenshtein_Threaded | .NET Core 3.0 |                400 |      78,146.643 ns |      11,238.2866 ns |        616.0084 ns |  0.09 |   11.38 |       5.06 |    0.00 |    5754 B |          - |          - |         - |           - |
|           Fastenshtein | .NET Core 3.0 |                400 |     357,707.454 ns |      18,138.3426 ns |        994.2238 ns |  0.40 |    2.49 |      19.32 |    0.00 |     329 B |     0.4883 |          - |         - |      1624 B |
|                        |               |                    |                    |                     |                    |       |         |            |         |           |            |            |           |             |
|               Baseline |    .NET 4.7.2 |               8000 | 393,884,266.667 ns |  49,006,820.6381 ns |  2,686,229.2053 ns |  1.00 |    1.00 |       1.00 |    0.00 |    2557 B | 44000.0000 | 23000.0000 | 4000.0000 | 256683568 B |
|          Quickenshtein |    .NET 4.7.2 |               8000 | 100,370,120.000 ns |  13,370,648.7190 ns |    732,890.3735 ns |  0.25 |    3.92 |       1.82 |    0.00 |    5501 B |          - |          - |         - |           - |
| Quickenshtein_Threaded |    .NET 4.7.2 |               8000 |  26,717,654.167 ns |  36,141,920.5933 ns |  1,981,060.6232 ns |  0.07 |   14.80 |       6.83 |    0.01 |    5538 B |          - |          - |         - |      1024 B |
|           Fastenshtein |    .NET 4.7.2 |               8000 | 159,648,433.333 ns |  28,111,703.1594 ns |  1,540,897.3089 ns |  0.41 |    2.47 |      19.23 |    0.00 |     328 B |          - |          - |         - |     32048 B |
|               Baseline | .NET Core 3.0 |               8000 | 433,869,533.333 ns | 248,847,936.6442 ns | 13,640,195.1072 ns |  1.10 |    0.91 |       1.28 |    0.04 |    1812 B | 44000.0000 | 23000.0000 | 4000.0000 | 256352240 B |
|          Quickenshtein | .NET Core 3.0 |               8000 |  24,698,026.042 ns |   2,775,654.4912 ns |    152,142.9887 ns |  0.06 |   15.95 |       8.99 |    0.00 |    4536 B |          - |          - |         - |        14 B |
| Quickenshtein_Threaded | .NET Core 3.0 |               8000 |  24,912,576.042 ns |   2,509,390.7247 ns |    137,548.1732 ns |  0.06 |   15.81 |       7.05 |    0.00 |    5731 B |          - |          - |         - |           - |
|           Fastenshtein | .NET Core 3.0 |               8000 | 137,326,183.333 ns |  18,542,076.7650 ns |  1,016,353.7949 ns |  0.35 |    2.87 |      22.29 |    0.00 |     329 B |          - |          - |         - |     32024 B |

Compared to Master .NET Framework - 10 Characters: 20% slower .NET Framework - 400 Characters: 2.7x faster .NET Framework - 8000 Characters: 1.27x faster .NET Framework (Threaded) - 8000 Characters: 1.53x faster

Note: When running the "Huge Text Benchmark", performance tanks - not sure why and may not even relate to the change:

|                      Method |        Job |       Runtime | NumberOfCharacters |        Mean |     Error |    StdDev | Ratio | Speedup | Worthiness | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------------------- |----------- |-------------- |------------------- |------------:|----------:|----------:|------:|--------:|-----------:|--------:|----------:|------:|------:|------:|----------:|
|   Quickenshtein_NoThreading | Job-VQPBMC |    .NET 4.7.2 |               8192 |   132.62 ms |  0.713 ms |  0.596 ms |  1.00 |    1.00 |       1.00 |    0.00 |    5501 B |     - |     - |     - |         - |
|   Quickenshtein_NoThreading | Job-LKMQCA | .NET Core 3.0 |               8192 |    22.46 ms |  0.168 ms |  0.157 ms |  0.17 |    5.90 |       7.15 |    0.00 |    4536 B |     - |     - |     - |         - |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
| Quickenshtein_WithThreading | Job-VQPBMC |    .NET 4.7.2 |               8192 |    31.17 ms |  0.621 ms |  1.428 ms |  1.00 |    1.00 |       1.00 |    0.00 |    5523 B |     - |     - |     - |    1024 B |
| Quickenshtein_WithThreading | Job-LKMQCA | .NET Core 3.0 |               8192 |    17.73 ms |  0.208 ms |  0.194 ms |  0.58 |    1.73 |       0.90 |    0.02 |   10611 B |     - |     - |     - |     800 B |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
|                Fastenshtein | Job-VQPBMC |    .NET 4.7.2 |               8192 |   167.37 ms |  1.858 ms |  1.552 ms |  1.00 |    1.00 |       1.00 |    0.00 |     328 B |     - |     - |     - |   32816 B |
|                Fastenshtein | Job-LKMQCA | .NET Core 3.0 |               8192 |   166.87 ms |  0.518 ms |  0.433 ms |  1.00 |    1.00 |       1.00 |    0.01 |     329 B |     - |     - |     - |   33126 B |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
|   Quickenshtein_NoThreading | Job-VQPBMC |    .NET 4.7.2 |              32768 | 2,146.58 ms | 10.799 ms |  9.573 ms |  1.00 |    1.00 |       1.00 |    0.00 |    5501 B |     - |     - |     - |         - |
|   Quickenshtein_NoThreading | Job-LKMQCA | .NET Core 3.0 |              32768 |   365.22 ms |  3.426 ms |  3.205 ms |  0.17 |    5.89 |      11.04 |    0.00 |    2933 B |     - |     - |     - |    1344 B |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
| Quickenshtein_WithThreading | Job-VQPBMC |    .NET 4.7.2 |              32768 |   449.54 ms |  3.217 ms |  2.512 ms |  1.00 |    1.00 |       1.00 |    0.00 |    5523 B |     - |     - |     - |    8192 B |
| Quickenshtein_WithThreading | Job-LKMQCA | .NET Core 3.0 |              32768 |   273.89 ms |  1.678 ms |  1.487 ms |  0.61 |    1.64 |       0.86 |    0.00 |   10577 B |     - |     - |     - |     800 B |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
|                Fastenshtein | Job-VQPBMC |    .NET 4.7.2 |              32768 | 2,703.89 ms | 27.841 ms | 24.680 ms |  1.00 |    1.00 |       1.00 |    0.00 |     328 B |     - |     - |     - |  131096 B |
|                Fastenshtein | Job-LKMQCA | .NET Core 3.0 |              32768 | 2,478.46 ms | 35.907 ms | 33.588 ms |  0.92 |    1.09 |       1.05 |    0.02 |     340 B |     - |     - |     - |  131096 B |

Turnerj avatar Jun 19 '20 07:06 Turnerj

Giving this another crack now...

.NET Framework Baseline

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1348 (21H1/May2021Update)
Intel Core i5-6600K CPU 3.50GHz (Skylake), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.100
  [Host]    : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
  Framework : .NET Framework 4.8 (4.8.4420.0), X64 RyuJIT

Job=Framework  Runtime=.NET Framework 4.7.2
Method NumberOfCharacters Mean Error StdDev Op/s CacheMisses/Op BranchMispredictions/Op Code Size Allocated
Quickenshtein 10 190.9 ns 0.28 ns 0.26 ns 5,237,715.05 0 0 4,211 B -
Quickenshtein 400 473,919.8 ns 743.11 ns 695.10 ns 2,110.06 235 70,856 4,211 B -
Quickenshtein 8000 73,353,386.7 ns 872,388.31 ns 816,032.56 ns 13.63 39,673 4,212,053 4,211 B -

Turnerj avatar Nov 30 '21 12:11 Turnerj

With Branchless Min

Method NumberOfCharacters Mean Error StdDev Op/s CacheMisses/Op BranchMispredictions/Op Code Size Allocated
Quickenshtein 10 219.2 ns 0.23 ns 0.20 ns 4,562,704.02 0 0 6,007 B -
Quickenshtein 400 159,471.9 ns 2,843.29 ns 2,659.62 ns 6,270.70 72 1,251 6,007 B -
Quickenshtein 8000 62,462,987.4 ns 279,845.42 ns 261,767.58 ns 16.01 31,858 202,191 6,007 B -

Turnerj avatar Nov 30 '21 12:11 Turnerj

Branchless Min but without Outer Loop Unrolling

Method NumberOfCharacters Mean Error StdDev Op/s BranchMispredictions/Op CacheMisses/Op Code Size Allocated
Quickenshtein 10 212.6 ns 0.28 ns 0.26 ns 4,704,467.28 0 0 2,432 B -
Quickenshtein 400 178,371.5 ns 216.77 ns 202.77 ns 5,606.28 717 64 2,432 B -
Quickenshtein 8000 70,299,671.7 ns 119,200.72 ns 111,500.43 ns 14.22 29,559 29,559 2,432 B -

Turnerj avatar Nov 30 '21 12:11 Turnerj

Using SIMD in .NET Framework via Vectors

In defense of the code - it doesn't have ShiftRightLogical128BitLane so my hacked implementation isn't likely optimal.

Method NumberOfCharacters Mean Error StdDev Op/s CacheMisses/Op BranchMispredictions/Op Code Size Allocated
Quickenshtein 10 591.7 ns 0.63 ns 0.59 ns 1,689,997.326 0 0 2,846 B -
Quickenshtein 400 745,208.6 ns 1,060.84 ns 992.31 ns 1,341.906 309 800 2,846 B -
Quickenshtein 8000 288,599,473.3 ns 380,012.54 ns 355,463.96 ns 3.465 132,710 60,894 2,846 B -

Turnerj avatar Dec 02 '21 12:12 Turnerj

I tested the branchless code again for huge text and there really isn't any performance benefit which I find kinda strange. While there are a lot less branch mispredictions, the time taken to process has stayed the same.

Before

Method NumberOfCharacters Mean Error StdDev Op/s Code Size CacheMisses/Op BranchMispredictions/Op Allocated
Quickenshtein 16000 252.2 ms 2.83 ms 2.65 ms 3.965 4,211 B 196,881 8,602,283 -
Quickenshtein 32000 934.9 ms 11.87 ms 11.10 ms 1.070 4,211 B 709,427 19,651,243 -

After

Method NumberOfCharacters Mean Error StdDev Op/s Code Size CacheMisses/Op BranchMispredictions/Op Allocated
Quickenshtein 16000 248.5 ms 0.34 ms 0.32 ms 4.025 6,007 B 177,948 484,147 -
Quickenshtein 32000 991.4 ms 1.49 ms 1.39 ms 1.009 6,007 B 776,602 1,241,361 -

Turnerj avatar Dec 02 '21 12:12 Turnerj

For my own curiousity, I wanted to see the scale of how the performance degraded for larger strings:

Method NumberOfCharacters Mean Error StdDev Op/s Code Size BranchMispredictions/Op CacheMisses/Op Allocated
Normal 2000 7.459 ms 0.0457 ms 0.0428 ms 134.073 4,211 B 846,460 3,979 -
Normal 4000 22.967 ms 0.1766 ms 0.1652 ms 43.541 4,211 B 1,983,300 12,143 -
Normal 8000 72.782 ms 0.9586 ms 0.8967 ms 13.740 4,211 B 4,094,440 43,769 -
Normal 12000 149.794 ms 2.5139 ms 2.2285 ms 6.676 4,211 B 6,540,698 92,092 -
Normal 16000 251.899 ms 3.3323 ms 3.1170 ms 3.970 4,211 B 8,565,009 174,399 -
Method NumberOfCharacters Mean Error StdDev Op/s Code Size BranchMispredictions/Op CacheMisses/Op Allocated
PR 2000 4.018 ms 0.0777 ms 0.0983 ms 248.874 6,007 B 39,919 1,970 -
PR 4000 15.846 ms 0.2252 ms 0.2106 ms 63.108 6,007 B 103,287 7,415 -
PR 8000 62.540 ms 0.3767 ms 0.3524 ms 15.990 6,007 B 227,908 32,119 -
PR 12000 140.117 ms 0.4266 ms 0.3782 ms 7.137 6,007 B 362,223 80,896 -
PR 16000 248.454 ms 0.4744 ms 0.4438 ms 4.025 6,007 B 493,522 135,441 -

This translates to the PR being... 2000 = 1.86x faster 4000 = 1.45x faster 8000 = 1.16x faster 12000 = 1.07x faster 16000 = 1.01x faster

Not exactly ideal!

Turnerj avatar Dec 02 '21 12:12 Turnerj