evmone icon indicating copy to clipboard operation
evmone copied to clipboard

Benchmark branchless binary search

Open yperbasis opened this issue 3 years ago • 6 comments

Results from my machine (Linux andrew-NUC8i7HNK 5.11.0-25-generic #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux):

------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
find_jumpdest_split_random<int, lower_bound>                            72.3 ns         72.3 ns      9799000
find_jumpdest_split_random<int, binary_search2>                         61.3 ns         61.3 ns     11133000
find_jumpdest_split_random<int, branchless_binary_search>               11.7 ns         11.7 ns     59149000

yperbasis avatar Aug 11 '21 12:08 yperbasis

Codecov Report

Merging #372 (657d549) into master (4faf801) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #372   +/-   ##
=======================================
  Coverage   99.76%   99.76%           
=======================================
  Files          31       31           
  Lines        4266     4276   +10     
=======================================
+ Hits         4256     4266   +10     
  Misses         10       10           
Flag Coverage Δ
consensus 91.73% <100.00%> (+0.05%) :arrow_up:
unittests 99.76% <100.00%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
lib/evmone/analysis.hpp 100.00% <100.00%> (ø)

codecov[bot] avatar Aug 11 '21 13:08 codecov[bot]

Unfortunately, I didn't find any performance gains when using branchless binary search in Silkworm execution of all mainnet blocks, which probably means that the benchmark doesn't match real usage patterns well enough. Still, IMHO it makes sense to have this variant of binary search in the benchmark.

yperbasis avatar Aug 17 '21 10:08 yperbasis

Haswell 4.4 GHz, clang.

find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/0/0_mean                             -0.2401         -0.2401             0             0             0             0
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/3/0_mean                             -0.4503         -0.4503             2             1             2             1
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/16/0_mean                            -0.4368         -0.4368             6             4             6             4
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/256/1_mean                           -0.5793         -0.5793            13             5            13             5
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/256/255_mean                         -0.5037         -0.5037            11             5            11             5
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/256/511_mean                         -0.5055         -0.5055            11             5            11             5
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/256/0_mean                           -0.6254         -0.6254            15             5            15             5
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/24576/1_mean                         -0.6681         -0.6681            26             9            26             9
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/24576/359_mean                       -0.6593         -0.6593            24             8            24             8
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/24576/0_mean                         -0.6910         -0.6910            28             9            28             9
find_jumpdest_split_random<int, [binary_search2 vs. branchless_binary_search]>_mean                          -0.2748         -0.2748            35            25            35            25
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/0/0_mean                        -0.0000         -0.0000             0             0             0             0
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/3/0_mean                        -0.4249         -0.4249             2             1             2             1
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/16/0_mean                       -0.4569         -0.4569             7             4             7             4
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/256/1_mean                      -0.5842         -0.5842            13             5            13             5
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/256/255_mean                    -0.5064         -0.5064            11             5            11             5
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/256/511_mean                    -0.5056         -0.5056            11             5            11             5
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/256/0_mean                      -0.6303         -0.6303            15             5            15             5
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/24576/1_mean                    -0.6666         -0.6666            27             9            27             9
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/24576/359_mean                  -0.6622         -0.6622            24             8            24             8
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/24576/0_mean                    -0.6945         -0.6945            28             9            28             9
find_jumpdest_split_random<uint16_t, [binary_search2 vs. branchless_binary_search]>_mean                     -0.3100         -0.3100            33            23            33            23
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/0/0_mean                             -0.5044         -0.5044             1             0             1             0
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/3/0_mean                             -0.4846         -0.4846             2             1             2             1
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/16/0_mean                            -0.4287         -0.4287             6             4             6             4
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/256/1_mean                           -0.6049         -0.6049            14             5            14             5
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/256/255_mean                         -0.5302         -0.5302            12             5            12             5
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/256/511_mean                         -0.5309         -0.5309            12             5            12             5
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/256/0_mean                           -0.6029         -0.6029            14             5            14             5
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/24576/1_mean                         -0.6740         -0.6740            27             9            27             9
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/24576/359_mean                       -0.6932         -0.6932            27             8            27             8
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/24576/0_mean                         -0.6777         -0.6777            27             9            27             9
find_jumpdest_split_random<int, [lower_bound vs. branchless_binary_search]>_mean                          -0.3269         -0.3269            38            25            38            25
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/0/0_mean                        -0.4991         -0.4991             0             0             0             0
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/3/0_mean                        -0.4899         -0.4899             3             1             3             1
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/16/0_mean                       -0.4582         -0.4583             7             4             7             4
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/256/1_mean                      -0.6160         -0.6160            14             5            14             5
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/256/255_mean                    -0.5482         -0.5482            12             5            12             5
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/256/511_mean                    -0.5479         -0.5479            12             5            12             5
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/256/0_mean                      -0.6127         -0.6127            14             5            14             5
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/24576/1_mean                    -0.6760         -0.6760            27             9            27             9
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/24576/359_mean                  -0.6996         -0.6996            27             8            27             8
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/24576/0_mean                    -0.6831         -0.6831            27             9            27             9
find_jumpdest_split_random<uint16_t, [lower_bound vs. branchless_binary_search]>_mean                     -0.2982         -0.2982            33            23            33            23

chfast avatar Aug 25 '21 12:08 chfast

This is GCC vs clang, where clang does not use cmov even with __builtin_unpredictable.

find_jumpdest_split<int, branchless_binary_search>/0/0_mean                             -0.6325         -0.6325             1             0             1             0
find_jumpdest_split<int, branchless_binary_search>/3/0_mean                             -0.7827         -0.7827             6             1             6             1
find_jumpdest_split<int, branchless_binary_search>/16/0_mean                            -0.6952         -0.6952            12             4            12             4
find_jumpdest_split<int, branchless_binary_search>/256/1_mean                           -0.7286         -0.7286            20             5            20             5
find_jumpdest_split<int, branchless_binary_search>/256/255_mean                         -0.7279         -0.7279            20             5            20             5
find_jumpdest_split<int, branchless_binary_search>/256/511_mean                         -0.7282         -0.7282            20             5            20             5
find_jumpdest_split<int, branchless_binary_search>/256/0_mean                           -0.7289         -0.7289            20             5            20             5
find_jumpdest_split<int, branchless_binary_search>/24576/1_mean                         -0.7287         -0.7287            32             9            32             9
find_jumpdest_split<int, branchless_binary_search>/24576/359_mean                       -0.7447         -0.7447            32             8            32             8
find_jumpdest_split<int, branchless_binary_search>/24576/0_mean                         -0.7331         -0.7331            32             9            32             9
find_jumpdest_split_random<int, branchless_binary_search>_mean                          +0.6278         +0.6278            16            25            16            25
find_jumpdest_split<uint16_t, branchless_binary_search>/0/0_mean                        -0.4990         -0.4990             0             0             0             0
find_jumpdest_split<uint16_t, branchless_binary_search>/3/0_mean                        -0.7777         -0.7777             6             1             6             1
find_jumpdest_split<uint16_t, branchless_binary_search>/16/0_mean                       -0.7205         -0.7205            13             4            13             4
find_jumpdest_split<uint16_t, branchless_binary_search>/256/1_mean                      -0.7543         -0.7543            22             5            22             5
find_jumpdest_split<uint16_t, branchless_binary_search>/256/255_mean                    -0.7549         -0.7549            22             5            22             5
find_jumpdest_split<uint16_t, branchless_binary_search>/256/511_mean                    -0.7542         -0.7542            22             5            22             5
find_jumpdest_split<uint16_t, branchless_binary_search>/256/0_mean                      -0.7534         -0.7534            22             5            22             5
find_jumpdest_split<uint16_t, branchless_binary_search>/24576/1_mean                    -0.7524         -0.7524            36             9            36             9
find_jumpdest_split<uint16_t, branchless_binary_search>/24576/359_mean                  -0.7705         -0.7705            36             8            36             8
find_jumpdest_split<uint16_t, branchless_binary_search>/24576/0_mean                    -0.7583         -0.7583            36             9            36             9
find_jumpdest_split_random<uint16_t, branchless_binary_search>_mean                     -0.1337         -0.1337            26            23            26            23

I think we rather should focus on random samples (and drop other benchmark cases).

chfast avatar Aug 25 '21 13:08 chfast

The https://github.com/ethereum/evmone/pull/385 adds a micro-benchmark "jump around" which may provide a hint how good this change is.

chfast avatar Sep 17 '21 13:09 chfast

This improves single benchmark case blake2b_shifts:

advanced/execute/main/blake2b_shifts/2805nulls_mean                    -0.1076         -0.1076          3679          3284          3679          3284
advanced/execute/main/blake2b_shifts/5610nulls_mean                    -0.1066         -0.1066          7355          6571          7355          6571
advanced/execute/main/blake2b_shifts/8415nulls_mean                    -0.1068         -0.1068         11008          9833         11008          9833
advanced/execute/main/blake2b_shifts/65536nulls_mean                   -0.1121         -0.1121         85775         76160         85775         76160

But is actually substantial slower for synthetic benchmark of ~4K "static" jumps.

advanced/execute/micro/jump_around_mean                       +0.3373         +0.3373           175           234           175           234

chfast avatar Sep 20 '21 14:09 chfast