evmone
evmone copied to clipboard
Benchmark branchless binary search
Results from my machine (Linux andrew-NUC8i7HNK 5.11.0-25-generic #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux):
------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------------------------
find_jumpdest_split_random<int, lower_bound> 72.3 ns 72.3 ns 9799000
find_jumpdest_split_random<int, binary_search2> 61.3 ns 61.3 ns 11133000
find_jumpdest_split_random<int, branchless_binary_search> 11.7 ns 11.7 ns 59149000
Codecov Report
Merging #372 (657d549) into master (4faf801) will increase coverage by
0.00%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## master #372 +/- ##
=======================================
Coverage 99.76% 99.76%
=======================================
Files 31 31
Lines 4266 4276 +10
=======================================
+ Hits 4256 4266 +10
Misses 10 10
Flag | Coverage Δ | |
---|---|---|
consensus | 91.73% <100.00%> (+0.05%) |
:arrow_up: |
unittests | 99.76% <100.00%> (+<0.01%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
Impacted Files | Coverage Δ | |
---|---|---|
lib/evmone/analysis.hpp | 100.00% <100.00%> (ø) |
Unfortunately, I didn't find any performance gains when using branchless binary search in Silkworm execution of all mainnet blocks, which probably means that the benchmark doesn't match real usage patterns well enough. Still, IMHO it makes sense to have this variant of binary search in the benchmark.
Haswell 4.4 GHz, clang.
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/0/0_mean -0.2401 -0.2401 0 0 0 0
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/3/0_mean -0.4503 -0.4503 2 1 2 1
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/16/0_mean -0.4368 -0.4368 6 4 6 4
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/256/1_mean -0.5793 -0.5793 13 5 13 5
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/256/255_mean -0.5037 -0.5037 11 5 11 5
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/256/511_mean -0.5055 -0.5055 11 5 11 5
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/256/0_mean -0.6254 -0.6254 15 5 15 5
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/24576/1_mean -0.6681 -0.6681 26 9 26 9
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/24576/359_mean -0.6593 -0.6593 24 8 24 8
find_jumpdest_split<int, [binary_search2 vs. branchless_binary_search]>/24576/0_mean -0.6910 -0.6910 28 9 28 9
find_jumpdest_split_random<int, [binary_search2 vs. branchless_binary_search]>_mean -0.2748 -0.2748 35 25 35 25
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/0/0_mean -0.0000 -0.0000 0 0 0 0
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/3/0_mean -0.4249 -0.4249 2 1 2 1
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/16/0_mean -0.4569 -0.4569 7 4 7 4
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/256/1_mean -0.5842 -0.5842 13 5 13 5
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/256/255_mean -0.5064 -0.5064 11 5 11 5
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/256/511_mean -0.5056 -0.5056 11 5 11 5
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/256/0_mean -0.6303 -0.6303 15 5 15 5
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/24576/1_mean -0.6666 -0.6666 27 9 27 9
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/24576/359_mean -0.6622 -0.6622 24 8 24 8
find_jumpdest_split<uint16_t, [binary_search2 vs. branchless_binary_search]>/24576/0_mean -0.6945 -0.6945 28 9 28 9
find_jumpdest_split_random<uint16_t, [binary_search2 vs. branchless_binary_search]>_mean -0.3100 -0.3100 33 23 33 23
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/0/0_mean -0.5044 -0.5044 1 0 1 0
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/3/0_mean -0.4846 -0.4846 2 1 2 1
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/16/0_mean -0.4287 -0.4287 6 4 6 4
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/256/1_mean -0.6049 -0.6049 14 5 14 5
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/256/255_mean -0.5302 -0.5302 12 5 12 5
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/256/511_mean -0.5309 -0.5309 12 5 12 5
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/256/0_mean -0.6029 -0.6029 14 5 14 5
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/24576/1_mean -0.6740 -0.6740 27 9 27 9
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/24576/359_mean -0.6932 -0.6932 27 8 27 8
find_jumpdest_split<int, [lower_bound vs. branchless_binary_search]>/24576/0_mean -0.6777 -0.6777 27 9 27 9
find_jumpdest_split_random<int, [lower_bound vs. branchless_binary_search]>_mean -0.3269 -0.3269 38 25 38 25
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/0/0_mean -0.4991 -0.4991 0 0 0 0
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/3/0_mean -0.4899 -0.4899 3 1 3 1
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/16/0_mean -0.4582 -0.4583 7 4 7 4
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/256/1_mean -0.6160 -0.6160 14 5 14 5
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/256/255_mean -0.5482 -0.5482 12 5 12 5
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/256/511_mean -0.5479 -0.5479 12 5 12 5
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/256/0_mean -0.6127 -0.6127 14 5 14 5
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/24576/1_mean -0.6760 -0.6760 27 9 27 9
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/24576/359_mean -0.6996 -0.6996 27 8 27 8
find_jumpdest_split<uint16_t, [lower_bound vs. branchless_binary_search]>/24576/0_mean -0.6831 -0.6831 27 9 27 9
find_jumpdest_split_random<uint16_t, [lower_bound vs. branchless_binary_search]>_mean -0.2982 -0.2982 33 23 33 23
This is GCC vs clang, where clang does not use cmov
even with __builtin_unpredictable
.
find_jumpdest_split<int, branchless_binary_search>/0/0_mean -0.6325 -0.6325 1 0 1 0
find_jumpdest_split<int, branchless_binary_search>/3/0_mean -0.7827 -0.7827 6 1 6 1
find_jumpdest_split<int, branchless_binary_search>/16/0_mean -0.6952 -0.6952 12 4 12 4
find_jumpdest_split<int, branchless_binary_search>/256/1_mean -0.7286 -0.7286 20 5 20 5
find_jumpdest_split<int, branchless_binary_search>/256/255_mean -0.7279 -0.7279 20 5 20 5
find_jumpdest_split<int, branchless_binary_search>/256/511_mean -0.7282 -0.7282 20 5 20 5
find_jumpdest_split<int, branchless_binary_search>/256/0_mean -0.7289 -0.7289 20 5 20 5
find_jumpdest_split<int, branchless_binary_search>/24576/1_mean -0.7287 -0.7287 32 9 32 9
find_jumpdest_split<int, branchless_binary_search>/24576/359_mean -0.7447 -0.7447 32 8 32 8
find_jumpdest_split<int, branchless_binary_search>/24576/0_mean -0.7331 -0.7331 32 9 32 9
find_jumpdest_split_random<int, branchless_binary_search>_mean +0.6278 +0.6278 16 25 16 25
find_jumpdest_split<uint16_t, branchless_binary_search>/0/0_mean -0.4990 -0.4990 0 0 0 0
find_jumpdest_split<uint16_t, branchless_binary_search>/3/0_mean -0.7777 -0.7777 6 1 6 1
find_jumpdest_split<uint16_t, branchless_binary_search>/16/0_mean -0.7205 -0.7205 13 4 13 4
find_jumpdest_split<uint16_t, branchless_binary_search>/256/1_mean -0.7543 -0.7543 22 5 22 5
find_jumpdest_split<uint16_t, branchless_binary_search>/256/255_mean -0.7549 -0.7549 22 5 22 5
find_jumpdest_split<uint16_t, branchless_binary_search>/256/511_mean -0.7542 -0.7542 22 5 22 5
find_jumpdest_split<uint16_t, branchless_binary_search>/256/0_mean -0.7534 -0.7534 22 5 22 5
find_jumpdest_split<uint16_t, branchless_binary_search>/24576/1_mean -0.7524 -0.7524 36 9 36 9
find_jumpdest_split<uint16_t, branchless_binary_search>/24576/359_mean -0.7705 -0.7705 36 8 36 8
find_jumpdest_split<uint16_t, branchless_binary_search>/24576/0_mean -0.7583 -0.7583 36 9 36 9
find_jumpdest_split_random<uint16_t, branchless_binary_search>_mean -0.1337 -0.1337 26 23 26 23
I think we rather should focus on random samples (and drop other benchmark cases).
The https://github.com/ethereum/evmone/pull/385 adds a micro-benchmark "jump around" which may provide a hint how good this change is.
This improves single benchmark case blake2b_shifts:
advanced/execute/main/blake2b_shifts/2805nulls_mean -0.1076 -0.1076 3679 3284 3679 3284
advanced/execute/main/blake2b_shifts/5610nulls_mean -0.1066 -0.1066 7355 6571 7355 6571
advanced/execute/main/blake2b_shifts/8415nulls_mean -0.1068 -0.1068 11008 9833 11008 9833
advanced/execute/main/blake2b_shifts/65536nulls_mean -0.1121 -0.1121 85775 76160 85775 76160
But is actually substantial slower for synthetic benchmark of ~4K "static" jumps.
advanced/execute/micro/jump_around_mean +0.3373 +0.3373 175 234 175 234