intx
streamline/speed up addc/subc by using in/out carry parameter
I believe this version is significantly faster!
I benchmarked with GCC 11. In general, we would need to split this into smaller changes (I can do this myself) for better analysis. Some quick takeaways:
- the comparison optimization looks promising; if we can show it is always faster than subtraction, this can land,
- multiplication looks faster,
- the basic add/sub got slower; I plan to inspect the assembly to see what happened.
We should also check Clang. It can optimize add/sub perfectly, so as long as that stays the case we can go ahead with changes that make things easier for GCC.
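For context, the change in question can be sketched roughly like this: an add/sub primitive that threads the carry/borrow through an in/out parameter, so multi-word operations become a simple carry chain. The names and signatures below are illustrative only, not intx's actual API.

```cpp
#include <cstdint>

// Hypothetical sketch of add/sub-with-carry primitives using an in/out
// carry parameter, in the spirit of the PR title (not intx's real code).
inline uint64_t addc(uint64_t x, uint64_t y, bool& carry) noexcept
{
    const uint64_t s = x + y + carry;
    // Carry out iff the true sum x + y + carry_in overflowed 64 bits.
    carry = s < x || (s == x && carry);
    return s;
}

inline uint64_t subc(uint64_t x, uint64_t y, bool& borrow) noexcept
{
    const uint64_t d = x - y - borrow;
    // Borrow out iff x < y + borrow_in.
    borrow = x < y || (x == y && borrow);
    return d;
}

// Multi-word addition then reduces to a carry chain over the limbs
// (little-endian limb order assumed here).
inline void add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) noexcept
{
    bool carry = false;
    for (int i = 0; i < 4; ++i)
        r[i] = addc(a[i], b[i], carry);
}
```

The hope behind such a shape is that the compiler recognizes the chain and emits adc/sbb sequences instead of materializing the carry flag in a register between limbs.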
benchmark  cpu_diff  wall_diff  cpu_master[ns]  cpu_new[ns]  wall_master[ns]  wall_new[ns]
clz<uint32_t, intx::clz>_mean -0.0031 -0.0031 0 0 0 0
clz<uint32_t, intx::clz_generic>_mean +0.0005 +0.0005 1 1 1 1
clz<uint64_t, intx::clz>_mean +0.3459 +0.3459 0 1 0 1
clz<uint64_t, intx::clz_generic>_mean +0.0008 +0.0008 3 3 3 3
div_normalize<internal::normalize>_mean -0.0001 -0.0001 11 11 11 11
reciprocal<uint64_t, neg>_mean +0.0014 +0.0014 0 0 0 0
reciprocal<uint64_t, reciprocal_naive>_mean +0.0012 +0.0012 19 19 19 19
reciprocal<uint64_t, reciprocal_2by1>_mean -0.0002 -0.0002 4 4 4 4
reciprocal<uint64_t, reciprocal_2by1_noinline>_mean +0.0010 +0.0010 4 4 4 4
reciprocal<uint128, reciprocal_3by2>_mean +0.0012 +0.0012 7 7 7 7
reciprocal<uint128, reciprocal_3by2_noinline>_mean +0.0072 +0.0072 6 6 6 6
udiv64<nop>_mean +0.0018 +0.0018 1 1 1 1
udiv64<udiv_by_reciprocal>_mean +0.0015 +0.0015 7 7 7 7
udiv64<udiv_native>_mean -0.0002 -0.0002 7 7 7 7
udiv64<soft_div_unr>_mean +0.0137 +0.0137 7 7 7 7
udiv64<soft_div_unr_unrolled>_mean +0.0190 +0.0190 8 8 8 8
udiv128<gcc_>/0_mean -0.0005 -0.0005 25 25 25 25
udiv128<gcc_>/1_mean -0.0017 -0.0017 24 24 24 24
udiv128<gcc_>/2_mean -0.0007 -0.0007 29 29 29 29
udiv128<gcc_>/3_mean +0.0027 +0.0027 21 21 21 21
udiv128<intx_>/0_mean +0.0047 +0.0047 16 16 16 16
udiv128<intx_>/1_mean +0.0035 +0.0035 16 16 16 16
udiv128<intx_>/2_mean +0.0023 +0.0023 13 13 13 13
udiv128<intx_>/3_mean -0.0006 -0.0006 13 13 13 13
udiv128<gmp_>/0_mean +0.0010 +0.0010 24 24 24 24
udiv128<gmp_>/1_mean +0.0046 +0.0045 25 25 25 25
udiv128<gmp_>/2_mean -0.0007 -0.0007 18 18 18 18
udiv128<gmp_>/3_mean -0.0004 -0.0004 14 14 14 14
umul128<intx::uint128, intx::umul>_mean -0.0021 -0.0022 0 0 0 0
div<uint256, udivrem>/64_mean +0.0221 +0.0221 22 23 22 23
div<uint256, udivrem>/128_mean -0.0029 -0.0029 28 28 28 28
div<uint256, udivrem>/192_mean -0.0097 -0.0097 33 33 33 33
div<uint256, udivrem>/256_mean -0.0031 -0.0031 29 29 29 29
div<uint256, gmp::udivrem>/64_mean +0.0238 +0.0238 24 24 24 24
div<uint256, gmp::udivrem>/128_mean +0.0119 +0.0119 31 32 31 32
div<uint256, gmp::udivrem>/192_mean +0.0037 +0.0037 52 52 52 52
div<uint256, gmp::udivrem>/256_mean +0.0109 +0.0109 52 52 52 52
div<uint512, udivrem>/64_mean +0.0008 +0.0008 37 37 37 37
div<uint512, udivrem>/128_mean -0.0010 -0.0010 47 47 47 47
div<uint512, udivrem>/192_mean -0.0039 -0.0039 67 67 67 67
div<uint512, udivrem>/256_mean +0.0019 +0.0019 65 65 65 65
div<uint512, gmp::udivrem>/64_mean +0.0206 +0.0206 38 38 38 38
div<uint512, gmp::udivrem>/128_mean -0.0005 -0.0005 54 54 54 54
div<uint512, gmp::udivrem>/192_mean -0.0024 -0.0024 78 78 78 78
div<uint512, gmp::udivrem>/256_mean -0.0073 -0.0073 82 81 82 81
mod<addmod>/64_mean -0.0037 -0.0037 29 29 29 29
mod<addmod>/128_mean -0.0069 -0.0069 32 32 32 32
mod<addmod>/192_mean -0.0064 -0.0064 39 38 39 38
mod<addmod>/256_mean -0.0048 -0.0047 35 35 35 35
mod<addmod_public>/64_mean -0.0019 -0.0019 32 32 32 32
mod<addmod_public>/128_mean +0.0039 +0.0038 34 34 34 34
mod<addmod_public>/192_mean +0.0016 +0.0016 40 40 40 40
mod<addmod_public>/256_mean +0.0032 +0.0032 35 35 35 35
mod<addmod_simple>/64_mean +0.0078 +0.0078 31 31 31 31
mod<addmod_simple>/128_mean +0.0012 +0.0012 33 33 33 33
mod<addmod_simple>/192_mean -0.0048 -0.0048 39 39 39 39
mod<addmod_simple>/256_mean +0.0075 +0.0075 34 34 34 34
mod<addmod_prenormalize>/64_mean -0.0509 -0.0509 60 57 60 57
mod<addmod_prenormalize>/128_mean -0.0022 -0.0022 63 63 63 63
mod<addmod_prenormalize>/192_mean -0.0193 -0.0193 74 73 74 73
mod<addmod_prenormalize>/256_mean -0.0155 -0.0155 57 56 57 56
mod<addmod_daosvik_v1>/64_mean -0.0062 -0.0062 32 31 32 31
mod<addmod_daosvik_v1>/128_mean +0.0032 +0.0032 33 33 33 33
mod<addmod_daosvik_v1>/192_mean -0.0040 -0.0040 39 39 39 39
mod<addmod_daosvik_v1>/256_mean +0.0014 +0.0014 34 34 34 34
mod<addmod_daosvik_v2>/64_mean +0.0077 +0.0077 31 32 31 32
mod<addmod_daosvik_v2>/128_mean -0.0024 -0.0024 33 33 33 33
mod<addmod_daosvik_v2>/192_mean +0.0014 +0.0014 39 39 39 39
mod<addmod_daosvik_v2>/256_mean -0.0003 -0.0003 34 34 34 34
mod<mulmod>/64_mean -0.0283 -0.0283 51 50 51 50
mod<mulmod>/128_mean -0.0324 -0.0324 61 59 61 59
mod<mulmod>/192_mean -0.0067 -0.0067 79 78 79 78
mod<mulmod>/256_mean -0.0123 -0.0123 78 77 78 77
ecmod<addmod_public>_mean +0.2047 +0.2047 9 11 9 11
ecmod<addmod_simple>_mean +0.0066 +0.0066 25 25 25 25
ecmod<addmod_prenormalize>_mean +0.3527 +0.3527 8 10 8 10
ecmod<addmod_daosvik_v1>_mean -0.3898 -0.3899 33 20 33 20
ecmod<addmod_daosvik_v2>_mean +0.3402 +0.3402 8 11 8 11
ecmod<mulmod>_mean -0.0154 -0.0154 75 73 75 73
binop<uint256, uint256, add>_mean +1.2626 +1.2626 2 5 2 5
binop<uint256, uint256, inline_add>_mean +0.0012 +0.0012 2 2 2 2
binop<uint256, uint256, sub>_mean +1.0722 +1.0722 3 5 3 5
binop<uint256, uint256, inline_sub>_mean +0.0018 +0.0018 2 2 2 2
binop<uint256, uint256, public_mul>_mean +0.0194 +0.0194 6 6 6 6
binop<uint256, uint256, gmp::mul>_mean -0.0003 -0.0003 14 14 14 14
binop<uint512, uint256, umul_>_mean -0.1252 -0.1252 14 12 14 12
binop<uint512, uint256, gmp::mul_full>_mean +0.0012 +0.0012 15 15 15 15
binop<uint512, uint512, add>_mean +0.4870 +0.4869 4 6 4 6
binop<uint512, uint512, inline_add>_mean +0.0043 +0.0043 4 4 4 4
binop<uint512, uint512, sub>_mean +0.4274 +0.4274 4 6 4 6
binop<uint512, uint512, inline_sub>_mean +0.0037 +0.0037 4 4 4 4
binop<uint512, uint512, public_mul>_mean -0.1096 -0.1096 35 31 35 31
binop<uint512, uint512, gmp::mul>_mean -0.0054 -0.0054 52 52 52 52
shift<uint256, uint256, shl_public>/-1_mean +0.0120 +0.0120 3 3 3 3
shift<uint256, uint256, shl_public>/0_mean -0.0199 -0.0199 4 4 4 4
shift<uint256, uint256, shl_public>/1_mean -0.0525 -0.0525 4 4 4 4
shift<uint256, uint256, shl_public>/2_mean +0.0003 +0.0002 3 3 3 3
shift<uint256, uint256, shl_public>/3_mean +0.0689 +0.0689 2 2 2 2
shift<uint256, uint256, shl_halves>/-1_mean -0.0404 -0.0404 3 3 3 3
shift<uint256, uint256, shl_halves>/0_mean -0.0309 -0.0309 4 4 4 4
shift<uint256, uint256, shl_halves>/1_mean -0.0002 -0.0002 4 4 4 4
shift<uint256, uint256, shl_halves>/2_mean -0.1520 -0.1520 3 3 3 3
shift<uint256, uint256, shl_halves>/3_mean -0.0950 -0.0950 2 2 2 2
shift<uint256, uint64_t, shl_public>/-1_mean -0.0402 -0.0402 3 3 3 3
shift<uint256, uint64_t, shl_public>/0_mean -0.1448 -0.1448 4 4 4 4
shift<uint256, uint64_t, shl_public>/1_mean -0.1506 -0.1506 4 4 4 4
shift<uint256, uint64_t, shl_public>/2_mean -0.0001 -0.0001 3 3 3 3
shift<uint256, uint64_t, shl_public>/3_mean -0.0084 -0.0084 2 2 2 2
shift<uint256, uint64_t, shl_halves>/-1_mean +0.0003 +0.0003 3 3 3 3
shift<uint256, uint64_t, shl_halves>/0_mean -0.0059 -0.0059 4 4 4 4
shift<uint256, uint64_t, shl_halves>/1_mean -0.0348 -0.0348 4 4 4 4
shift<uint256, uint64_t, shl_halves>/2_mean -0.0160 -0.0160 3 3 3 3
shift<uint256, uint64_t, shl_halves>/3_mean +0.0018 +0.0018 2 2 2 2
shift<uint512, uint512, shl_public>/-1_mean -0.0194 -0.0194 10 10 10 10
shift<uint512, uint512, shl_public>/0_mean -0.0185 -0.0185 10 10 10 10
shift<uint512, uint512, shl_public>/1_mean -0.0192 -0.0192 10 10 10 10
shift<uint512, uint512, shl_public>/2_mean -0.0183 -0.0183 10 10 10 10
shift<uint512, uint512, shl_public>/3_mean -0.0184 -0.0184 9 9 9 9
shift<uint512, uint64_t, shl_public>/-1_mean -0.0019 -0.0019 9 9 9 9
shift<uint512, uint64_t, shl_public>/0_mean -0.0034 -0.0034 9 9 9 9
shift<uint512, uint64_t, shl_public>/1_mean -0.0007 -0.0007 9 9 9 9
shift<uint512, uint64_t, shl_public>/2_mean +0.0009 +0.0009 9 9 9 9
shift<uint512, uint64_t, shl_public>/3_mean +0.0006 +0.0006 8 8 8 8
compare<lt_public>/64_mean +0.0058 +0.0058 2 2 2 2
compare<lt_public>/128_mean -0.1167 -0.1167 2 2 2 2
compare<lt_public>/192_mean -0.0361 -0.0361 2 2 2 2
compare<lt_public>/256_mean -0.2078 -0.2078 2 1 2 1
compare<lt_sub>/64_mean -0.0066 -0.0066 2 2 2 2
compare<lt_sub>/128_mean -0.0063 -0.0063 2 2 2 2
compare<lt_sub>/192_mean -0.0059 -0.0059 2 2 2 2
compare<lt_sub>/256_mean -0.0066 -0.0066 2 2 2 2
compare<lt_wordcmp>/64_mean -0.0029 -0.0029 2 2 2 2
compare<lt_wordcmp>/128_mean +0.2282 +0.2282 4 5 4 5
compare<lt_wordcmp>/192_mean +0.4004 +0.4004 4 5 4 5
compare<lt_wordcmp>/256_mean +0.4286 +0.4286 4 5 4 5
compare<lt_halves>/64_mean +0.0044 +0.0044 3 3 3 3
compare<lt_halves>/128_mean +0.0032 +0.0032 3 3 3 3
compare<lt_halves>/192_mean +0.0051 +0.0051 3 3 3 3
compare<lt_halves>/256_mean +0.0046 +0.0046 3 3 3 3
exponentiation/64_mean -0.0100 -0.0100 729 722 729 722
exponentiation/128_mean -0.0073 -0.0073 1457 1447 1457 1447
exponentiation/192_mean -0.0125 -0.0125 2190 2163 2190 2163
exponentiation/256_mean -0.0088 -0.0088 2914 2888 2914 2888
exponentiation2/0_mean -0.0201 -0.0201 11 11 11 11
exponentiation2/64_mean -0.0217 -0.0217 11 11 11 11
exponentiation2/128_mean -0.0781 -0.0781 12 11 12 11
exponentiation2/256_mean -0.0216 -0.0216 10 10 10 10
exponentiation2/512_mean -0.0211 -0.0211 10 10 10 10
count_sigificant_words_256/0_mean -0.0005 -0.0005 0 0 0 0
count_sigificant_words_256/1_mean +0.0010 +0.0010 0 0 0 0
count_sigificant_words_256/2_mean +0.0024 +0.0024 0 0 0 0
count_sigificant_words_256/3_mean +0.0009 +0.0009 0 0 0 0
count_sigificant_words_256/4_mean -0.0000 -0.0000 0 0 0 0
count_sigificant_words_256/5_mean -0.0009 -0.0009 0 0 0 0
count_sigificant_words_256/6_mean -0.0011 -0.0011 0 0 0 0
count_sigificant_words_256/7_mean +0.0005 +0.0005 0 0 0 0
count_sigificant_words_256/8_mean +0.0005 +0.0005 0 0 0 0
to_string<uint128>_mean -0.0043 -0.0043 443 441 443 441
to_string<uint256>_mean +0.0029 +0.0029 2428 2434 2428 2434
to_string<uint512>_mean -0.0102 -0.0102 7608 7530 7608 7530
Thanks for looking into it @chfast! I'm not sure how to read the numbers you posted, but I am surprised that you found add/sub got slower; in my tests it was faster. I noticed that sometimes running the same code twice gave different numbers, though! Benchmarks are hard.
Also why diff between add and inline_add?
I'm not sure how to read the numbers
This is a comparison between master and your changes. The columns are:
- CPU time diff,
- wall-clock time diff,
- master CPU time in ns,
- your CPU time in ns,
- master wall-clock time in ns,
- your wall-clock time in ns.
This comes from a tool comparing outputs of test/intx-bench.
Also why diff between add and inline_add?
Because the benchmark runs in a loop, the "inline_add" can be vectorized. This is not relevant for the EVM use case. I could probably review all the benchmark cases and remove half of them.
Thanks!
the "inline_add" can be vectorized
vectorized or inlined? Why shouldn't it be relevant for EVM?
The main difference is that the addc implementation tries to use compiler builtins via the __has_builtin() macro. This macro is not available until GCC 10 (although the builtin itself is supported). This can be "fixed" by adding more preprocessor conditions, but I try to stick to the latest compiler versions when performance matters.
https://godbolt.org/z/1cxc6aTG3
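The preprocessor workaround mentioned above could look something like this: define __has_builtin to 0 when the compiler does not provide it, so the builtin path is simply skipped there. This is a hedged sketch; the macro name and fallback are illustrative, not intx's actual code.

```cpp
#include <cstdint>

// Guard __has_builtin itself for compilers (e.g. GCC < 10) that lack it.
#ifdef __has_builtin
#  define INTX_HAS_BUILTIN(x) __has_builtin(x)
#else
#  define INTX_HAS_BUILTIN(x) 0
#endif

inline uint64_t add_with_carry(uint64_t x, uint64_t y, bool& carry) noexcept
{
#if INTX_HAS_BUILTIN(__builtin_addcll)
    // Clang (and recent GCC) provide __builtin_addcll with an explicit
    // carry-in and carry-out.
    unsigned long long carry_out = 0;
    const auto s = __builtin_addcll(x, y, carry, &carry_out);
    carry = carry_out != 0;
    return s;
#else
    // Portable fallback: carry out iff x + y + carry_in overflowed 64 bits.
    const uint64_t s = x + y + carry;
    carry = s < x || (s == x && carry);
    return s;
#endif
}
```

The downside, as noted, is accumulating preprocessor conditions per compiler version, which is why targeting recent compilers may be the simpler policy.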
@chfast, if you are using GCC 11 (and therefore the builtin), I am puzzled that my version would be slower.