go-ethereum icon indicating copy to clipboard operation
go-ethereum copied to clipboard

Implement binary EEA inversion for faster BNADD precompile

Open shamatar opened this issue 5 years ago • 18 comments

Single field inversion takes almost all the time of execution in case of point addition. Switching from inversion by exponentiation into p-2 power to binary EEA inversion reduces execution time in 3 times

Before change:
BenchmarkG1AddAndMakeAffine-12         85004         13955 ns/op         256 B/op          5 allocs/op
After change:
BenchmarkG1AddAndMakeAffine-12        261412          4593 ns/op         256 B/op          5 allocs/op

shamatar avatar Sep 03 '20 23:09 shamatar

Oooh, this sounds nice. Would it make sense to upstream it to Cloudflare's original library? https://github.com/cloudflare/bn256

Edit: I've opened an upstream issue referencing this https://github.com/cloudflare/bn256/issues/23.

karalabe avatar Sep 04 '20 06:09 karalabe

This is not constant time.

bwesterb avatar Sep 04 '20 07:09 bwesterb

I’m not sure if Cloudflare needs it constant time, but since element is not zero and inverse always exists then such inversion will finish in log(p) steps (for Eth purposes to be bounded computation).

N.B. Original EIP for this functionality does not give constant time guarantees anyway

shamatar avatar Sep 04 '20 07:09 shamatar

EEA is not the recommeded way to perform inversion, because its execution is not performed in constant time with respect to the value of its input. EEA is well-known to be the source of several side-channel attacks, see for example ia.cr/2020/432.

Regarding performance, point operations can be performed in projective coordinates, and at the very end make just one conversion.

armfazh avatar Sep 04 '20 19:09 armfazh

As was mentioned above there is no need to constant time for the precompile. And addition is indeed performed in projective coordinates, but Ethereum ABI requires outputs to be in affine coordinates, so fast inversion is important for single addition performance.

shamatar avatar Sep 04 '20 19:09 shamatar

This change seems to be specific to the application, and it's likely not to be included in cloudflare/bn256. You could add a method called MakeAffineInsecure which uses EEA to disinguish from the original MakeAffine method.

armfazh avatar Sep 04 '20 19:09 armfazh

Had to do some duplication of code (to avoid global config variables), but now there are two flavors of 'constant time' and 'variable time' available functions. Fp6/Fp12 inversions do not benefit from variable time speedup much, but I've continued with the same approach everywhere. Miller loop and final exponentiations are left with constant time cause speedup there is also negligible.

shamatar avatar Sep 07 '20 21:09 shamatar

Before and after:

name                                     old time/op    new time/op          delta
PrecompiledBn256Add/chfast1-Gas=150-6      13.6µs ± 1%           5.1µs ± 1%            -62.24%  (p=0.016 n=5+4)
PrecompiledBn256Add/chfast2-Gas=150-6      13.6µs ± 2%           5.2µs ± 1%            -61.84%  (p=0.016 n=5+4)
PrecompiledBn256Add/cdetrio1-Gas=150-6     1.02µs ± 2%          1.01µs ± 1%               ~     (p=0.286 n=5+5)
PrecompiledBn256Add/cdetrio2-Gas=150-6     1.05µs ± 1%          1.10µs ±20%               ~     (p=0.952 n=5+5)
PrecompiledBn256Add/cdetrio3-Gas=150-6     1.06µs ± 2%          1.05µs ± 4%               ~     (p=0.690 n=5+5)
PrecompiledBn256Add/cdetrio4-Gas=150-6     1.07µs ± 0%          1.10µs ± 2%             +2.21%  (p=0.029 n=4+4)
PrecompiledBn256Add/cdetrio5-Gas=150-6     1.04µs ± 3%          1.03µs ± 4%               ~     (p=0.246 n=5+5)
PrecompiledBn256Add/cdetrio6-Gas=150-6     1.36µs ± 2%          1.45µs ±18%               ~     (p=0.222 n=5+5)
PrecompiledBn256Add/cdetrio7-Gas=150-6     1.37µs ± 2%          1.44µs ±26%               ~     (p=0.841 n=5+5)
PrecompiledBn256Add/cdetrio8-Gas=150-6     1.41µs ± 1%          1.56µs ±18%               ~     (p=0.063 n=5+5)
PrecompiledBn256Add/cdetrio9-Gas=150-6     1.35µs ± 2%          1.38µs ± 1%             +2.19%  (p=0.032 n=5+4)
PrecompiledBn256Add/cdetrio10-Gas=150-6    1.37µs ± 7%          1.47µs ±23%               ~     (p=0.151 n=5+5)
PrecompiledBn256Add/cdetrio11-Gas=150-6    13.9µs ± 1%           5.6µs ± 1%            -59.74%  (p=0.008 n=5+5)
PrecompiledBn256Add/cdetrio12-Gas=150-6    13.8µs ± 1%           5.8µs ±13%            -57.82%  (p=0.008 n=5+5)
PrecompiledBn256Add/cdetrio13-Gas=150-6    13.7µs ± 1%           5.4µs ± 4%            -60.20%  (p=0.008 n=5+5)
PrecompiledBn256Add/cdetrio14-Gas=150-6    2.05µs ± 2%          2.13µs ± 5%               ~     (p=0.056 n=5+5)

It does a real difference in many cases (not all) - particularly these five:

PrecompiledBn256Add/chfast1-Gas=150-6      13.6µs ± 1%           5.1µs ± 1%            -62.24%  (p=0.016 n=5+4)
PrecompiledBn256Add/chfast2-Gas=150-6      13.6µs ± 2%           5.2µs ± 1%            -61.84%  (p=0.016 n=5+4)
PrecompiledBn256Add/cdetrio11-Gas=150-6    13.9µs ± 1%           5.6µs ± 1%            -59.74%  (p=0.008 n=5+5)
PrecompiledBn256Add/cdetrio12-Gas=150-6    13.8µs ± 1%           5.8µs ±13%            -57.82%  (p=0.008 n=5+5)
PrecompiledBn256Add/cdetrio13-Gas=150-6    13.7µs ± 1%           5.4µs ± 4%            -60.20%  (p=0.008 n=5+5)

However, the particular cases where it does make a difference, are the cases which are currently the slowest:

namely, these five:

name                                     old mgas/s    
PrecompiledBn256Add/chfast1-Gas=150-6        11.0 ± 1% 
PrecompiledBn256Add/chfast2-Gas=150-6        11.0 ± 2% 
PrecompiledBn256Add/cdetrio11-Gas=150-6      10.8 ± 1% 
PrecompiledBn256Add/cdetrio12-Gas=150-6      10.9 ± 1% 
PrecompiledBn256Add/cdetrio13-Gas=150-6      11.0 ± 1% 

Which are an order of magnitude worse than others:

PrecompiledBn256Add/cdetrio1-Gas=150-6        147 ± 1% 
PrecompiledBn256Add/cdetrio2-Gas=150-6        142 ± 1% 
PrecompiledBn256Add/cdetrio3-Gas=150-6        142 ± 2% 
PrecompiledBn256Add/cdetrio4-Gas=150-6        140 ± 0% 
PrecompiledBn256Add/cdetrio5-Gas=150-6        144 ± 3% 
PrecompiledBn256Add/cdetrio6-Gas=150-6        110 ± 2% 
PrecompiledBn256Add/cdetrio7-Gas=150-6        110 ± 2% 
PrecompiledBn256Add/cdetrio8-Gas=150-6        107 ± 2% 
PrecompiledBn256Add/cdetrio9-Gas=150-6        111 ± 2% 
PrecompiledBn256Add/cdetrio10-Gas=150-6       110 ± 7% 
PrecompiledBn256Add/cdetrio14-Gas=150-6      73.3 ± 2% 

As stated previously, constant time is not something we're bothered by. As long as this is mathematically correct, and doesn't suffer from a different set of worst-cases that we're as of yet unaware of, I'm all for merging this.

holiman avatar Sep 08 '20 21:09 holiman

Constant time must be the default.

bwesterb avatar Sep 09 '20 13:09 bwesterb

We geth-maintainers aren't cryptographers, and it's difficult for us to validate the proposed changes, other than that we can definitely agree that for us, constant time at the cost of speed, is not desireable.

However, I'd highly appreciate any input from @bwesterb or @armfazh regarding:

  • Is the change correct, as in, assuming the implementation is correct, does the math hold?
  • Would this lead to a new (even worse) set of worst-cases?

I don't expect you to spend a lot of time auditing our code, but if you have already looked at it, perhaps you already have formed opinions about these questions.

holiman avatar Sep 10 '20 10:09 holiman

I created some fuzzers for this change here: https://github.com/MariusVanDerWijden/go-ethereum/tree/var_time_fuzzer Been running them for some time now and they haven't found a difference between the varTime and the constant time implementations.

MariusVanDerWijden avatar Sep 21 '20 14:09 MariusVanDerWijden

@armfazh I think all your concerns has been addressed - none of the variable-time functions are on default paths. Do you think this can/will be merged upstream?

holiman avatar Sep 24 '20 11:09 holiman

@armfazh I think all your concerns has been addressed - none of the variable-time functions are on default paths. Do you think this can/will be merged upstream?

could anyone of you prepare a PR against upstream, so we can do a final review and merge it then,

armfazh avatar Sep 24 '20 20:09 armfazh

I started prepping an upstream PR, but for some reason the test fails. @shamatar do you have any idea why? I just copy-pasted the changes... https://github.com/holiman/bn256/tree/binary_eea_inversion

=== RUN   TestBinaryEAA
    TestBinaryEAA: bn256_test.go:232: results of different inversion do not agree

holiman avatar Sep 28 '20 19:09 holiman

I'll have a look at the morning, sure.

shamatar avatar Sep 28 '20 19:09 shamatar

I've identified a problem: modulus size in the original repo is 256 bits, so the trick with addNocarry and then right shift by 1 bit will not work cause addNocarry actually produces carry. In Ethereum the curve is different and modulus is 254 bits. I can fix it in a free time, but not sure about the time yet.

shamatar avatar Sep 29 '20 11:09 shamatar

@shamatar friendly ping, was this ever looked at?

axic avatar Nov 26 '21 12:11 axic

Hey @axic! I didn't, but the solution is quite simple, so since you have reminded me I should make a direct PR to the Cloudflare's repo, but for Ethereum the BN curve is different than in the original CF repo, so the trick with addNocarry is perfectly safe for Ethereum curve (adding two 254 bit numbers can not overflow 256 bit buffer) and can be used as-is

shamatar avatar Nov 26 '21 22:11 shamatar