
Use double-wide types for additions in scalar code

Open sipa opened this issue 3 years ago • 11 comments

Suggested here by Andy Polyakov.

This is shorter, easier to reason about, less likely to contain branches, and likely (slightly) faster too.

sipa avatar Aug 07 '20 01:08 sipa

@gmaxwell You may want to test performance/branchlessness on various platforms.

sipa avatar Aug 07 '20 01:08 sipa

Good news: tests pass, including valgrind ct tests.

Bad news:

First is before the patch.

x86_64 64-bit GCC 10.2.1 (asm disabled, obviously)
scalar_sqr: min 0.0421us / avg 0.0422us / max 0.0423us
scalar_mul: min 0.0410us / avg 0.0410us / max 0.0411us
scalar_inverse: min 12.2us / avg 12.2us / max 12.2us
scalar_sqr: min 0.0456us / avg 0.0457us / max 0.0457us
scalar_mul: min 0.0433us / avg 0.0434us / max 0.0434us
scalar_inverse: min 13.2us / avg 13.2us / max 13.2us

x86_64 32-bit GCC 10.2.1
scalar_sqr: min 0.122us / avg 0.122us / max 0.123us
scalar_mul: min 0.109us / avg 0.109us / max 0.109us
scalar_inverse: min 35.0us / avg 35.0us / max 35.0us
scalar_sqr: min 0.0961us / avg 0.0962us / max 0.0965us
scalar_mul: min 0.110us / avg 0.110us / max 0.110us
scalar_inverse: min 28.6us / avg 28.6us / max 28.6us

ARMv8 64-bit GCC 9.3
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.222us / avg 0.222us / max 0.222us
scalar_inverse: min 61.1us / avg 61.1us / max 61.2us
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.237us / avg 0.237us / max 0.237us
scalar_inverse: min 61.6us / avg 61.6us / max 61.7us

ARMv8 32-bit GCC 9.3
scalar_sqr: min 0.399us / avg 0.399us / max 0.400us
scalar_mul: min 0.388us / avg 0.388us / max 0.388us
scalar_inverse: min 115us / avg 115us / max 115us
scalar_sqr: min 0.392us / avg 0.393us / max 0.394us
scalar_mul: min 0.471us / avg 0.473us / max 0.483us
scalar_inverse: min 117us / avg 117us / max 118us

gmaxwell avatar Aug 07 '20 02:08 gmaxwell

@sipa Do you remember #453? It goes in the same direction but takes a more radical approach, and may then be the right way to do it.

These benchmarks also remind me of my comment in #436. Optimizers seem to like these macros particularly well...

x86_64 32-bit GCC 10.2.1

At least it's faster here. Hm, now I (somewhat seriously) wonder whether we should do the opposite and get rid of uint128_t in the field code, which is the only reason why MSVC does not support our 64-bit code. (Even though there are other ways to fix this.)

real-or-random avatar Aug 07 '20 08:08 real-or-random

Bad news:

Life is a compromise. Looking back at the originating question, the main idea behind the suggestion was to leave the compiler no incentive to throw in a branch. Double-width additions achieve that, unlike conditionals. So you have to ask yourself what's more important. Besides, formally speaking, a data point from just one compiler isn't that conclusive.

dot-asm avatar Aug 07 '20 08:08 dot-asm

By the way, if anyone is interested: https://godbolt.org/z/8GThbv (no. 1 before, no. 2 after the patch). It's interesting to see how far llvm-mca's prediction is from reality. @gmaxwell What's the exact x86 CPU that you used to run the benchmark?

real-or-random avatar Aug 07 '20 09:08 real-or-random

What's the exact x86 CPU that you used to run the benchmark?

EPYC 7742

gmaxwell avatar Aug 07 '20 15:08 gmaxwell

I forgot to mention that I don't think these benchmarks are a showstopper. If we think that the code is better (for readability and no branching), I'm fine with this.

real-or-random avatar Aug 07 '20 17:08 real-or-random

@real-or-random Agree, though it may inform us about trying variants that perhaps have less performance impact.

Another point is that with the divstep inversion code, scalar mul/sqr performance will be almost entirely irrelevant for inversion (which I expect is the only way scalar operations matter to high-level functions like signing and verification).

sipa avatar Aug 07 '20 17:08 sipa

I think inversion is just interesting to look at because it gives a single number. Scalar mul performance is more important in the context of batch validation, some ZKPs, etc.

I think it's worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn't that important.

get rid of uint128_t in the field code,

AFAIK you can't do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for "just use asm" doesn't apply because it's a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.

gmaxwell avatar Aug 07 '20 18:08 gmaxwell

I think it's worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn't that important.

Indeed, and then #453 is the way to go.

get rid of uint128_t in the field code,

AFAIK you can't do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for "just use asm" doesn't apply because it's a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.

Ok, true.

real-or-random avatar Aug 07 '20 18:08 real-or-random

AFAIK you can't do that without losing access to the 64,64->high64 widening multiply which is important for performance.

Yes, there is no way we can have true 64-bit operation without the ability to do a 64x64->128 bit multiplication in one way or another.

But I don't think that's a problem - even MSVC has _umul128 that effectively implements that; just not as a neat 128-bit data type.

I also think it's orthogonal to the changes here. If we wanted to support non-__int128 platforms, the code would be replaced by some set of configuration-dependent macros that translate to __int128 operations on supporting platforms, and _umul128 on pairs of uint64_t on MSVC - but the changes here would still matter.

sipa avatar Aug 07 '20 20:08 sipa