
Use double-wide types for additions in scalar code

Open sipa opened this issue 3 years ago • 11 comments

Suggested here by Andy Polyakov.

This is shorter, easier to reason about, less likely to contain branches, and likely (slightly) faster too.

sipa avatar Aug 07 '20 01:08 sipa

@gmaxwell You may want to test performance/branchlessness on various platforms.

sipa avatar Aug 07 '20 01:08 sipa

Good news: tests pass, including valgrind ct tests.

Bad news:

First is before the patch.

x86_64 64-bit GCC 10.2.1 (asm disabled, obviously)
scalar_sqr: min 0.0421us / avg 0.0422us / max 0.0423us
scalar_mul: min 0.0410us / avg 0.0410us / max 0.0411us
scalar_inverse: min 12.2us / avg 12.2us / max 12.2us
scalar_sqr: min 0.0456us / avg 0.0457us / max 0.0457us
scalar_mul: min 0.0433us / avg 0.0434us / max 0.0434us
scalar_inverse: min 13.2us / avg 13.2us / max 13.2us

x86_64 32-bit GCC 10.2.1
scalar_sqr: min 0.122us / avg 0.122us / max 0.123us
scalar_mul: min 0.109us / avg 0.109us / max 0.109us
scalar_inverse: min 35.0us / avg 35.0us / max 35.0us
scalar_sqr: min 0.0961us / avg 0.0962us / max 0.0965us
scalar_mul: min 0.110us / avg 0.110us / max 0.110us
scalar_inverse: min 28.6us / avg 28.6us / max 28.6us

ARMv8 64-bit GCC 9.3
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.222us / avg 0.222us / max 0.222us
scalar_inverse: min 61.1us / avg 61.1us / max 61.2us
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.237us / avg 0.237us / max 0.237us
scalar_inverse: min 61.6us / avg 61.6us / max 61.7us

ARMv8 32-bit GCC 9.3
scalar_sqr: min 0.399us / avg 0.399us / max 0.400us
scalar_mul: min 0.388us / avg 0.388us / max 0.388us
scalar_inverse: min 115us / avg 115us / max 115us
scalar_sqr: min 0.392us / avg 0.393us / max 0.394us
scalar_mul: min 0.471us / avg 0.473us / max 0.483us
scalar_inverse: min 117us / avg 117us / max 118us

gmaxwell avatar Aug 07 '20 02:08 gmaxwell

@sipa Do you remember #453? It goes in the same direction but takes a more radical approach, and may then be the right way to do it.

These benchmarks also remind me of my comment in #436. Optimizers seem to like these macros particularly well...

x86_64 32-bit GCC 10.2.1

At least it's faster here. Hm, now I (somewhat seriously) wonder whether we should do the opposite and get rid of uint128_t in the field code, which is the only reason why MSVC does not support our 64-bit code. (Even though there are other ways to fix this.)

real-or-random avatar Aug 07 '20 08:08 real-or-random

Bad news:

Life is a compromise. Looking back at the originating question, the main idea behind the suggestion was to leave the compiler no incentive to throw in a branch. Double-width additions achieve that, unlike conditionals. So you have to ask yourself what's more important. Besides, formally speaking, a data point from just one compiler isn't that conclusive.

dot-asm avatar Aug 07 '20 08:08 dot-asm

By the way, if anyone is interested: https://godbolt.org/z/8GThbv (no. 1 before, no. 2 after the patch). It's interesting to see how far llvm-mca's prediction is from reality. @gmaxwell What's the exact x86 CPU that you used to run the benchmark?

real-or-random avatar Aug 07 '20 09:08 real-or-random

What's the exact x86 CPU that you used to run the benchmark?

EPYC 7742

gmaxwell avatar Aug 07 '20 15:08 gmaxwell

I forgot to mention that I don't think these benchmarks are a showstopper. If we think that the code is better (for readability and no branching), I'm fine with this.

real-or-random avatar Aug 07 '20 17:08 real-or-random

@real-or-random Agree, though it may inform us about trying variants that perhaps have less performance impact.

Another point is that with the divstep inversion code, scalar mul/sqr performance will be almost entirely irrelevant for inversion (which I expect is the only way scalar operations matter to high-level functions like signing and verification).

sipa avatar Aug 07 '20 17:08 sipa

I think inversion is just interesting to look at because it gives a single number. Scalar mul performance is more important in the context of batch validation, some ZKPs, etc.

I think it's worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn't that important.

get rid of uint128_t in the field code,

AFAIK you can't do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for "just use asm" doesn't apply because it's a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.

gmaxwell avatar Aug 07 '20 18:08 gmaxwell

I think it's worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn't that important.

Indeed, and then #453 is the way to go.

get rid of uint128_t in the field code,

AFAIK you can't do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for "just use asm" doesn't apply because it's a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.

Ok, true.

real-or-random avatar Aug 07 '20 18:08 real-or-random

AFAIK you can't do that without losing access to the 64,64->high64 widening multiply which is important for performance.

Yes, there is no way we can have true 64-bit operation without the ability to do a 64x64->128 bit multiplication in one way or another.

But I don't think that's a problem - even MSVC has _umul128 that effectively implements that; just not as a neat 128-bit data type.

I also think it's orthogonal to the changes here. If we wanted to support non-__int128 platforms, the code would be replaced by some set of configuration-dependent macros that translate to __int128 operations on supporting platforms, and _umul128 on pairs of uint64_t on MSVC - but the changes here would still matter.

sipa avatar Aug 07 '20 20:08 sipa