
Synthetic int128 type.

Open roconnor-blockstream opened this issue 2 years ago • 19 comments

Abstracts the int128 type and provides a native version, if available, or implements it using a pair of int64_t's.

This is activated by setting the configuration flag --with-test-override-wide-multiply=int128_struct.

The primary purpose of this PR is to take advantage of MSVC's umulh intrinsic, which we can use to simulate an int128 type, since MSVC does not have one (AFAIU). This PR lays out the groundwork for this level of MSVC support, but doesn't include the configuration logic to enable it yet.

For completeness, implementations of umulh and mulh are also provided for compilers that support neither the intrinsic nor the int128 type (such as CompCert?). This also opens up the possibility of removing the 32-bit field and scalar implementations, should that ever be desired.

roconnor-blockstream avatar Oct 26 '21 14:10 roconnor-blockstream

I'm moving this out of draft stage. The coding is complete. There are a few tasks that remain.

  • [x] Double check that there is no signed integer overflow due to rearrangement of the order of some operations.
  • [x] We need someone to build this on MSVC.
  • [x] Make sure that the refactored code for the native int128 type isn't any slower.

To compile with the synthetic int128 type, you currently need to pass a configuration flag --with-test-override-wide-multiply=int128_struct.

roconnor-blockstream avatar Dec 07 '21 20:12 roconnor-blockstream

* [ ]  We need someone to build this on MSVC.

I have built this with MSVC 2019 on wine (https://github.com/mstorsjo/msvc-wine) and it works. Tests and exhaustive tests pass.

For reference:

 ./configure --disable-benchmark --with-test-override-wide-multiply=int128_struct CC=/opt/msvc/bin/x64/cl.exe CFLAGS="-Za -O2 -w" LD=/opt/msvc/bin/x64/link.exe

For the Arch Linux users, there's an AUR package: https://aur.archlinux.org/packages/msvc-wine-git/

real-or-random avatar Dec 09 '21 19:12 real-or-random

FWIW, bitcoin core has an AppVeyor MSVC/Windows CI environment. Maybe it's worth looking into adding one for libsecp256k1 directly.

sipa avatar Dec 09 '21 19:12 sipa

Indeed. Let me note that Cirrus apparently also offers Windows, though I have no idea whether it's good and/or reliable: https://cirrus-ci.org/guide/windows/

real-or-random avatar Dec 10 '21 11:12 real-or-random

My above comment about the inclusion stuff was wrong but here's a cleaner version: https://github.com/real-or-random/secp256k1/commits/202201-int128-includes

The first commit simply fixes naming of header guards and should belong to this PR.

The second commit changes the code to adhere to what I wrote in #1039:

After this commit, int128.h and int128_impl.h are included as follows:

  • .c files which use int128 include int128_impl.h (after util.h)
  • .h files which use int128 include int128.h (after util.h)

This list is exhaustive. util.h needs to be included first because it sets up the necessary #defines.
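As a hypothetical illustration of that convention (the module name here is made up; the header names are the ones from this thread):

```c
/* example_module.h -- a .h file that uses int128 */
#include "util.h"    /* always first: sets up the required #defines */
#include "int128.h"  /* .h files use the declaration-only header */

/* example_module.c -- a .c file that uses int128 */
#include "util.h"        /* always first */
#include "int128_impl.h" /* .c files include the implementation header */
```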

If you want, please pick the second commit, too. Or if you don't want to deal with the C mess, I can create a PR with the second commit on top of yours, and we fix this once your PR has been merged.

real-or-random avatar Feb 01 '22 13:02 real-or-random

I notice that your PR still keeps the USE_*_WIDEMUL_* in util.h. It still seems we cannot really move it out of util.h, because it not only selects between INT128_STRUCT and INT128_NATIVE but can also select WIDEMUL_INT64, which isn't really int128-related.

More specifically this block of CPP code is interpreting the wideMultiply configuration option, which I've designed to have 3 options (by adding that third option).

roconnor-blockstream avatar Feb 14 '22 22:02 roconnor-blockstream

I notice that your PR still keeps the USE_*_WIDEMUL_* in util.h. It still seems we cannot really move it out of util.h, because it not only selects between INT128_STRUCT and INT128_NATIVE but can also select WIDEMUL_INT64, which isn't really int128-related.

Right. I think that's fine.

So this means we have a couple of input #defines (preset by the compiler or set by autoconf, e.g., USE_*_WIDEMUL_*), and we have some CPP logic in this file that turns those inputs into "nice" #ifdefs that can then be used in the rest of the code base, e.g., INT128_NATIVE. I believe that's a reasonable way of doing things.

(In the future we should move the CPP logic from util.h to a separate file. It somehow ended up in util.h though it does not belong there...)

real-or-random avatar Feb 17 '22 13:02 real-or-random

I've reviewed the changes in the association of + for signed int128 calculations to look for changes in overflow behavior.

Signed int128 is only used in the src/modinv64_impl.h file. The only changes in association occur in the functions:

  • secp256k1_modinv64_update_de_62
  • secp256k1_modinv64_update_fg_62
  • secp256k1_modinv64_update_fg_62_var

The values being summed are all of the form (int128_t)u * ??? + (int128_t)v * ??? (or similarly for q and r). The only way the new association could cause an overflow (or underflow) is if this previous calculation depended on some cancellation within the above addition term to prevent the final accumulation from itself overflowing. However, the previous calculation does not depend on any cancellation. The only assumed constraint on u and v (resp. q and r) is that |u|+|v| <= 2^62, which implies nothing about the values of u and v causing cancellation.

My conclusion, after the above analysis and re-reviewing the above functions, is that neither the new nor the old order of addition risks overflow.

roconnor-blockstream avatar Feb 28 '22 18:02 roconnor-blockstream

Indeed, we know nothing about the signs of the summands. To convince myself, I also redid the bounds analysis:

Counting value bits (i.e., bits that are not the sign bit):

For each step, cd starts with at most 52 bits. Then we accumulate (int128_t)u * ???, where the factors have at most 62 bits (implied by |u|+|v| <= 2^62*) and 63 bits (trivial bound for int64_t). Same with (int128_t)v * e0. Then we accumulate (int128_t) modinfo->modulus.??? * md, where the factors have at most 62 (modulus is in signed62) and 63 value bits.

We work with signed, so the value with the maximum absolute value representable in B bits is -(1<<B) (with absolute 1<<B).
This means the maximum absolute value cd can take is (1<<52) + (1<<62) * (1<<63) + (1<<62) * (1<<63) + (1<<62) * (1<<63), which is a 127-bit value and thus representable in 127 value bits.

Same is true for ce.

*except u==2^62, which needs 63 bits but then v==0, which also works: (1<<52) + (1<<63) * (1<<63) + 0 * (1<<63) + (1<<62) * (1<<63) is also a 127 bit value.

real-or-random avatar Mar 01 '22 10:03 real-or-random

@roconnor-blockstream Can you rebase this? This will ease benchmarking against master.

real-or-random avatar Mar 01 '22 11:03 real-or-random

Native 128-bit performance looks good:

$ SECP256K1_BENCH_ITERS=1000000 ./bench_internal inverse

gcc 11.2, pr:

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)

scalar_inverse                ,     2.40      ,     2.41      ,     2.42   
scalar_inverse_var            ,     1.74      ,     1.75      ,     1.81   
field_inverse                 ,     2.38      ,     2.39      ,     2.40   
field_inverse_var             ,     1.73      ,     1.73      ,     1.74

gcc 11.2, master:

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,     2.39      ,     2.39      ,     2.41   
scalar_inverse_var            ,     1.77      ,     1.77      ,     1.78   
field_inverse                 ,     2.37      ,     2.38      ,     2.39   
field_inverse_var             ,     1.76      ,     1.77      ,     1.82


clang 13.0.1, pr:
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,     2.70      ,     2.70      ,     2.72   
scalar_inverse_var            ,     1.69      ,     1.70      ,     1.70   
field_inverse                 ,     2.69      ,     2.70      ,     2.70   
field_inverse_var             ,     1.69      ,     1.69      ,     1.70   

clang 13.0.1, master:

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,     2.69      ,     2.70      ,     2.71   
scalar_inverse_var            ,     1.69      ,     1.69      ,     1.70   
field_inverse                 ,     2.69      ,     2.69      ,     2.69   
field_inverse_var             ,     1.68      ,     1.68      ,     1.68

But this is with asm on, I should have turned it off... Would be nice to see more benchmarks.

real-or-random avatar Mar 01 '22 11:03 real-or-random

Rebased.

roconnor-blockstream avatar Mar 01 '22 16:03 roconnor-blockstream

I've written formal correctness proofs of the functions in the src/int128_struct_impl.h file using VST. The most relevant parts are the function specifications, which can be found at lines of the form Definition secp256k1_*_spec : ...

roconnor-blockstream avatar Jun 07 '22 21:06 roconnor-blockstream

This is about 1% slower on my machine with gcc -O2 but about 1.5% faster with gcc -O3.

gcc 12.1.0 -O2, x86_64, asm disabled:

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

master:

ecdsa_verify                  ,    53.7       ,    53.9       ,    54.3    
ecdsa_sign                    ,    39.5       ,    39.7       ,    40.9    


pr:

ecdsa_verify                  ,    54.2       ,    54.4       ,    55.2    
ecdsa_sign                    ,    39.7       ,    39.9       ,    40.6  

gcc 12.1.0 -O3, x86_64, asm disabled:


Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

master:

ecdsa_verify                  ,    52.2       ,    52.5       ,    53.2    
ecdsa_sign                    ,    39.6       ,    39.7       ,    40.3   

pr:

ecdsa_verify                  ,    51.6       ,    51.8       ,    52.4    
ecdsa_sign                    ,    39.5       ,    39.7       ,    40.4  

And it seems that gcc version 12 is finally faster at -O3 than at -O2...

Do we care about the loss at -O2? I guess no.

real-or-random avatar Jul 11 '22 19:07 real-or-random

Is the generated asm different?

roconnor-blockstream avatar Jul 11 '22 20:07 roconnor-blockstream

Is the generated asm different?

Yeah, we could check but I imagine it will be a lot of work. It would probably make sense to first run the internal benchmarks and see where we lose performance. And then look at the asm of these functions.

real-or-random avatar Jul 11 '22 22:07 real-or-random

@roconnor-blockstream Did you update your VST proofs to the changes suggested by Pieter? (If not, are you planning to do so?)

real-or-random avatar Jul 29 '22 14:07 real-or-random

I have not updated the proofs, but I do plan to do so. I don't suggest waiting for them, but if you prefer to wait that's fine.

I'm working on a second pass through the proofs anyway, in an attempt to add more automatic tactics to help me, so it is slow going.

roconnor-blockstream avatar Aug 08 '22 13:08 roconnor-blockstream

While I continue to refactor my proofs to automate the tedious bits, I have now got to the point where the revised code in the int128_struct file is proved correct.

roconnor-blockstream avatar Aug 29 '22 21:08 roconnor-blockstream

I'm done with my pass at refactoring my proof. There isn't too much to say. The new proofs are less tedious than they were before. You are welcome to "browse" the proof at https://htmlpreview.github.io/?https://github.com/ElementsProject/simplicity/blob/027e46ca91dc7eecf5a05707798edf17e72612c3/alectryon/verif_int128_impl.v.html for whatever that's worth. Again, the *_spec definitions are probably the most important bits, as they define the pre- and post-conditions of the functions that form the statements of what I've proved.

The most notable change is that the specification for secp256k1_i128_to_i64 was amended to remove the precondition that the input be a 64-bit signed integer value. Apparently libsecp256k1 on occasion will cast signed int128 numbers to signed int64 numbers, expecting to truncate the bits. That seems kinda dicey to me, but okay.

roconnor-blockstream avatar Oct 24 '22 21:10 roconnor-blockstream

Apparently libsecp256k1 on occasion will cast signed int128 numbers to signed int64 numbers, expecting to truncate the bits. That seems kinda dicey to me, but okay.

Do you have an example where this is unexpected (according to your impression)?

real-or-random avatar Oct 25 '22 08:10 real-or-random

Looking into it more closely, it seems the sketchy cases of casting large signed int128 numbers to int64 are always then immediately masked with & M62, so it isn't so dicey in practice.

To check on this I created a commit that splits out a secp256k1_i128_to_u64 function and adds a VERIFY_CHECK to the secp256k1_i128_to_i64 function. This passes the tests. It will fail the tests with only the VERIFY_CHECK added.

I'm not proposing we add that commit; I only made it for validation purposes. We could add it if people really think it is a good idea, but that would be a separate PR.

roconnor-blockstream avatar Oct 25 '22 14:10 roconnor-blockstream

@jonasnick I've merged your tests into one commit.

roconnor-blockstream avatar Nov 07 '22 22:11 roconnor-blockstream

Some benchmarking on ARM64 hardware (AWS a1.large, Linux, GCC 11.2, all with --enable-static --disable-shared):

this PR (a340d9500a9c45e5c261174f48b3eb18b3b3647d):

  • aarch64 (64-bit):
    • --with-test-override-wide-multiply=int64 --with-asm=no:
      • ecdsa_verify , 149.0 , 149.0 , 149.0
      • ecdsa_sign , 98.2 , 98.2 , 98.4
    • --with-test-override-wide-multiply=int128 --with-asm=no:
      • ecdsa_verify , 175.0 , 175.0 , 176.0
      • ecdsa_sign , 101.0 , 101.0 , 102.0
    • --with-test-override-wide-multiply=int128_struct --with-asm=no:
      • ecdsa_verify , 294.0 , 295.0 , 295.0
      • ecdsa_sign , 154.0 , 154.0 , 154.0
  • armhf (32-bit, but running on the same 64-bit hardware):
    • --with-test-override-wide-multiply=int64 --with-asm=no:
      • ecdsa_verify , 301.0 , 301.0 , 301.0
      • ecdsa_sign , 172.0 , 172.0 , 172.0
    • --with-test-override-wide-multiply=int64 --with-asm=arm:
      • ecdsa_verify , 256.0 , 256.0 , 256.0
      • ecdsa_sign , 154.0 , 154.0 , 154.0
    • --with-test-override-wide-multiply=int128_struct --with-asm=no:
      • ecdsa_verify , 477.0 , 478.0 , 481.0
      • ecdsa_sign , 265.0 , 266.0 , 266.0

PR's base commit (694ce8fb2d1fd8a3d641d7c33705691d41a2a860):

  • aarch64 (64-bit):
    • --with-test-override-wide-multiply=int64 --with-asm=no:
      • ecdsa_verify , 149.0 , 149.0 , 149.0
      • ecdsa_sign , 98.2 , 98.3 , 98.3
    • --with-test-override-wide-multiply=int128 --with-asm=no:
      • ecdsa_verify , 175.0 , 175.0 , 176.0
      • ecdsa_sign , 102.0 , 102.0 , 102.0
  • armhf (32-bit):
    • --with-test-override-wide-multiply=int64 --with-asm=no:
      • ecdsa_verify , 300.0 , 301.0 , 301.0
      • ecdsa_sign , 172.0 , 172.0 , 172.0
    • --with-test-override-wide-multiply=int64 --with-asm=arm:
      • ecdsa_verify , 256.0 , 256.0 , 256.0
      • ecdsa_sign , 155.0 , 155.0 , 155.0

So conclusions (for this platform):

  • This PR doesn't make anything worse in terms of the existing wide-multiply configurations (int64, int128).
  • int64 is faster than int128 here, and within the int64 configuration the 64-bit (aarch64) build is faster than the 32-bit (armhf) build.
  • 32-bit ARM asm code is still faster than pure C.

sipa avatar Nov 07 '22 23:11 sipa

I'm not sure we should care about whether the MSVC 64-bit ARM logic is tested. We clearly can't promise testing on every possible hardware/OS/system/compiler combination, so we need to make some choices. Perhaps this is worth including once we have a release: a list of systems we're testing on, which is a subset of the systems we aim to work on (everything C89 + uint64_t, pretty much).

It's perhaps a bit strange to include code specifically to optimize performance on MSVC 64-bit ARM, without considering that platform tested. We could drop the optimization, but unless there is a concern that having code for optimization present would be misinterpreted as "supported", we could also just leave it for now.

sipa avatar Nov 13 '22 17:11 sipa

@sipa

We could drop the optimization, but unless there is a concern that having code for optimization present would be misinterpreted as "supported", we could also just leave it for now.

I agree, my remark was mostly about adding code paths that have never been run at all.

jonasnick avatar Nov 14 '22 12:11 jonasnick

Presumably we can add configuration options so that our MSVC test infrastructure can at least exercise this code path on X86.

roconnor-blockstream avatar Nov 14 '22 15:11 roconnor-blockstream

@roconnor-blockstream We could add a configure/config flag for forcing the use of _mulh/_umulh over _mul128/_umul128, and set it for one of our MSVC CI configs.

sipa avatar Nov 14 '22 15:11 sipa

I wrote some randomized tests for the int128 functions: https://github.com/sipa/secp256k1/commits/202211_int128

Not everything is covered, but the most tricky functions are.

sipa avatar Nov 14 '22 20:11 sipa

@roconnor-blockstream We could add a configure/config flag for forcing the use of _mulh/_umulh over _mul128/_umul128, and set it for one of our MSVC CI configs.

https://github.com/real-or-random/secp256k1/tree/202211-int128-mulh-override implemented here, ready to be cherry-picked

edit: I've tested this on local MSVC on wine with

./configure --enable-dev-mode --host=x86_64-w64-mingw32 --with-test-override-wide-multiply=int128_struct CPPFLAGS="-DSECP256K1_MSVC_MULH_TEST_OVERRIDE" CC=/opt/msvc/bin/x64/cl CFLAGS="-nologo -diagnostics:caret" LDFLAGS="-XCClinker -nologo -XCClinker -diagnostics:caret" NM="/opt/msvc/bin/x64/dumpbin -symbols -headers" AR="/opt/msvc/bin/x64/lib"

real-or-random avatar Nov 14 '22 22:11 real-or-random