
WIP: Add schnorrsig batch verification

Open • jonasnick opened this issue 4 years ago • 38 comments

This was part of #558 for 20 months to demonstrate the advantages of batch verification (see graph), but was then removed to simplify #558 because there are still ongoing discussions:

  • [ ] there's @real-or-random's proposal to add synthetic randomness for batch verification (https://github.com/sipa/bips/issues/204)
  • [ ] batch verification is fairly well tested, but I still wouldn't be comfortable using this in Bitcoin Core for consensus in its current state, because it relies on parts of the lib that are otherwise unused, such as scratch spaces and ecmult_multi. Ideally we would have comprehensive fuzz tests for batch verification.
  • [ ] adding chacha20 may not be worth it, because it may only provide a negligible speedup over SHA256 (TODO: test this), plus we're planning to allow overriding the SHA256 implementation at compile time (https://github.com/bitcoin-core/secp256k1/pull/558#issuecomment-619579991).

jonasnick, Jun 18 '20
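For reference, this is the equation that BIP340 batch verification checks for u signatures (r_i, s_i) on messages m_i under public keys P_i. Here a_1 = 1, the remaining a_2, ..., a_u are chosen uniformly at random from [1, n-1] with n the group order, and R_i is the point with x-coordinate r_i and even y. All 2u points plus G end up in a single multi-scalar multiplication, which is why the implementation leans on ecmult_multi. How the a_i are generated is presumably what the synthetic-randomness item above is about.

$$
\Big(\sum_{i=1}^{u} a_i s_i\Big)\cdot G \;=\; \sum_{i=1}^{u} a_i\cdot R_i \;+\; \sum_{i=1}^{u} (a_i e_i)\cdot P_i,
\qquad e_i = \operatorname{int}\!\big(\mathrm{hash}_{\mathrm{BIP0340/challenge}}(\mathrm{bytes}(r_i)\,\|\,\mathrm{bytes}(P_i)\,\|\,m_i)\big) \bmod n
$$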

Time to start rebasing on the nearly complete #558?

gmaxwell, Jul 22 '20

rebased on master

jonasnick, Sep 11 '20

schnorrsig_sign: min 25.7us / avg 25.8us / max 26.2us
schnorrsig_verify: min 57.5us / avg 57.7us / max 58.0us
schnorrsig_batch_verify_1: min 64.0us / avg 64.3us / max 64.7us
schnorrsig_batch_verify_2: min 50.4us / avg 50.7us / max 51.0us
schnorrsig_batch_verify_4: min 43.8us / avg 43.9us / max 44.1us
schnorrsig_batch_verify_8: min 40.4us / avg 40.5us / max 40.5us
schnorrsig_batch_verify_16: min 38.9us / avg 39.0us / max 39.1us
schnorrsig_batch_verify_32: min 38.2us / avg 38.4us / max 38.7us
schnorrsig_batch_verify_64: min 37.7us / avg 37.8us / max 37.9us
schnorrsig_batch_verify_128: min 35.2us / avg 35.3us / max 35.3us
schnorrsig_batch_verify_256: min 31.9us / avg 32.0us / max 32.1us
schnorrsig_batch_verify_512: min 29.2us / avg 29.4us / max 29.7us
schnorrsig_batch_verify_1024: min 27.5us / avg 27.5us / max 27.5us
schnorrsig_batch_verify_2048: min 25.8us / avg 25.9us / max 26.0us
schnorrsig_batch_verify_4096: min 24.5us / avg 24.7us / max 24.8us
schnorrsig_batch_verify_8192: min 23.5us / avg 23.5us / max 23.6us

jonasnick, Sep 11 '20
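The table is easier to read as a speedup. The small C sketch below divides the single-verification average by each batch average, using the numbers posted above. It assumes the batch figures are per-signature times, which the near-equal schnorrsig_verify and batch_verify_1 rows suggest. The helper names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Average times (us) copied from the benchmark above. */
static const double verify_avg_us = 57.7;
static const size_t batch_sizes[] = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192};
static const double batch_avg_us[] = {64.3, 50.7, 43.9, 40.5, 39.0, 38.4, 37.8,
                                      35.3, 32.0, 29.4, 27.5, 25.9, 24.7, 23.5};
#define N_SIZES (sizeof(batch_sizes) / sizeof(batch_sizes[0]))

/* Per-signature speedup of batch verification over schnorrsig_verify,
 * assuming both figures are per signature. Values below 1.0 mean slower. */
static double speedup_at(size_t idx) {
    assert(idx < N_SIZES);
    return verify_avg_us / batch_avg_us[idx];
}

/* Smallest benchmarked batch size that beats individual verification, or 0 if none. */
static size_t first_faster_batch_size(void) {
    size_t i;
    for (i = 0; i < N_SIZES; i++) {
        if (speedup_at(i) > 1.0) return batch_sizes[i];
    }
    return 0;
}
```

With these numbers, a batch of one is slower than plain verification (64.3us vs 57.7us). Batching already wins at size 2, and at 8192 it is roughly 2.45x faster per signature.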

It's a bit unfortunate that this API doesn't really lend itself to cleanly supporting combined batches of BIP340 signatures and taproot tweaks (which also need an EC multiplication).

Given that this internally builds a batch object anyway, would it be reasonable to have that in the external API as well? So an idea could be that you:

  • Construct an (opaque) batch object
  • Add BIP340 verifications to it, using a variant of secp256k1_schnorrsig_verify that either fails immediately (parsing/decompression failures), or succeeds when the check was added to a batch object.
  • Add tweak checks to it, using a variant of secp256k1_xonly_pubkey_tweak_add_check.
  • In the end, a batch_verify function can be called on the batch to do all checks, and return true or false.

sipa, Sep 11 '20
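The following is a minimal sketch of how such an opaque batch object could fold both kinds of checks into one final equation. Everything in it is hypothetical. toy_batch, toy_batch_add_sig and toy_batch_add_tweak_check are illustrative names, not libsecp256k1 API. The "group" is the integers modulo the prime 2^31 - 1 with generator 7, where discrete logarithms are trivial, so only the algebra carries over.

Each queued check is rewritten as 0 = c*G + sum of scalar*point terms and multiplied by a random nonzero weight. All G coefficients are then folded into one scalar, so verification becomes a single multi-multiplication. The xorshift weights are a stand-in for properly derived randomness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy group: "points" and scalars are integers modulo the prime 2^31 - 1 (insecure). */
#define TOY_Q 2147483647ULL
#define TOY_G 7ULL
#define TOY_MAX_TERMS 64

static uint64_t mulq(uint64_t a, uint64_t b) { return (a % TOY_Q) * (b % TOY_Q) % TOY_Q; }
static uint64_t addq(uint64_t a, uint64_t b) { return (a % TOY_Q + b % TOY_Q) % TOY_Q; }
static uint64_t negq(uint64_t a) { return (TOY_Q - a % TOY_Q) % TOY_Q; }

/* Hypothetical batch: one accumulated G coefficient plus owned (scalar, point) terms. */
typedef struct {
    uint64_t g_coeff;
    uint64_t scalars[TOY_MAX_TERMS];
    uint64_t points[TOY_MAX_TERMS];
    size_t len;
    uint64_t rng; /* xorshift64 state: stand-in for properly derived randomness */
} toy_batch;

static void toy_batch_init(toy_batch *b, uint64_t seed) {
    b->g_coeff = 0;
    b->len = 0;
    b->rng = seed ? seed : 1;
}

/* Weight in [1, q-1], so a single bad check can never cancel to zero. */
static uint64_t next_weight(toy_batch *b) {
    b->rng ^= b->rng << 13;
    b->rng ^= b->rng >> 7;
    b->rng ^= b->rng << 17;
    return 1 + b->rng % (TOY_Q - 1);
}

static void push_term(toy_batch *b, uint64_t scalar, uint64_t point) {
    assert(b->len < TOY_MAX_TERMS);
    b->scalars[b->len] = scalar;
    b->points[b->len] = point;
    b->len++;
}

/* Queue s*G == R + e*P as 0 == a*s*G - a*R - a*e*P. Returns 0 if full. */
static int toy_batch_add_sig(toy_batch *b, uint64_t s, uint64_t R, uint64_t e, uint64_t P) {
    uint64_t a;
    if (b->len + 2 > TOY_MAX_TERMS) return 0;
    a = next_weight(b);
    b->g_coeff = addq(b->g_coeff, mulq(a, s));
    push_term(b, negq(a), R);
    push_term(b, negq(mulq(a, e)), P);
    return 1;
}

/* Queue the tweak check Q == P + t*G as 0 == a*t*G + a*P - a*Q. Returns 0 if full. */
static int toy_batch_add_tweak_check(toy_batch *b, uint64_t Q, uint64_t P, uint64_t t) {
    uint64_t a;
    if (b->len + 2 > TOY_MAX_TERMS) return 0;
    a = next_weight(b);
    b->g_coeff = addq(b->g_coeff, mulq(a, t));
    push_term(b, a, P);
    push_term(b, negq(a), Q);
    return 1;
}

/* One combined check: g_coeff*G + sum scalars[i]*points[i] must be zero. */
static int toy_batch_verify(const toy_batch *b) {
    uint64_t acc = mulq(b->g_coeff, TOY_G);
    size_t i;
    for (i = 0; i < b->len; i++) acc = addq(acc, mulq(b->scalars[i], b->points[i]));
    return acc == 0;
}

/* Demo: one signature and one tweak check in the same batch; optionally corrupt s. */
static int toy_batch_demo(int corrupt) {
    toy_batch b;
    uint64_t x = 12345, k = 67890, e = 424242, t = 999;
    uint64_t P = mulq(x, TOY_G), R = mulq(k, TOY_G);
    uint64_t s = addq(k, mulq(e, x));
    uint64_t Q = addq(P, mulq(t, TOY_G));
    toy_batch_init(&b, 0x9e3779b97f4a7c15ULL);
    if (!toy_batch_add_sig(&b, corrupt ? addq(s, 1) : s, R, e, P)) return -1;
    if (!toy_batch_add_tweak_check(&b, Q, P, t)) return -1;
    return toy_batch_verify(&b);
}
```

toy_batch_demo(0) returns 1. toy_batch_demo(1), which corrupts the signature's s, returns 0. Because this toy batch copies its inputs, it does not depend on caller-owned memory outliving the add calls.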

> It's a bit unfortunate that this API doesn't really lend itself to cleanly supporting combined batches of BIP340 signatures and taproot tweaks (which also need an EC multiplication).
>
> Given that this internally builds a batch object anyway, would it be reasonable to have that in the external API as well? So an idea could be that you:
>
> * Construct an (opaque) batch object
> * Add BIP340 verifications to it, using a variant of `secp256k1_schnorrsig_verify` that either fails immediately (parsing/decompression failures), or succeeds when the check was added to a batch object.
> * Add tweak checks to it, using a variant of `secp256k1_xonly_pubkey_tweak_add_check`.
> * In the end, a batch_verify function can be called on the batch to do all checks, and return true or false.

Ooh, I like that construction. It allows for lazy batching and verifying only when you're ready, which is also very useful for non-Bitcoin applications that can verify periodically when they have spare CPU time.

elichai, Sep 12 '20

Sounds like a reasonable plan, in particular because it would be easy to add functions that manipulate the batch object for other schemes that need an EC multiplication at the end of verification. The current batch object only holds pointers to the elements, so care must be taken to ensure that they still exist at batch_verify time if this becomes a multi-step process.

jonasnick, Sep 12 '20

If it can be avoided, I think it would be best to minimize holding pointers to caller-provided objects, except in narrow cases (e.g. scratch space). Lifetime management is hard for everyone.

An alternative might be to have a function that takes a signature count and pointers to arrays of pubkeys/signatures/message hashes, then a taproot tweak count and arrays for those. It's less generic, but it would avoid needing to copy the inputs into library-provided memory or retain pointers to caller-provided objects.

gmaxwell, Sep 12 '20
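The array-based alternative could look roughly like the following sketch. It uses a toy group (integers modulo the prime 2^31 - 1 with generator 7, trivially insecure), and toy_verify_arrays and its parameters are hypothetical. The caller keeps ownership of every array. The function only reads them during the call, so no pointer outlives it and no library-side copy is made.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_Q 2147483647ULL /* prime 2^31 - 1: toy group, insecure */
#define TOY_G 7ULL

static uint64_t mulq(uint64_t a, uint64_t b) { return (a % TOY_Q) * (b % TOY_Q) % TOY_Q; }
static uint64_t addq(uint64_t a, uint64_t b) { return (a % TOY_Q + b % TOY_Q) % TOY_Q; }
static uint64_t negq(uint64_t a) { return (TOY_Q - a % TOY_Q) % TOY_Q; }

/* xorshift64 weight in [1, q-1]; a stand-in for properly derived randomness. */
static uint64_t next_weight(uint64_t *state) {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    return 1 + *state % (TOY_Q - 1);
}

/* Hypothetical entry point: checks s[i]*G == R[i] + e[i]*P[i] for every signature
 * and Q[j] == IP[j] + t[j]*G for every tweak, as one random linear combination. */
static int toy_verify_arrays(size_t n_sigs, const uint64_t *s, const uint64_t *R,
                             const uint64_t *e, const uint64_t *P,
                             size_t n_tweaks, const uint64_t *Q, const uint64_t *IP,
                             const uint64_t *t, uint64_t seed) {
    uint64_t state = seed ? seed : 1, g_coeff = 0, acc = 0;
    size_t i;
    assert(n_sigs == 0 || (s && R && e && P));
    assert(n_tweaks == 0 || (Q && IP && t));
    for (i = 0; i < n_sigs; i++) {
        uint64_t a = next_weight(&state);
        g_coeff = addq(g_coeff, mulq(a, s[i]));                                 /* +a*s*G */
        acc = addq(acc, negq(addq(mulq(a, R[i]), mulq(mulq(a, e[i]), P[i])))); /* -a*R - a*e*P */
    }
    for (i = 0; i < n_tweaks; i++) {
        uint64_t a = next_weight(&state);
        g_coeff = addq(g_coeff, mulq(a, t[i]));                        /* +a*t*G */
        acc = addq(acc, addq(mulq(a, IP[i]), negq(mulq(a, Q[i])))); /* +a*IP - a*Q */
    }
    /* With real curve points, g_coeff and the weighted (scalar, point) pairs would be
     * fed into one multi-multiplication instead of being summed term by term. */
    return addq(mulq(g_coeff, TOY_G), acc) == 0;
}

static int toy_arrays_demo(int corrupt_tweak) {
    uint64_t x[2] = {111, 222}, k[2] = {333, 444}, e[2] = {555, 666};
    uint64_t s[2], R[2], P[2], Q[1], IP[1], t[1] = {777};
    size_t i;
    for (i = 0; i < 2; i++) {
        P[i] = mulq(x[i], TOY_G);
        R[i] = mulq(k[i], TOY_G);
        s[i] = addq(k[i], mulq(e[i], x[i]));
    }
    IP[0] = P[0];                          /* tweak the first signer's key */
    Q[0] = addq(IP[0], mulq(t[0], TOY_G)); /* Q = IP + t*G */
    if (corrupt_tweak) Q[0] = addq(Q[0], 1);
    return toy_verify_arrays(2, s, R, e, P, 1, Q, IP, t, 42);
}
```

The trade-off mentioned in the thread is visible here. The signature is tied to exactly two kinds of checks, so adding a third scheme would mean changing the function rather than adding an add_* call.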

@gmaxwell The alternative is probably that the caller is going to do the copying into some batch object on their side instead, so I don't think it's that much of a difference.

I think having the batch object have its own storage is probably better. That may mean that the caller should be able to select a maximum size (and once exceeded, transparently run validation of the already-provided batch?)

sipa, Sep 14 '20

Sounds fine to me, though I hope it doesn't need 2x the memory to store both the input and the intermediate work. :)

gmaxwell, Sep 14 '20

> I think having the batch object have its own storage is probably better. That may mean that the caller should be able to select a maximum size (and once exceeded, transparently run validation of the already-provided batch?)

I agree, but I'm somewhat worried about how: this will probably require the caller to know the approximate size of the batch (or the number of signatures/tweaks) when starting the batch. I'd love to see if there's some creative C API we can come up with.

elichai, Sep 14 '20

@elichai No, I mean the opposite!

The caller shouldn't need to predict how large the batch will become - if they knew that, they wouldn't need it, as they could just choose to stop after a certain size instead.

What I mean is that the caller gets to set a maximum memory usage limit, and when that limit would be exceeded, adding another entry to the batch just causes the batch validation to run on what was added so far - and remember the outcome of that.

sipa, Sep 14 '20

If what it processed so far failed, all further calls can be super fast because it's just going to return a fail. :P

gmaxwell, Sep 14 '20

Taking short circuit evaluation of && to a next level.

sipa, Sep 14 '20
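The following is a minimal sketch of the memory-bounded batch discussed above. Entries are copied into a fixed-capacity buffer, where TOY_CAP stands in for the caller's memory limit. When an add would overflow the buffer, the pending entries are verified first. A failure is remembered, so every later add returns immediately.

The sketch uses a toy group (integers modulo the prime 2^31 - 1, generator 7) in place of secp256k1 points. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_Q 2147483647ULL /* prime 2^31 - 1: toy group, insecure */
#define TOY_G 7ULL
#define TOY_CAP 4           /* caller-chosen memory limit, in entries */

static uint64_t mulq(uint64_t a, uint64_t b) { return (a % TOY_Q) * (b % TOY_Q) % TOY_Q; }
static uint64_t addq(uint64_t a, uint64_t b) { return (a % TOY_Q + b % TOY_Q) % TOY_Q; }
static uint64_t subq(uint64_t a, uint64_t b) { return (a % TOY_Q + TOY_Q - b % TOY_Q) % TOY_Q; }

typedef struct {
    uint64_t s[TOY_CAP], R[TOY_CAP], e[TOY_CAP], P[TOY_CAP]; /* owned copies of inputs */
    size_t len;
    int failed;   /* sticky: once a flushed sub-batch fails, the whole batch fails */
    uint64_t rng; /* xorshift64 state standing in for real randomness */
} toy_batch;

static uint64_t next_weight(toy_batch *b) {
    b->rng ^= b->rng << 13;
    b->rng ^= b->rng >> 7;
    b->rng ^= b->rng << 17;
    return 1 + b->rng % (TOY_Q - 1);
}

/* Verify pending entries (s*G == R + e*P) as one random combination, then empty the buffer. */
static int flush(toy_batch *b) {
    uint64_t g = 0, acc = 0;
    size_t i;
    for (i = 0; i < b->len; i++) {
        uint64_t a = next_weight(b);
        g = addq(g, mulq(a, b->s[i]));
        acc = subq(acc, mulq(a, addq(b->R[i], mulq(b->e[i], b->P[i]))));
    }
    b->len = 0;
    return addq(mulq(g, TOY_G), acc) == 0;
}

static void toy_batch_add_sig(toy_batch *b, uint64_t s, uint64_t R, uint64_t e, uint64_t P) {
    if (b->failed) return;   /* outcome already known: adding is free */
    if (b->len == TOY_CAP) { /* limit reached: verify what we have so far */
        b->failed = !flush(b);
        if (b->failed) return;
    }
    assert(b->len < TOY_CAP);
    b->s[b->len] = s;
    b->R[b->len] = R;
    b->e[b->len] = e;
    b->P[b->len] = P;
    b->len++;
}

static int toy_batch_verify(toy_batch *b) {
    if (!b->failed) b->failed = !flush(b);
    return !b->failed;
}

/* Adds 10 signatures (forcing two early flushes); corrupts one if corrupt_index >= 0. */
static int toy_flush_demo(int corrupt_index) {
    toy_batch b = {0};
    uint64_t i;
    b.rng = 0x2545f4914f6cdd1dULL;
    for (i = 0; i < 10; i++) {
        uint64_t x = 1000 + i, k = 2000 + i, e = 3000 + i;
        uint64_t s = addq(k, mulq(e, x));
        if ((int)i == corrupt_index) s = addq(s, 1);
        toy_batch_add_sig(&b, s, mulq(k, TOY_G), e, mulq(x, TOY_G));
    }
    return toy_batch_verify(&b);
}
```

A bad signature is caught whether it lands in an early flushed chunk (index 2) or in the final pending one (index 9).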

> (and once exceeded, transparently run validation of the already-provided batch?)

I like that :) It gives the caller a trade-off between memory and CPU without crippling them if they mispredict the maximum size.

elichai, Sep 14 '20

How do people feel about the following API:

int secp256k1_start_batch_size(size_t ops);
secp256k1_batch* secp256k1_start_batch(const secp256k1_context* ctx, secp256k1_scratch_space* scratch);
int secp256k1_batch_add_sig(ctx, batch, sig, msg, pubkey);
int secp256k1_batch_add_xpubkey_tweak_add_check(ctx, batch, parity, tweaked_pubkey, pubkey, tweak);
int secp256k1_batch_verify(ctx, batch);

All the add functions for secp256k1_batch will use something like this:

if (batch->len == batch->scratch_capacity) {
    if (batch->failed) {
        /* an earlier sub-batch already failed; secp256k1_batch_verify will report it */
        return 1;
    }
    batch->failed = !secp256k1_batch_verify(ctx, batch);
    /* clear the rest of the state, e.g.: */
    batch->len = 0;
}
/* add to batch */
batch->len++;
return 1;

(all the names are subject to bikeshedding)

elichai, Mar 29 '21

I like this idea of batch verifying in an add function if the scratch space is full. It'll need quite a bit of refactoring in ecmult_multi to separate out scratch space allocation. @elichai that matches my understanding of the approach and looks good to me. What does secp256k1_start_batch_size do?

jonasnick, Mar 29 '21

I'm starting to think ecmult_multi_var is slightly too narrow an interface to be used for batch verification. Currently it does "Multi-multiply: R = inp_g_sc * G + sum_i ni * Ai." But what I think we want is one that does "Multi-multiply: R = (sum_i gi) * G + sum_i ni * Ai.", so that we can stream a series of equations to be batch verified without needing to add up all the G coefficients in advance.

We have (attempted) this nice streamable API for ecmult_multi_var, but what's the point of it if we just have to allocate a new buffer for all the inputs upfront?

roconnor-blockstream, Mar 29 '21
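The following toy model illustrates the proposed interface change. It uses integers modulo the prime 2^31 - 1 instead of curve points. toy_ecmult_multi_stream and toy_term_cb are hypothetical, not the library's internal ecmult_multi_var.

Each callback invocation contributes its own G coefficient g_i together with one (n_i, A_i) pair. That way R = (sum_i gi) * G + sum_i ni * Ai is accumulated in a single streaming pass, with no pre-summed G scalar. A Schnorr equation is streamed as two terms that share one random weight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_Q 2147483647ULL /* prime 2^31 - 1: toy group, insecure */
#define TOY_G 7ULL

static uint64_t mulq(uint64_t a, uint64_t b) { return (a % TOY_Q) * (b % TOY_Q) % TOY_Q; }
static uint64_t addq(uint64_t a, uint64_t b) { return (a % TOY_Q + b % TOY_Q) % TOY_Q; }
static uint64_t negq(uint64_t a) { return (TOY_Q - a % TOY_Q) % TOY_Q; }

/* splitmix64 mixer, used here as a toy per-equation weight generator. */
static uint64_t splitmix64(uint64_t z) {
    z += 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* Hypothetical callback: term idx adds g to the G coefficient and n*A to the sum. */
typedef int (*toy_term_cb)(uint64_t *g, uint64_t *n, uint64_t *A, size_t idx, void *data);

/* R = (sum_i g_i)*G + sum_i n_i*A_i; the G coefficient is accumulated while streaming. */
static int toy_ecmult_multi_stream(uint64_t *r, toy_term_cb cb, void *data, size_t n_terms) {
    uint64_t g_sum = 0, acc = 0;
    size_t i;
    assert(r != NULL && cb != NULL);
    for (i = 0; i < n_terms; i++) {
        uint64_t g, n, A;
        if (!cb(&g, &n, &A, i, data)) return 0;
        g_sum = addq(g_sum, g);
        acc = addq(acc, mulq(n, A));
    }
    *r = addq(mulq(g_sum, TOY_G), acc);
    return 1;
}

/* Signatures s*G == R + e*P, streamed as two terms each (R first, then P). */
typedef struct { const uint64_t *s, *R, *e, *P; uint64_t seed; } toy_sig_set;

static int sig_term_cb(uint64_t *g, uint64_t *n, uint64_t *A, size_t idx, void *data) {
    const toy_sig_set *set = (const toy_sig_set *)data;
    size_t j = idx / 2;
    uint64_t a = 1 + splitmix64(set->seed ^ (uint64_t)j) % (TOY_Q - 1); /* shared by both terms */
    if (idx % 2 == 0) {
        *g = mulq(a, set->s[j]); /* contributes a*s*G */
        *n = negq(a);            /* and -a*R */
        *A = set->R[j];
    } else {
        *g = 0;
        *n = negq(mulq(a, set->e[j])); /* contributes -a*e*P */
        *A = set->P[j];
    }
    return 1;
}

static int toy_stream_demo(int corrupt) {
    uint64_t s[3], R[3], e[3], P[3], r;
    toy_sig_set set;
    size_t i;
    for (i = 0; i < 3; i++) {
        uint64_t x = 50 + i, k = 60 + i;
        e[i] = 70 + i;
        P[i] = mulq(x, TOY_G);
        R[i] = mulq(k, TOY_G);
        s[i] = addq(k, mulq(e[i], x));
    }
    if (corrupt) s[1] = addq(s[1], 1);
    set.s = s; set.R = R; set.e = e; set.P = P; set.seed = 7;
    if (!toy_ecmult_multi_stream(&r, sig_term_cb, &set, 6)) return -1;
    return r == 0; /* the weighted sum is zero when every equation holds */
}
```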

> What does secp256k1_start_batch_size do?

Tells you the size of the scratch space required for the number of signatures/tweaks you want to batch.

elichai, Mar 29 '21

Barring such an enhanced ecmult_multi_var interface I would propose the following API for batch verification:

typedef int (secp256k1_batch_verify_gi_callback)(secp256k1_scalar *gi, size_t idx, void *data);
typedef int (secp256k1_batch_verify_callback)(secp256k1_scalar *na, secp256k1_scalar *nb, secp256k1_ge *pta, secp256k1_ge *ptb, size_t idx, void *data);

/* Verifies na_i*A_i = nb_i*B_i + ng_i * G for all i < n (with high probability). */
static int secp256k1_batch_verify(ctx, scratch, secp256k1_batch_verify_gi_callback cb_gi, secp256k1_batch_verify_callback cb, void *cbdata, size_t n);

secp256k1_batch_data_gi_from_sig(secp256k1_scalar *gi, sig);
secp256k1_batch_data_from_sig(secp256k1_scalar *na, secp256k1_scalar *nb, secp256k1_ge *pta, secp256k1_ge *ptb, sig, msg, pubkey);

secp256k1_batch_data_gi_from_xpubkey_tweak(secp256k1_scalar *gi);
secp256k1_batch_data_from_xpubkey_tweak(secp256k1_scalar *na, secp256k1_scalar *nb, secp256k1_ge *pta, secp256k1_ge *ptb, parity, tweaked_pubkey, pubkey, tweak);

Edit: There are a couple of possible variations here. We could drop the na scalar values, and instead verify A_i = nb_i*B_i + ng_i * G (though I think adding the na is fine as it comes nearly for free). We could also rearrange the verification equation to verify 0 = na_i*A_i + nb_i*B_i + ng_i * G or 0 = A_i + nb_i*B_i + ng_i * G. I don't have any strong feelings about these variants.

roconnor-blockstream, Mar 29 '21
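The following toy rendering of this callback-based proposal makes the equation shape concrete. It swaps secp256k1_scalar and secp256k1_ge for integers modulo the prime 2^31 - 1 (insecure; generator 7). All names are hypothetical stand-ins for the proposed secp256k1_batch_verify and secp256k1_batch_data_* functions.

One plausible reading of why the gi callback is separate is that it lets the G coefficient be aggregated in a first pass, and the sketch does exactly that. The mappings follow the rearrangements R = (-e)*P + s*G for a signature and Q = P + t*G for a tweak check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_Q 2147483647ULL /* prime 2^31 - 1: toy group, insecure */
#define TOY_G 7ULL

static uint64_t mulq(uint64_t a, uint64_t b) { return (a % TOY_Q) * (b % TOY_Q) % TOY_Q; }
static uint64_t addq(uint64_t a, uint64_t b) { return (a % TOY_Q + b % TOY_Q) % TOY_Q; }
static uint64_t subq(uint64_t a, uint64_t b) { return (a % TOY_Q + TOY_Q - b % TOY_Q) % TOY_Q; }
static uint64_t negq(uint64_t a) { return subq(0, a); }

/* Toy per-equation weight in [1, q-1], derived with the splitmix64 mixer. */
static uint64_t weight(uint64_t seed, size_t i) {
    uint64_t z = seed ^ (uint64_t)i;
    z += 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    z ^= z >> 31;
    return 1 + z % (TOY_Q - 1);
}

/* Shapes modeled on the proposed callbacks, with toy integers for scalars and points. */
typedef int (*toy_gi_cb)(uint64_t *gi, size_t idx, void *data);
typedef int (*toy_eq_cb)(uint64_t *na, uint64_t *nb, uint64_t *pta, uint64_t *ptb, size_t idx, void *data);

/* Checks na_i*A_i == nb_i*B_i + ng_i*G for all i < n via
 * sum_i a_i*(na_i*A_i - nb_i*B_i) - (sum_i a_i*ng_i)*G == 0. */
static int toy_batch_verify_cb(toy_gi_cb cb_gi, toy_eq_cb cb, void *cbdata, size_t n, uint64_t seed) {
    uint64_t g_sum = 0, acc = 0;
    size_t i;
    assert(cb_gi != NULL && cb != NULL);
    for (i = 0; i < n; i++) { /* pass 1: aggregate the G coefficient */
        uint64_t gi;
        if (!cb_gi(&gi, i, cbdata)) return 0;
        g_sum = addq(g_sum, mulq(weight(seed, i), gi));
    }
    for (i = 0; i < n; i++) { /* pass 2: the point terms */
        uint64_t na, nb, A, B, a = weight(seed, i);
        if (!cb(&na, &nb, &A, &B, i, cbdata)) return 0;
        acc = addq(acc, mulq(a, subq(mulq(na, A), mulq(nb, B))));
    }
    return subq(acc, mulq(g_sum, TOY_G)) == 0;
}

/* One equation in na*A == nb*B + ng*G form. */
typedef struct { uint64_t ng, na, nb, A, B; } toy_eq;

/* s*G == R + e*P  <=>  1*R == (-e)*P + s*G */
static toy_eq toy_eq_from_sig(uint64_t s, uint64_t R, uint64_t e, uint64_t P) {
    toy_eq q;
    q.ng = s; q.na = 1; q.A = R; q.nb = negq(e); q.B = P;
    return q;
}

/* Q == P + t*G  <=>  1*Q == 1*P + t*G */
static toy_eq toy_eq_from_tweak(uint64_t Q, uint64_t P, uint64_t t) {
    toy_eq q;
    q.ng = t; q.na = 1; q.A = Q; q.nb = 1; q.B = P;
    return q;
}

static int gi_cb(uint64_t *gi, size_t idx, void *data) {
    *gi = ((const toy_eq *)data)[idx].ng;
    return 1;
}

static int eq_cb(uint64_t *na, uint64_t *nb, uint64_t *pta, uint64_t *ptb, size_t idx, void *data) {
    const toy_eq *q = &((const toy_eq *)data)[idx];
    *na = q->na; *nb = q->nb; *pta = q->A; *ptb = q->B;
    return 1;
}

/* Demo: one signature and one tweak check; optionally corrupt the tweaked key. */
static int toy_cb_demo(int corrupt) {
    toy_eq eqs[2];
    uint64_t x = 31337, k = 4242, e = 99, t = 5;
    uint64_t P = mulq(x, TOY_G), R = mulq(k, TOY_G), s = addq(k, mulq(e, x));
    uint64_t Q = addq(P, mulq(t, TOY_G));
    eqs[0] = toy_eq_from_sig(s, R, e, P);
    eqs[1] = toy_eq_from_tweak(corrupt ? addq(Q, 1) : Q, P, t);
    return toy_batch_verify_cb(gi_cb, eq_cb, eqs, 2, 1234);
}
```

In this form, the caller's per-item data only has to be reachable through cbdata for the duration of the call. Adding another scheme means writing one more *_from_* mapping.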

My proposal was based on the idea that batch_verify must use secp256k1_ecmult_multi_var, but this line of thinking was wrong. batch_verify can call secp256k1_ecmult_pippenger_wnaf and friends directly. I withdraw my proposal until I give things more consideration.

roconnor-blockstream, Mar 29 '21

@roconnor-blockstream I don't see what the issue is with the multi-multiplication interface. The batch interface can do the aggregation of scalars before calling the multi-multiplication code.

I also don't think we should be exposing a public interface for arbitrary EC operations/verifications. This library aims for a high-level interface of protocols.

sipa, Mar 29 '21

The issue is that if everything is done naively the following happens:

  1. Batch verification allocates a buffer, or it is allocated by the caller.
  2. Batch verification runs point decompression on the data from the signatures and tweaks, filling the buffer until it is full, copying this data from the caller's own working copy of the signature and tweak data.
  3. A chacha seed is computed by scanning this buffered data (note that the BIP-340 specification for the chacha seed as written doesn't support mixing tweaks with signature data, so some liberty must be taken here).
  4. The buffered data is scanned again to compute the ng scalar value to be passed to ecmult_multi.
  5. ecmult_multi is called with this ng scalar value, the buffered data, and a custom callback to lookup points and scalars from this buffered data, and multiply it by the appropriate chacha coefficient.
  6. ecmult_multi calls either secp256k1_ecmult_pippenger_batch and/or secp256k1_ecmult_strauss_batch
  7. In either case, yet another buffer is allocated to hold yet another copy of the points, which is filled in by calling the callback, which simply copies from the previously allocated buffer.
  8. The result of ecmult_multi is tested for infinity.

This naive approach involves three copies of an entire batch of points in working memory simultaneously:

  1. Group elements in compressed form from signature data, and public keys, and tap-tweak public keys that the user is starting from.
  2. a copy with decompressed points for the naive batch validation implementation itself (an alternative implementation could maybe keep this copy of points compressed, but it still needs to be buffered.)
  3. another copy of decompressed points for either secp256k1_ecmult_pippenger_batch or secp256k1_ecmult_strauss_batch, depending on which one ends up being used.

Having 3 simultaneous copies of a rather large amount of the same data just to conform to the existing ecmult_multi API doesn't seem reasonable.

roconnor-blockstream, Mar 29 '21

I don't recall what this implementation does but at least at one point batch validation was implemented without duplicated buffering, reusing the scratch space for both the queue and the working space for the multi-exp.

ecmult_multi is not a public API: it is entirely internal to the library and not exported (everything that is exposed is annotated with SECP256K1_API), and I think half the motivation for its particular structure was so that the wNAF Pippenger code could be shimmed into a set of existing tests. If its interface needs changes, or the layout of the scratch space needs to change to avoid making extra copies, that is probably a perfectly reasonable thing to do, but it is also a behind-the-scenes optimization that shouldn't change the public interface.

gmaxwell, Mar 30 '21

As @elichai noted on IRC, this is an unfair comparison because the pre-rebase benchmark was without endomorphism. So here's pre-rebase with endo enabled:

$ ./bench_schnorrsig 
schnorrsig_sign: min 28.3us / avg 28.6us / max 28.9us
schnorrsig_verify: min 45.2us / avg 45.7us / max 46.4us
schnorrsig_batch_verify_1: min 50.3us / avg 50.7us / max 51.2us
schnorrsig_batch_verify_2: min 46.3us / avg 46.7us / max 47.0us
schnorrsig_batch_verify_4: min 42.9us / avg 43.2us / max 43.5us
schnorrsig_batch_verify_8: min 41.5us / avg 41.6us / max 41.9us
schnorrsig_batch_verify_16: min 41.8us / avg 41.9us / max 42.0us
schnorrsig_batch_verify_32: min 41.4us / avg 41.6us / max 41.7us
schnorrsig_batch_verify_64: min 38.9us / avg 39.2us / max 39.6us
schnorrsig_batch_verify_128: min 35.7us / avg 35.7us / max 35.8us
schnorrsig_batch_verify_256: min 32.4us / avg 32.9us / max 33.7us
schnorrsig_batch_verify_512: min 30.5us / avg 30.6us / max 30.7us
schnorrsig_batch_verify_1024: min 28.5us / avg 28.6us / max 28.7us
schnorrsig_batch_verify_2048: min 27.1us / avg 27.3us / max 27.6us
schnorrsig_batch_verify_4096: min 25.9us / avg 26.4us / max 26.6us
schnorrsig_batch_verify_8192: min 26.1us / avg 26.2us / max 26.3us

EDIT: I cannot explain this performance regression right now; here's the pre-rebase branch I used.

jonasnick, Mar 30 '21

I can't reproduce those benchmark results.

All numbers on AMD Ryzen Threadripper 2950X 16-Core Processor, GCC 10.2.1.

old pre-safegcd branch with endo enabled and gmp enabled:

schnorrsig_sign: min 29.2us / avg 29.4us / max 30.0us
schnorrsig_verify: min 48.4us / avg 48.7us / max 49.2us
schnorrsig_batch_verify_1: min 55.0us / avg 55.2us / max 55.6us
schnorrsig_batch_verify_2: min 50.3us / avg 50.4us / max 50.4us
schnorrsig_batch_verify_4: min 46.7us / avg 46.8us / max 46.9us
schnorrsig_batch_verify_8: min 44.6us / avg 44.7us / max 44.7us
schnorrsig_batch_verify_16: min 44.2us / avg 44.3us / max 44.5us
schnorrsig_batch_verify_32: min 43.5us / avg 43.6us / max 43.6us
schnorrsig_batch_verify_64: min 41.1us / avg 41.1us / max 41.2us
schnorrsig_batch_verify_128: min 37.8us / avg 37.8us / max 37.9us
schnorrsig_batch_verify_256: min 33.9us / avg 34.2us / max 34.4us
schnorrsig_batch_verify_512: min 32.0us / avg 32.0us / max 32.0us
schnorrsig_batch_verify_1024: min 29.8us / avg 29.9us / max 30.0us
schnorrsig_batch_verify_2048: min 28.3us / avg 28.4us / max 28.4us
schnorrsig_batch_verify_4096: min 27.1us / avg 27.2us / max 27.3us
schnorrsig_batch_verify_8192: min 27.1us / avg 27.2us / max 27.3us

old pre-safegcd branch with endo enabled and gmp disabled:

schnorrsig_sign: min 29.1us / avg 29.2us / max 29.4us
schnorrsig_verify: min 52.2us / avg 52.4us / max 52.8us
schnorrsig_batch_verify_1: min 55.0us / avg 55.2us / max 55.4us
schnorrsig_batch_verify_2: min 50.6us / avg 50.7us / max 50.9us
schnorrsig_batch_verify_4: min 47.0us / avg 47.3us / max 47.5us
schnorrsig_batch_verify_8: min 44.9us / avg 44.9us / max 44.9us
schnorrsig_batch_verify_16: min 44.8us / avg 45.0us / max 45.4us
schnorrsig_batch_verify_32: min 43.5us / avg 43.6us / max 43.6us
schnorrsig_batch_verify_64: min 41.2us / avg 41.3us / max 41.3us
schnorrsig_batch_verify_128: min 37.8us / avg 37.8us / max 37.9us
schnorrsig_batch_verify_256: min 34.2us / avg 34.3us / max 34.4us
schnorrsig_batch_verify_512: min 32.4us / avg 33.1us / max 33.9us
schnorrsig_batch_verify_1024: min 30.3us / avg 30.4us / max 30.6us
schnorrsig_batch_verify_2048: min 28.3us / avg 28.4us / max 28.6us
schnorrsig_batch_verify_4096: min 27.1us / avg 27.3us / max 27.5us
schnorrsig_batch_verify_8192: min 27.1us / avg 27.2us / max 27.3us

new branch (endo and gmp are gone):

schnorrsig_sign: min 26.3us / avg 26.5us / max 26.7us
schnorrsig_verify: min 48.0us / avg 48.3us / max 48.6us
schnorrsig_batch_verify_1: min 55.1us / avg 55.3us / max 55.3us
schnorrsig_batch_verify_2: min 50.6us / avg 50.7us / max 50.9us
schnorrsig_batch_verify_4: min 47.0us / avg 47.2us / max 47.5us
schnorrsig_batch_verify_8: min 44.6us / avg 44.8us / max 45.0us
schnorrsig_batch_verify_16: min 44.3us / avg 44.6us / max 44.7us
schnorrsig_batch_verify_32: min 43.6us / avg 43.7us / max 43.8us
schnorrsig_batch_verify_64: min 41.2us / avg 41.4us / max 41.6us
schnorrsig_batch_verify_128: min 37.7us / avg 38.1us / max 38.8us
schnorrsig_batch_verify_256: min 34.0us / avg 34.2us / max 34.5us
schnorrsig_batch_verify_512: min 31.9us / avg 32.2us / max 32.4us
schnorrsig_batch_verify_1024: min 30.1us / avg 30.2us / max 30.3us
schnorrsig_batch_verify_2048: min 28.3us / avg 28.5us / max 28.7us
schnorrsig_batch_verify_4096: min 27.2us / avg 27.2us / max 27.3us
schnorrsig_batch_verify_8192: min 27.2us / avg 27.3us / max 27.5us

sipa, Mar 31 '21

Similar results with GCC 7.5.0 on the same hardware

pre-safegcd, with endo, with gmp:

schnorrsig_sign: min 28.8us / avg 29.0us / max 29.8us
schnorrsig_verify: min 48.3us / avg 48.6us / max 48.9us
schnorrsig_batch_verify_1: min 54.6us / avg 54.8us / max 55.1us
schnorrsig_batch_verify_2: min 50.1us / avg 50.2us / max 50.4us
schnorrsig_batch_verify_4: min 46.7us / avg 46.8us / max 46.9us
schnorrsig_batch_verify_8: min 44.3us / avg 44.3us / max 44.3us
schnorrsig_batch_verify_16: min 44.0us / avg 44.1us / max 44.2us
schnorrsig_batch_verify_32: min 43.4us / avg 43.8us / max 44.7us
schnorrsig_batch_verify_64: min 41.1us / avg 41.2us / max 41.4us
schnorrsig_batch_verify_128: min 37.5us / avg 37.6us / max 37.6us
schnorrsig_batch_verify_256: min 33.8us / avg 33.9us / max 33.9us
schnorrsig_batch_verify_512: min 31.8us / avg 31.9us / max 32.0us
schnorrsig_batch_verify_1024: min 29.7us / avg 29.8us / max 29.9us
schnorrsig_batch_verify_2048: min 28.2us / avg 28.2us / max 28.3us
schnorrsig_batch_verify_4096: min 27.0us / avg 27.2us / max 27.3us
schnorrsig_batch_verify_8192: min 27.0us / avg 27.1us / max 27.2us

pre-safegcd, with endo, without gmp:

schnorrsig_sign: min 29.0us / avg 29.2us / max 29.6us
schnorrsig_verify: min 51.9us / avg 52.5us / max 54.2us
schnorrsig_batch_verify_1: min 54.7us / avg 55.0us / max 55.5us
schnorrsig_batch_verify_2: min 50.2us / avg 50.4us / max 50.7us
schnorrsig_batch_verify_4: min 46.7us / avg 46.9us / max 47.1us
schnorrsig_batch_verify_8: min 44.4us / avg 44.5us / max 44.6us
schnorrsig_batch_verify_16: min 44.0us / avg 44.1us / max 44.2us
schnorrsig_batch_verify_32: min 43.3us / avg 43.4us / max 43.5us
schnorrsig_batch_verify_64: min 40.9us / avg 40.9us / max 41.0us
schnorrsig_batch_verify_128: min 37.5us / avg 37.5us / max 37.7us
schnorrsig_batch_verify_256: min 33.9us / avg 34.4us / max 34.8us
schnorrsig_batch_verify_512: min 31.8us / avg 31.8us / max 32.0us
schnorrsig_batch_verify_1024: min 29.6us / avg 29.7us / max 29.7us
schnorrsig_batch_verify_2048: min 28.1us / avg 28.2us / max 28.4us
schnorrsig_batch_verify_4096: min 26.9us / avg 27.0us / max 27.1us
schnorrsig_batch_verify_8192: min 27.0us / avg 27.2us / max 27.3us

post-safegcd:

schnorrsig_sign: min 25.9us / avg 26.0us / max 26.3us
schnorrsig_verify: min 47.9us / avg 48.1us / max 48.4us
schnorrsig_batch_verify_1: min 54.7us / avg 54.9us / max 55.0us
schnorrsig_batch_verify_2: min 50.2us / avg 50.4us / max 50.6us
schnorrsig_batch_verify_4: min 46.8us / avg 48.2us / max 50.6us
schnorrsig_batch_verify_8: min 44.5us / avg 45.1us / max 45.5us
schnorrsig_batch_verify_16: min 44.0us / avg 44.2us / max 44.5us
schnorrsig_batch_verify_32: min 43.6us / avg 43.6us / max 43.7us
schnorrsig_batch_verify_64: min 40.9us / avg 41.0us / max 41.2us
schnorrsig_batch_verify_128: min 37.6us / avg 37.9us / max 38.4us
schnorrsig_batch_verify_256: min 33.7us / avg 33.8us / max 34.1us
schnorrsig_batch_verify_512: min 31.8us / avg 32.1us / max 32.4us
schnorrsig_batch_verify_1024: min 29.6us / avg 29.6us / max 29.7us
schnorrsig_batch_verify_2048: min 28.1us / avg 28.2us / max 28.2us
schnorrsig_batch_verify_4096: min 27.0us / avg 27.3us / max 27.7us
schnorrsig_batch_verify_8192: min 27.1us / avg 27.2us / max 27.5us

sipa, Mar 31 '21

Did benchmarks on an i7-7820HQ CPU with the clock fixed at 2.6 GHz.

I do indeed observe a small regression on some GCC versions (7, 8, 10), but with clang it appears to go the other way around. I don't think there is much reason for concern here: we know performance varies between compiler versions, and it's to be expected that different code will affect different compilers differently:

pre-safegcd ENDO=on GMP=off CC=gcc-7
schnorrsig_sign: min 38.5us / avg 38.5us / max 38.6us
schnorrsig_verify: min 66.4us / avg 66.6us / max 67.1us
schnorrsig_batch_verify_1: min 70.2us / avg 70.3us / max 70.5us
schnorrsig_batch_verify_2: min 64.0us / avg 64.0us / max 64.1us
schnorrsig_batch_verify_4: min 59.6us / avg 59.6us / max 59.7us
schnorrsig_batch_verify_8: min 57.2us / avg 57.2us / max 57.3us
schnorrsig_batch_verify_16: min 57.5us / avg 57.6us / max 57.6us
schnorrsig_batch_verify_32: min 57.1us / avg 57.3us / max 57.5us
schnorrsig_batch_verify_64: min 53.5us / avg 53.6us / max 53.6us
schnorrsig_batch_verify_128: min 49.1us / avg 49.1us / max 49.2us
schnorrsig_batch_verify_256: min 44.4us / avg 44.5us / max 44.5us
schnorrsig_batch_verify_512: min 42.1us / avg 42.1us / max 42.2us
schnorrsig_batch_verify_1024: min 39.3us / avg 39.4us / max 39.4us
schnorrsig_batch_verify_2048: min 37.4us / avg 37.5us / max 37.6us
schnorrsig_batch_verify_4096: min 36.0us / avg 36.0us / max 36.2us
schnorrsig_batch_verify_8192: min 36.0us / avg 36.0us / max 36.1us

post-safegcd CC=gcc-7
schnorrsig_sign: min 35.3us / avg 35.3us / max 35.5us
schnorrsig_verify: min 61.9us / avg 62.0us / max 62.3us
schnorrsig_batch_verify_1: min 70.6us / avg 70.7us / max 70.8us
schnorrsig_batch_verify_2: min 64.3us / avg 64.3us / max 64.4us
schnorrsig_batch_verify_4: min 59.7us / avg 59.8us / max 59.9us
schnorrsig_batch_verify_8: min 57.2us / avg 57.3us / max 57.4us
schnorrsig_batch_verify_16: min 57.7us / avg 57.7us / max 57.8us
schnorrsig_batch_verify_32: min 57.3us / avg 57.4us / max 57.5us
schnorrsig_batch_verify_64: min 53.6us / avg 53.7us / max 53.7us
schnorrsig_batch_verify_128: min 49.2us / avg 49.2us / max 49.3us
schnorrsig_batch_verify_256: min 44.5us / avg 44.5us / max 44.6us
schnorrsig_batch_verify_512: min 42.1us / avg 42.2us / max 42.3us
schnorrsig_batch_verify_1024: min 39.4us / avg 39.4us / max 39.5us
schnorrsig_batch_verify_2048: min 37.5us / avg 37.6us / max 37.7us
schnorrsig_batch_verify_4096: min 36.0us / avg 36.1us / max 36.2us
schnorrsig_batch_verify_8192: min 36.1us / avg 36.1us / max 36.2us


pre-safegcd ENDO=on GMP=off CC=gcc-8
schnorrsig_sign: min 38.3us / avg 38.4us / max 38.5us
schnorrsig_verify: min 66.7us / avg 66.8us / max 67.0us
schnorrsig_batch_verify_1: min 70.5us / avg 70.6us / max 70.7us
schnorrsig_batch_verify_2: min 64.3us / avg 64.4us / max 64.5us
schnorrsig_batch_verify_4: min 59.8us / avg 59.9us / max 60.0us
schnorrsig_batch_verify_8: min 57.4us / avg 57.5us / max 57.6us
schnorrsig_batch_verify_16: min 57.9us / avg 58.0us / max 58.1us
schnorrsig_batch_verify_32: min 57.4us / avg 57.5us / max 57.6us
schnorrsig_batch_verify_64: min 53.8us / avg 53.9us / max 53.9us
schnorrsig_batch_verify_128: min 49.3us / avg 49.4us / max 49.5us
schnorrsig_batch_verify_256: min 44.6us / avg 44.7us / max 44.7us
schnorrsig_batch_verify_512: min 42.3us / avg 42.3us / max 42.4us
schnorrsig_batch_verify_1024: min 39.5us / avg 39.5us / max 39.6us
schnorrsig_batch_verify_2048: min 37.6us / avg 37.7us / max 37.8us
schnorrsig_batch_verify_4096: min 36.1us / avg 36.2us / max 36.4us
schnorrsig_batch_verify_8192: min 36.2us / avg 36.3us / max 36.3us

post-safegcd CC=gcc-8
schnorrsig_sign: min 35.0us / avg 35.1us / max 35.2us
schnorrsig_verify: min 62.0us / avg 62.1us / max 62.6us
schnorrsig_batch_verify_1: min 70.7us / avg 70.7us / max 70.7us
schnorrsig_batch_verify_2: min 64.3us / avg 64.4us / max 64.4us
schnorrsig_batch_verify_4: min 59.8us / avg 59.9us / max 60.1us
schnorrsig_batch_verify_8: min 57.4us / avg 57.5us / max 57.5us
schnorrsig_batch_verify_16: min 57.9us / avg 57.9us / max 57.9us
schnorrsig_batch_verify_32: min 57.5us / avg 57.6us / max 57.6us
schnorrsig_batch_verify_64: min 53.8us / avg 53.8us / max 53.9us
schnorrsig_batch_verify_128: min 49.3us / avg 49.3us / max 49.4us
schnorrsig_batch_verify_256: min 44.6us / avg 44.7us / max 44.7us
schnorrsig_batch_verify_512: min 42.3us / avg 42.3us / max 42.3us
schnorrsig_batch_verify_1024: min 39.5us / avg 39.5us / max 39.6us
schnorrsig_batch_verify_2048: min 37.6us / avg 37.6us / max 37.7us
schnorrsig_batch_verify_4096: min 36.1us / avg 36.2us / max 36.3us
schnorrsig_batch_verify_8192: min 36.2us / avg 36.2us / max 36.2us


pre-safegcd ENDO=on GMP=off CC=gcc-9
schnorrsig_sign: min 38.4us / avg 38.4us / max 38.6us
schnorrsig_verify: min 66.8us / avg 66.9us / max 67.1us
schnorrsig_batch_verify_1: min 70.6us / avg 70.6us / max 70.7us
schnorrsig_batch_verify_2: min 64.5us / avg 64.5us / max 64.6us
schnorrsig_batch_verify_4: min 59.8us / avg 59.8us / max 59.9us
schnorrsig_batch_verify_8: min 57.4us / avg 57.5us / max 57.5us
schnorrsig_batch_verify_16: min 57.9us / avg 57.9us / max 58.0us
schnorrsig_batch_verify_32: min 57.5us / avg 57.5us / max 57.6us
schnorrsig_batch_verify_64: min 53.7us / avg 53.7us / max 53.8us
schnorrsig_batch_verify_128: min 49.3us / avg 49.3us / max 49.3us
schnorrsig_batch_verify_256: min 44.6us / avg 44.6us / max 44.6us
schnorrsig_batch_verify_512: min 42.2us / avg 42.2us / max 42.2us
schnorrsig_batch_verify_1024: min 39.4us / avg 39.4us / max 39.5us
schnorrsig_batch_verify_2048: min 37.5us / avg 37.6us / max 37.6us
schnorrsig_batch_verify_4096: min 36.1us / avg 36.2us / max 36.3us
schnorrsig_batch_verify_8192: min 36.2us / avg 36.2us / max 36.2us

post-safegcd CC=gcc-9
schnorrsig_sign: min 35.0us / avg 35.0us / max 35.2us
schnorrsig_verify: min 62.1us / avg 62.2us / max 62.9us
schnorrsig_batch_verify_1: min 70.6us / avg 70.7us / max 70.9us
schnorrsig_batch_verify_2: min 64.2us / avg 64.2us / max 64.3us
schnorrsig_batch_verify_4: min 59.6us / avg 59.6us / max 59.7us
schnorrsig_batch_verify_8: min 57.3us / avg 57.5us / max 57.7us
schnorrsig_batch_verify_16: min 57.9us / avg 57.9us / max 58.0us
schnorrsig_batch_verify_32: min 57.4us / avg 57.5us / max 57.5us
schnorrsig_batch_verify_64: min 53.9us / avg 53.9us / max 53.9us
schnorrsig_batch_verify_128: min 49.3us / avg 49.3us / max 49.4us
schnorrsig_batch_verify_256: min 44.6us / avg 44.6us / max 44.6us
schnorrsig_batch_verify_512: min 42.2us / avg 42.2us / max 42.2us
schnorrsig_batch_verify_1024: min 39.4us / avg 39.4us / max 39.5us
schnorrsig_batch_verify_2048: min 37.5us / avg 37.6us / max 37.7us
schnorrsig_batch_verify_4096: min 36.1us / avg 36.1us / max 36.2us
schnorrsig_batch_verify_8192: min 36.1us / avg 36.1us / max 36.2us


pre-safegcd ENDO=on GMP=off CC=gcc-10
schnorrsig_sign: min 39.3us / avg 39.4us / max 39.5us
schnorrsig_verify: min 66.6us / avg 66.6us / max 66.8us
schnorrsig_batch_verify_1: min 70.3us / avg 70.3us / max 70.4us
schnorrsig_batch_verify_2: min 64.2us / avg 64.2us / max 64.2us
schnorrsig_batch_verify_4: min 59.8us / avg 59.8us / max 59.8us
schnorrsig_batch_verify_8: min 57.3us / avg 57.3us / max 57.3us
schnorrsig_batch_verify_16: min 57.7us / avg 57.7us / max 57.8us
schnorrsig_batch_verify_32: min 57.3us / avg 57.3us / max 57.4us
schnorrsig_batch_verify_64: min 53.7us / avg 53.8us / max 53.8us
schnorrsig_batch_verify_128: min 49.4us / avg 49.4us / max 49.4us
schnorrsig_batch_verify_256: min 44.6us / avg 44.6us / max 44.7us
schnorrsig_batch_verify_512: min 42.2us / avg 42.3us / max 42.4us
schnorrsig_batch_verify_1024: min 39.4us / avg 39.5us / max 39.6us
schnorrsig_batch_verify_2048: min 37.6us / avg 37.6us / max 37.7us
schnorrsig_batch_verify_4096: min 36.1us / avg 36.2us / max 36.3us
schnorrsig_batch_verify_8192: min 36.2us / avg 36.2us / max 36.3us

post-safegcd CC=gcc-10
schnorrsig_sign: min 35.9us / avg 35.9us / max 36.2us
schnorrsig_verify: min 61.9us / avg 61.9us / max 62.1us
schnorrsig_batch_verify_1: min 70.5us / avg 70.5us / max 70.5us
schnorrsig_batch_verify_2: min 64.4us / avg 64.4us / max 64.4us
schnorrsig_batch_verify_4: min 60.1us / avg 60.1us / max 60.2us
schnorrsig_batch_verify_8: min 57.7us / avg 57.7us / max 57.7us
schnorrsig_batch_verify_16: min 57.8us / avg 57.9us / max 57.9us
schnorrsig_batch_verify_32: min 57.4us / avg 57.4us / max 57.5us
schnorrsig_batch_verify_64: min 53.7us / avg 53.7us / max 53.8us
schnorrsig_batch_verify_128: min 49.3us / avg 49.3us / max 49.4us
schnorrsig_batch_verify_256: min 44.6us / avg 44.6us / max 44.6us
schnorrsig_batch_verify_512: min 42.2us / avg 42.2us / max 42.3us
schnorrsig_batch_verify_1024: min 39.4us / avg 39.5us / max 39.5us
schnorrsig_batch_verify_2048: min 37.6us / avg 37.6us / max 37.7us
schnorrsig_batch_verify_4096: min 36.1us / avg 36.3us / max 36.5us
schnorrsig_batch_verify_8192: min 36.2us / avg 36.2us / max 36.2us


pre-safegcd ENDO=on GMP=off CC=clang-8
schnorrsig_sign: min 35.8us / avg 35.9us / max 36.1us
schnorrsig_verify: min 66.4us / avg 66.4us / max 66.6us
schnorrsig_batch_verify_1: min 70.6us / avg 70.7us / max 70.7us
schnorrsig_batch_verify_2: min 63.6us / avg 63.7us / max 63.8us
schnorrsig_batch_verify_4: min 58.8us / avg 58.8us / max 58.8us
schnorrsig_batch_verify_8: min 56.3us / avg 56.4us / max 56.4us
schnorrsig_batch_verify_16: min 56.6us / avg 56.7us / max 56.9us
schnorrsig_batch_verify_32: min 56.5us / avg 56.6us / max 56.6us
schnorrsig_batch_verify_64: min 52.5us / avg 52.5us / max 52.6us
schnorrsig_batch_verify_128: min 48.1us / avg 48.2us / max 48.2us
schnorrsig_batch_verify_256: min 43.6us / avg 43.6us / max 43.7us
schnorrsig_batch_verify_512: min 41.3us / avg 41.4us / max 41.4us
schnorrsig_batch_verify_1024: min 38.6us / avg 38.6us / max 38.7us
schnorrsig_batch_verify_2048: min 36.8us / avg 36.9us / max 36.9us
schnorrsig_batch_verify_4096: min 35.4us / avg 35.5us / max 35.6us
schnorrsig_batch_verify_8192: min 35.4us / avg 35.5us / max 35.5us

post-safegcd CC=clang-8
schnorrsig_sign: min 32.5us / avg 32.5us / max 32.6us
schnorrsig_verify: min 61.6us / avg 61.7us / max 62.3us
schnorrsig_batch_verify_1: min 69.8us / avg 69.8us / max 69.9us
schnorrsig_batch_verify_2: min 63.0us / avg 63.1us / max 63.1us
schnorrsig_batch_verify_4: min 58.3us / avg 58.3us / max 58.4us
schnorrsig_batch_verify_8: min 55.9us / avg 55.9us / max 56.0us
schnorrsig_batch_verify_16: min 56.4us / avg 56.4us / max 56.5us
schnorrsig_batch_verify_32: min 56.4us / avg 56.4us / max 56.5us
schnorrsig_batch_verify_64: min 52.6us / avg 52.7us / max 52.7us
schnorrsig_batch_verify_128: min 48.3us / avg 48.3us / max 48.3us
schnorrsig_batch_verify_256: min 43.7us / avg 43.8us / max 43.8us
schnorrsig_batch_verify_512: min 41.4us / avg 41.5us / max 41.5us
schnorrsig_batch_verify_1024: min 38.7us / avg 38.7us / max 38.8us
schnorrsig_batch_verify_2048: min 36.9us / avg 37.0us / max 37.1us
schnorrsig_batch_verify_4096: min 35.4us / avg 35.5us / max 35.6us
schnorrsig_batch_verify_8192: min 35.4us / avg 35.5us / max 35.5us


pre-safegcd ENDO=on GMP=off CC=clang-9
schnorrsig_sign: min 37.8us / avg 37.9us / max 38.5us
schnorrsig_verify: min 66.5us / avg 66.6us / max 67.4us
schnorrsig_batch_verify_1: min 70.0us / avg 70.1us / max 70.1us
schnorrsig_batch_verify_2: min 63.1us / avg 63.2us / max 63.3us
schnorrsig_batch_verify_4: min 58.4us / avg 58.5us / max 58.6us
schnorrsig_batch_verify_8: min 55.6us / avg 55.7us / max 55.9us
schnorrsig_batch_verify_16: min 56.1us / avg 56.2us / max 56.3us
schnorrsig_batch_verify_32: min 56.2us / avg 56.2us / max 56.3us
schnorrsig_batch_verify_64: min 52.5us / avg 52.5us / max 52.6us
schnorrsig_batch_verify_128: min 48.2us / avg 48.2us / max 48.2us
schnorrsig_batch_verify_256: min 43.7us / avg 43.7us / max 43.7us
schnorrsig_batch_verify_512: min 41.4us / avg 41.4us / max 41.4us
schnorrsig_batch_verify_1024: min 38.6us / avg 38.7us / max 38.7us
schnorrsig_batch_verify_2048: min 36.8us / avg 36.9us / max 37.0us
schnorrsig_batch_verify_4096: min 35.4us / avg 35.5us / max 35.6us
schnorrsig_batch_verify_8192: min 35.5us / avg 35.5us / max 35.5us

post-safegcd CC=clang-9
schnorrsig_sign: min 34.5us / avg 34.5us / max 34.6us
schnorrsig_verify: min 61.5us / avg 61.6us / max 61.8us
schnorrsig_batch_verify_1: min 69.8us / avg 69.8us / max 69.9us
schnorrsig_batch_verify_2: min 63.3us / avg 63.3us / max 63.4us
schnorrsig_batch_verify_4: min 58.6us / avg 58.6us / max 58.7us
schnorrsig_batch_verify_8: min 55.8us / avg 55.9us / max 55.9us
schnorrsig_batch_verify_16: min 56.3us / avg 56.3us / max 56.4us
schnorrsig_batch_verify_32: min 56.4us / avg 56.4us / max 56.5us
schnorrsig_batch_verify_64: min 52.5us / avg 52.5us / max 52.5us
schnorrsig_batch_verify_128: min 48.2us / avg 48.2us / max 48.2us
schnorrsig_batch_verify_256: min 43.7us / avg 43.7us / max 43.7us
schnorrsig_batch_verify_512: min 41.4us / avg 41.4us / max 41.5us
schnorrsig_batch_verify_1024: min 38.7us / avg 38.7us / max 38.8us
schnorrsig_batch_verify_2048: min 36.8us / avg 36.9us / max 37.0us
schnorrsig_batch_verify_4096: min 35.4us / avg 35.5us / max 35.6us
schnorrsig_batch_verify_8192: min 35.4us / avg 35.5us / max 35.6us


pre-safegcd ENDO=on GMP=off CC=clang-10
schnorrsig_sign: min 37.6us / avg 37.6us / max 37.7us
schnorrsig_verify: min 66.5us / avg 66.6us / max 66.7us
schnorrsig_batch_verify_1: min 70.8us / avg 70.9us / max 70.9us
schnorrsig_batch_verify_2: min 63.7us / avg 63.7us / max 63.8us
schnorrsig_batch_verify_4: min 58.7us / avg 58.8us / max 58.8us
schnorrsig_batch_verify_8: min 56.1us / avg 56.2us / max 56.3us
schnorrsig_batch_verify_16: min 56.6us / avg 56.6us / max 56.7us
schnorrsig_batch_verify_32: min 56.5us / avg 56.5us / max 56.6us
schnorrsig_batch_verify_64: min 52.5us / avg 52.5us / max 52.6us
schnorrsig_batch_verify_128: min 48.2us / avg 48.3us / max 48.3us
schnorrsig_batch_verify_256: min 43.7us / avg 43.7us / max 43.7us
schnorrsig_batch_verify_512: min 41.4us / avg 41.4us / max 41.4us
schnorrsig_batch_verify_1024: min 38.7us / avg 38.7us / max 38.8us
schnorrsig_batch_verify_2048: min 36.9us / avg 36.9us / max 37.0us
schnorrsig_batch_verify_4096: min 35.4us / avg 35.5us / max 35.6us
schnorrsig_batch_verify_8192: min 35.5us / avg 35.5us / max 35.5us

post-safegcd CC=clang-10
schnorrsig_sign: min 34.3us / avg 34.4us / max 34.4us
schnorrsig_verify: min 61.7us / avg 61.7us / max 61.9us
schnorrsig_batch_verify_1: min 69.8us / avg 69.8us / max 69.9us
schnorrsig_batch_verify_2: min 63.0us / avg 63.1us / max 63.1us
schnorrsig_batch_verify_4: min 58.3us / avg 58.3us / max 58.4us
schnorrsig_batch_verify_8: min 55.5us / avg 55.6us / max 55.7us
schnorrsig_batch_verify_16: min 55.9us / avg 56.0us / max 56.1us
schnorrsig_batch_verify_32: min 56.1us / avg 56.2us / max 56.3us
schnorrsig_batch_verify_64: min 52.6us / avg 52.6us / max 52.7us
schnorrsig_batch_verify_128: min 48.2us / avg 48.3us / max 48.4us
schnorrsig_batch_verify_256: min 43.7us / avg 43.8us / max 43.8us
schnorrsig_batch_verify_512: min 41.4us / avg 41.5us / max 41.5us
schnorrsig_batch_verify_1024: min 38.7us / avg 38.7us / max 38.8us
schnorrsig_batch_verify_2048: min 36.9us / avg 36.9us / max 37.1us
schnorrsig_batch_verify_4096: min 35.4us / avg 35.5us / max 35.7us
schnorrsig_batch_verify_8192: min 35.5us / avg 35.5us / max 35.6us


pre-safegcd ENDO=on GMP=off CC=clang-11
schnorrsig_sign: min 37.5us / avg 37.5us / max 37.6us
schnorrsig_verify: min 66.3us / avg 66.3us / max 66.5us
schnorrsig_batch_verify_1: min 70.4us / avg 70.4us / max 70.5us
schnorrsig_batch_verify_2: min 63.3us / avg 63.3us / max 63.4us
schnorrsig_batch_verify_4: min 58.5us / avg 58.5us / max 58.6us
schnorrsig_batch_verify_8: min 55.6us / avg 55.6us / max 55.7us
schnorrsig_batch_verify_16: min 56.1us / avg 56.2us / max 56.3us
schnorrsig_batch_verify_32: min 56.3us / avg 56.4us / max 56.4us
schnorrsig_batch_verify_64: min 52.5us / avg 52.5us / max 52.6us
schnorrsig_batch_verify_128: min 48.2us / avg 48.2us / max 48.2us
schnorrsig_batch_verify_256: min 43.6us / avg 43.7us / max 43.7us
schnorrsig_batch_verify_512: min 41.4us / avg 41.4us / max 41.4us
schnorrsig_batch_verify_1024: min 38.6us / avg 38.7us / max 38.7us
schnorrsig_batch_verify_2048: min 36.8us / avg 36.9us / max 36.9us
schnorrsig_batch_verify_4096: min 35.4us / avg 35.4us / max 35.5us
schnorrsig_batch_verify_8192: min 35.4us / avg 35.5us / max 35.5us

post-safegcd CC=clang-11
schnorrsig_sign: min 34.3us / avg 34.4us / max 34.4us
schnorrsig_verify: min 61.3us / avg 61.5us / max 61.7us
schnorrsig_batch_verify_1: min 69.5us / avg 69.5us / max 69.6us
schnorrsig_batch_verify_2: min 62.7us / avg 62.8us / max 62.8us
schnorrsig_batch_verify_4: min 58.1us / avg 58.2us / max 58.3us
schnorrsig_batch_verify_8: min 55.9us / avg 56.0us / max 56.1us
schnorrsig_batch_verify_16: min 56.5us / avg 56.6us / max 56.6us
schnorrsig_batch_verify_32: min 56.2us / avg 56.3us / max 56.3us
schnorrsig_batch_verify_64: min 52.6us / avg 53.4us / max 55.1us
schnorrsig_batch_verify_128: min 48.2us / avg 48.8us / max 49.6us
schnorrsig_batch_verify_256: min 43.7us / avg 43.8us / max 43.8us
schnorrsig_batch_verify_512: min 41.4us / avg 41.6us / max 41.7us
schnorrsig_batch_verify_1024: min 38.7us / avg 38.8us / max 38.8us
schnorrsig_batch_verify_2048: min 36.9us / avg 37.0us / max 37.0us
schnorrsig_batch_verify_4096: min 35.4us / avg 35.5us / max 35.6us
schnorrsig_batch_verify_8192: min 35.4us / avg 35.5us / max 35.5us

sipa, Mar 31 '21 21:03

I noticed higher variance in my benchmarks than I was used to, so I reran the experiment in a more controlled environment (gcc 10.2.0). I no longer found a performance degradation post-rebase. Single schnorrsig_verify was fastest post-rebase, compared to the two pre-rebase configurations (with endo, bignum=gmp and bignum=no), and batch verification performed very similarly across all three configurations.

jonasnick, Apr 01 '21 13:04

I added a commit to reduce the batch verification randomizers to 128 bits. This gives up to a 9% speedup.

jonasnick, Apr 07 '21 21:04

I intend to remove the batch verification speedup graph from BIP-340 and place it in libsecp's doc directory instead. To that end, I added a commit that allows recreating the graph (originally proposed for BIP-340).

I removed the log fit from the graph and increased the granularity instead. The shape of the graph may change again once the optimal Pippenger threshold and window sizes are updated to reflect the latest improvements.

jonasnick, May 17 '21 14:05