
Seraphis Performance Results

UkoeHB opened this issue • 16 comments

UPDATE: See this comment for the most current results.

Seraphis Performance Results

Below I display and discuss performance results from several transaction protocol mock-ups (CLSAG, Triptych, Seraphis-Concise, Seraphis-Merge, Seraphis-Squashed), collected during one test run. The purpose of this report is to inform engineering/design decisions around a potential real-world implementation of Seraphis.

Preliminaries

Test Context

The test was run single-threaded on a zenith2alpha motherboard with an AMD Ryzen Threadripper 3970X 32-Core processor and 256GB RAM. It was run on my Seraphis perf test branch at commit e0620b4f71faa20e69afcac6206a4180102f251d, and started at 2021-11-09 : 18:02:12 UTC (according to the machine's clock). The test command was ./build/Linux/seraphis_perf/release/tests/performance_tests/performance_tests --filter=\*mock_tx\* --stats --loop-multiplier=10 --timings-database=/home/user/seraphis_perf/test_results/perf_3.txt > /home/user/seraphis_perf/test_results/stdout_perftest_3.txt.

Terminology

  • Enote (a.k.a. 'output' or 'txo'): A small message containing an amount and a recipient address.
  • Input: A transaction input (an enote being spent).
  • Output: A transaction output (a new enote).
  • Reference set size: The number of old enotes referenced by an input's membership proof. A membership proof demonstrates that a transaction spends an enote that exists in the ledger. If the reference set size is >1, then proof verifiers will not know which of the referenced enotes is being spent (just that the spent enote is one of the referenced enotes).
  • Reference set size decomposition: ref_set_size = n^m, where n is the 'decomposition base' and m is the exponent. This detail is relevant to Grootle membership proofs (used in Triptych, Seraphis, Lelantus-Spark), which require n and m to be integers (a small sketch of this constraint follows this list).
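
To make the decomposition constraint concrete, here is a minimal sketch (not code from the perf branch; `try_decompose` is a made-up name) that checks whether a candidate reference set size is an exact power of a given base:

```cpp
// Minimal sketch: check whether ref_set_size has an exact n^m decomposition,
// as required by Grootle membership proofs.
#include <cstdint>
#include <cstdio>

bool try_decompose(std::uint64_t ref_set_size, std::uint64_t n, std::uint64_t &m_out)
{
    if (n < 2 || ref_set_size == 0)
        return false;

    std::uint64_t m = 0;
    std::uint64_t acc = 1;
    while (acc < ref_set_size)
    {
        acc *= n;
        ++m;
    }
    if (acc != ref_set_size)
        return false;  // e.g. 100 is not an integer power of 2 or 3

    m_out = m;
    return true;
}

int main()
{
    // Candidate sizes discussed later in this report: 2^6 = 64, 3^4 = 81, 2^7 = 128.
    for (std::uint64_t size : {64, 81, 128})
    {
        for (std::uint64_t n : {2, 3})
        {
            std::uint64_t m;
            if (try_decompose(size, n, m))
                std::printf("ref_set_size %llu = %llu^%llu\n",
                    (unsigned long long)size, (unsigned long long)n, (unsigned long long)m);
        }
    }
    return 0;
}
```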

Test Questions

There were a number of questions I wanted to answer with this test.

  • How are tx verification/size costs affected by changing the reference set size? What is the impact on those costs if transactions are batch-verified?
    • Bulletproofs+ proofs can be batch-verified for non-trivial savings.
  • How are tx verification/size costs of Grootle-based tx protocols affected by changing the ref set size decomposition base (n)?
  • Do any Grootle-based protocols have relative advantages compared to each other when increasing the number of transaction inputs (all else equal)?
    • CLSAG is trivially worse-performing than Grootle proofs for equivalent reference set sizes (i.e. the 'all else equal'), so it was not considered here.
  • Are there any advantages to 'splitting' Bulletproofs+ range proofs within transactions? What is the interaction between range-proof splitting and batch-verifying transactions?
    • A BP+ range proof can aggregate multiple range proofs in a single structure, or those proofs can be separated into different structures (a rough size comparison follows this list).
  • Note: All protocol types are ~equivalent as a function of outputs, so that variable is not of great interest.
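
As a rough feel for the size trade-off behind the BP+ splitting question above, the sketch below estimates the serialized size of a BP+ structure aggregating k 64-bit range proofs. It assumes the usual BP+ layout (L/R vectors of length log2(64 · padded count), plus 3 group elements and 3 scalars) at 32 bytes per serialized element, and ignores varint overhead; `bpp_size_bytes` is an illustrative helper, not a function from the codebase.

```cpp
// Rough BP+ size estimate under the stated assumptions (32 bytes per element,
// no varint overhead); compares one large aggregate against a split pair.
#include <cstddef>
#include <cstdio>

std::size_t next_pow2(std::size_t k)
{
    std::size_t p = 1;
    while (p < k) p <<= 1;
    return p;
}

// approximate serialized size (bytes) of one BP+ structure aggregating k 64-bit range proofs
std::size_t bpp_size_bytes(std::size_t k)
{
    std::size_t rounds = 0;
    for (std::size_t bits = 64 * next_pow2(k); bits > 1; bits >>= 1)
        ++rounds;  // rounds = log2(64 * padded k)
    return 32 * (2 * rounds + 6);  // 2*rounds L/R elements + 3 group elements + 3 scalars
}

int main()
{
    // One aggregate covering 16 range proofs vs. the same 16 split into two structures.
    std::printf("1 x aggregate-16 : %zu bytes\n", bpp_size_bytes(16));
    std::printf("2 x aggregate-8  : %zu bytes\n", bpp_size_bytes(8) + bpp_size_bytes(8));
    return 0;
}
```

Under these assumptions, one 16-aggregate proof comes to ~832 bytes while two 8-aggregate proofs total ~1536 bytes, which is the size penalty that splitting trades against verification behavior.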

Results

Experiment 1: reference set size (no batching)

Fixed parameters: 2-input/2-output, no BP+ splitting, no tx verification batching. Note that the verification plot is logarithmic in the y-axis.

refset_2series_ver refset_2series_size

Experiment 2: reference set size (25 tx per batch)

Fixed parameters: 2-input/2-output, no BP+ splitting, 25 tx per batch (normalized to cost/tx). Note that the verification plot is logarithmic in the y-axis.

refset_2series_25batch_ver

Experiment 3: reference set size decomposition

Fixed parameters: 2-input/2-output, no BP+ splitting, no tx verification batching.

decomp_concise_ver decomp_concise_size

Experiment 4: inputs

Fixed parameters: 2-output, reference set decomposition 2^8, no BP+ splitting, no tx verification batching.

  • For squashed tx: batching shown in legend.

These results are normalized to 1 input for each protocol type.

inputs_ver inputs_size

Experiment 5: BP+ splitting (no batching)

Fixed parameters: 2-input, reference set decomposition 2^8, 1 tx per batch (normalized to cost/tx).

outputs_concise_1batch_ver outputs_concise_1batch_size

Experiment 6: BP+ splitting (25 tx per batch)

Fixed parameters: 2-input, reference set decomposition 2^8, 25 tx per batch (normalized to cost/tx).

outputs_concise_25batch_ver

Discussion

My key take-aways:

  • Compared to CLSAG, other protocols are about 4x more efficient in terms of verification costs. However, Seraphis-Squashed stands out as noticeably more efficient than the other options, especially when batching is taken into account.
  • There is no benefit to using a reference set size decomposition base other than n = 2 or n = 3. I believe 2^6 = 64, 3^4 = 81, and 2^7 = 128 are the best candidates for a reference set size using one of the new variants, taking CLSAG with ref_set_size = 16 as a baseline for comparison (16 is likely to be the reference set size after Monero's next hardfork).
  • Seraphis-Squashed scales significantly better with rising input counts than the other protocols, especially when batching is included.
  • Seraphis-Merge saves 96 bytes per input compared to Seraphis-Concise and Seraphis-Squashed, but the relative impact on verification times is negligible. The cost of Seraphis-Merge relative to the other variants is that it makes 'collaborative funding', where independent parties provide inputs to the same transaction, impossible.
  • While BP+ splitting noticeably reduces verification costs when there is no batching, those differences mostly disappear (and even reverse slightly) when batching is included. Since BP+ splitting produces larger transactions, and since this performance/size analysis is geared toward the 'upper end' of transaction throughput (where batching can and should be applied to all transactions, since blocks will contain many transactions), I don't recommend implementing BP+ splitting.

UkoeHB avatar Nov 11 '21 20:11 UkoeHB

How do those Seraphis "merge", "concise" and "squashed" variants differ? An ELI5 would be something for crypto-challenged people like me :)

rbrunner7 avatar Nov 12 '21 18:11 rbrunner7

@rbrunner7

  • Merge
    • membership proof: concise grootle
    • ownership proof (and proof that key images are correct): merged Seraphis composition proof (one proof for all inputs)
  • Concise
    • membership proof: concise grootle
    • ownership proof (and proof that key images are correct): separate Seraphis composition proofs (one proof per input)
  • Squashed
    • membership proof: simple grootle (concise vs plain doesn't matter)
    • ownership proof (and proof that key images are correct): separate Seraphis composition proofs (one proof per input)
    • extra: BP+ range proofs for inputs

Basically, 'concise' is the plain one, 'merge' is slightly more efficient but you have to sign all inputs at the same time (tx author must own all funds spent by the tx, different from other variants where multiple people can fund a tx), and 'squashed' allows simpler membership proofs at the cost of needing to make a range proof for each input.

Squashed can also use the merged composition proof; I just separated 'merge' into its own tx type for comparisons.
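
For anyone who prefers to see the layouts as data structures, here is a schematic sketch of the three variants described above. All type and struct names are illustrative only (not the mock-up's actual types), and an output range proof is assumed for every variant as usual.

```cpp
// Schematic per-variant proof layout; names are illustrative, not real types.
#include <vector>

struct ConciseGrootleProof {};  // membership proof (Concise/Merge variants)
struct PlainGrootleProof {};    // membership proof (Squashed variant; concise vs plain doesn't matter)
struct SpCompositionProof {};   // ownership + key-image-correctness proof
struct BppRangeProof {};        // Bulletproofs+ range proof

// 'Concise': one composition proof per input.
struct SpTxConcise
{
    std::vector<ConciseGrootleProof> membership_proofs;  // one per input
    std::vector<SpCompositionProof>  ownership_proofs;   // one per input
    BppRangeProof                    output_range_proof;
};

// 'Merge': a single composition proof covering all inputs
// (tx author must own every spent enote, so no collaborative funding).
struct SpTxMerge
{
    std::vector<ConciseGrootleProof> membership_proofs;  // one per input
    SpCompositionProof               merged_ownership_proof;
    BppRangeProof                    output_range_proof;
};

// 'Squashed': simple grootle membership proofs, plus range proofs on the inputs.
struct SpTxSquashed
{
    std::vector<PlainGrootleProof>   membership_proofs;  // one per input
    std::vector<SpCompositionProof>  ownership_proofs;   // one per input
    BppRangeProof                    input_range_proof;  // extra: inputs are range-proofed
    BppRangeProof                    output_range_proof;
};
```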

UkoeHB avatar Nov 12 '21 18:11 UkoeHB

@UkoeHB

other variants where multiple people can fund a tx

Are there any wallets that allow for this via gui? This seems to be just a theoretical benefit. (multiple people funding a transaction vs multiple transactions just seems to save on fees and make the history a bit cleaner?)

boogerlad avatar Nov 12 '21 20:11 boogerlad

Are there any wallets that allow for this via gui? This seems to be just a theoretical benefit. (multiple people funding a transaction vs multiple transactions just seems to save on fees and make the history a bit cleaner?)

It is not possible with the current protocol. One example of the technique's use is the BCH crowdfunding system.

UkoeHB avatar Nov 12 '21 20:11 UkoeHB

I think Seraphis-Merge for collaborative funding sounds so interesting.

ghost avatar Nov 12 '21 23:11 ghost

I think Seraphis-Merge for collaborative funding sounds so interesting.

Seraphis-Merge prevents collaborative funding (which Seraphis-Concise/Seraphis-Squashed can do). It allows a bit smaller tx (96 bytes fewer per tx input).

UkoeHB avatar Nov 12 '21 23:11 UkoeHB

This seems to be just a theoretical benefit.

@boogerlad @garth-xmr Collaborative funding is very real on the BCH blockchain. Over 9,000 BCH has been contributed to 85 projects through their Flipstarter system. At this point Flipstarter is the main funding mechanism for BCH development. You could almost say that it saved BCH from a devtax (the devtax advocates forked off anyway and their coin is now called eCash).

BCH's AnyoneCanPay special transaction type allows this permissionless, noncustodial, and self-hosted funding mechanism. For now, Monero has mostly relied on the CCS funding system, which is good but is also permissioned, centralized, and custodial. @emergent-reasons from BCH may be able to explain more. @plowsof recently sought and received funding through a Flipstarter campaign.

Rucknium avatar Nov 13 '21 03:11 Rucknium

Old introduction article to Flipstarter. It's fundamentally the same thing as @mikehearn's Lighthouse. It's an entirely self-hosted, open source funding solution that uses trustless assurance contracts (all or nothing) where pledgers' money stays fully within their control until the moment that the funding transaction pulls together all the pledged inputs. I don't want to derail the discussion, so please feel free to contact me on the internet under variations of "emergent_reasons".

emergent-reasons avatar Nov 13 '21 13:11 emergent-reasons

One design concern to keep in mind: Seraphis-Squashed requires range proofs for inputs. This may or may not entail limits on input counts (for comparison, the 16-output limit was imposed when Bulletproofs were introduced).

Maybe someone can comment on why Bulletproofs led to a 16-output limit. For example, there could be a DDOS vector where a max-in/out transaction causes batch verification to slow down significantly (i.e. greatly increases the average verification cost across a batch).

UPDATE: I did some testing and found that per-proof verification improves with batching even if you combine many small-aggregate proofs (e.g. aggregates of 2 proofs) with few large-aggregate proofs (e.g. aggregates of 128 proofs). Basically, batching does not open a DDOS vector. However, it is still necessary to impose a limit on the number of tx inputs, since BP+ has a config limit on the number of proofs you can aggregate. One reasonable limit might be 112 inputs, 16 outputs (for a per-tx maximum of 128, which is a power of 2).
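
As an illustration of that suggested limit, a hypothetical consensus-style check might look like the sketch below; the constant names and the function itself are made up for this example.

```cpp
// Illustration only: a hypothetical check matching the limit suggested above
// (112 inputs + 16 outputs = 128 range-proofed amounts per tx, a power of 2).
#include <cstddef>

constexpr std::size_t MAX_TX_OUTPUTS         = 16;
constexpr std::size_t MAX_TX_INPUTS          = 112;
constexpr std::size_t MAX_RANGE_PROOFED_KEYS = 128;  // assumed BP+ aggregation cap

bool check_in_out_counts(std::size_t num_inputs, std::size_t num_outputs)
{
    if (num_inputs == 0 || num_outputs == 0)
        return false;
    if (num_inputs > MAX_TX_INPUTS || num_outputs > MAX_TX_OUTPUTS)
        return false;
    // In the squashed model both inputs and outputs carry range proofs,
    // so their sum must stay within the aggregation cap.
    return num_inputs + num_outputs <= MAX_RANGE_PROOFED_KEYS;
}
```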

UPDATE2: The issue with large aggregations is that, if a large proof aggregation (e.g. a tx with many inputs) has no other large proofs to batch-verify with, then the per-range-proof verification cost of that large aggregation will be significantly higher than if it could be batched (~3-5x higher). This is the basic reason I investigated range proof splitting (so rare large aggregate proofs can be split into smaller proofs that will benefit more from batching).

UkoeHB avatar Nov 17 '21 18:11 UkoeHB

One question that comes to mind is the impact of GPU verification on verification cost per tx, especially with large batch sizes (25 txs, 100 txs, etc.). If a performance improvement of 50x or more over a single-core CPU can be achieved, this would be a material improvement that could enable, for example, a 256 ring size in Seraphis. Here is an example of performance improvements of 28.4x over an 8-core CPU and 8.4x over a 32-core CPU: https://www.dataversity.net/what-are-gpus-and-why-do-data-scientists-love-them/

ArticMine avatar Jan 20 '22 20:01 ArticMine

While GPUs might significantly improve verification times, I have a couple concerns.

  1. What is the engineering effort required to implement a GPU verifier?
  2. I think the Seraphis upper tx throughput limit would be ~10 TPS, with ~3.2 MB blocks (32 MB over 10 mins, for comparison to Bitcoin's 1 MB-per-10min-block that handles 7 TPS). TPS is constrained by the ability of common hardware to re-validate the blockchain. At ~10ms to verify a Seraphis tx with a 128 ring size on a high-end CPU, you can verify 100 tx per second. That's maybe 30-70 on low-to-medium-end hardware. If the rate of new tx is >= the rate of verification, then a computer can never catch up. Therefore the upper limit on network throughput should be an order of magnitude lower than the rate of verification (~20 TPS for 11-member CLSAG, ~13 TPS for a 64-member ring, ~10 TPS for a 128-member ring, ~6 TPS for a 256-member ring). If we allow GPUs to increase that limit, then we are basically saying only people with GPUs can fully validate the chain (and only people with GPUs can run full nodes).
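
To make the arithmetic in point 2 explicit, here is a back-of-the-envelope sketch. The per-tx verification times are placeholders chosen to be in the same ballpark as the figures above, not measured values.

```cpp
// Back-of-the-envelope throughput-cap arithmetic: verification rate = 1 / verify time,
// and the network throughput cap is kept ~an order of magnitude below that rate.
#include <cstdio>

int main()
{
    struct { const char *name; double verify_ms; } cases[] = {
        {"CLSAG, 11-member ring",      5.0},   // placeholder
        {"Seraphis, 64-member ring",   7.5},   // placeholder
        {"Seraphis, 128-member ring", 10.0},   // ~10 ms as stated above
        {"Seraphis, 256-member ring", 16.0},   // placeholder
    };

    const double safety_factor = 10.0;  // stay an order of magnitude below the verify rate

    for (const auto &c : cases)
    {
        const double verify_rate_tps = 1000.0 / c.verify_ms;  // tx verified per second
        const double throughput_cap  = verify_rate_tps / safety_factor;
        std::printf("%-28s verify %.0f tx/s -> cap ~%.0f TPS\n",
            c.name, verify_rate_tps, throughput_cap);
    }
    return 0;
}
```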

UkoeHB avatar Jan 21 '22 16:01 UkoeHB

only people with GPUs can run full nodes

Well, that would be a non-starter.

Can this code be optimized in assembly, and can any CPU extensions be leveraged?

Hueristic avatar Jan 22 '22 19:01 Hueristic

@UkoeHB Could you clarify this comment?:

I think the Seraphis upper tx throughput limit would be ~10 TPS...

Is this assuming use of a single core of a CPU or multiple cores? And if multiple cores, how many?

Rucknium avatar Feb 04 '22 00:02 Rucknium

@Rucknium Yes that estimate is based on single-threaded verification.

UkoeHB avatar Feb 04 '22 01:02 UkoeHB

Results of timing tests for experiments 1-4 on my medium-end-ish Core i7 (1.8 GHz) + 32 GB RAM, based on commit a63e39c6604d4ac493acee0f1c923dbefd50cca3.

To my eye, the relative efficiency gains look fairly close to the results above.


Experiment 1: reference set size (no batching)

Fixed parameters: 2-input/2-output, no BP+ splitting, no tx verification batching. Note that the verification plot is logarithmic in the y-axis.

Experiment 1

Experiment 2: reference set size (25 tx per batch)

Fixed parameters: 2-input/2-output, no BP+ splitting, 25 tx per batch (normalized to cost/tx). Note that the verification plot is logarithmic in the y-axis.

Experiment 2

Experiment 3: reference set size decomposition

Fixed parameters: 2-input/2-output, no BP+ splitting, no tx verification batching.

Experiment 3

Experiment 4: inputs

Fixed parameters: 2-output, reference set decomposition 2^8, no BP+ splitting, no tx verification batching.

For squashed tx: batching shown in legend.

These results are normalized to 1 input for each protocol type.

Experiment 4

j-berman avatar Feb 05 '22 06:02 j-berman

Here are updated test results as of commit 86b253ebb9a6e47d621cc38e162b252c50b6fd46 (which includes a small optimization: combining the BP+ and grootle multiexponentiations into one operation).

There were two optimizations to grootle proofs:

  1. use the A/B optimization from section 1.3 of this paper to allow smaller proofs (thanks to a certain someone for pointing this out!)
  2. batch-verify a small part of the proof (~5-10% speedup I think)

I also added an Sp-Plain tx type, which uses the grootle proof style that Spark recommends (I used 2-byte weights for aggregating keys during verification).

Note that I removed the BP+-splitting experiments since the prior results suggested it wasn't a worthwhile approach.

Results

Experiment 1: reference set size (no batching)

Fixed parameters: 2-input/2-output, no tx verification batching. Note that the verification plot is logarithmic in the y-axis.

refset_1batch_ver refset_1batch_size

Experiment 2: reference set size (25 tx per batch)

Fixed parameters: 2-input/2-output, 25 tx per batch (normalized to cost/tx). Note that the verification plot is logarithmic in the y-axis.

refset_25batch_ver

Experiment 3: reference set size decomposition

Fixed parameters: 2-input/2-output, no tx verification batching.

decomp_ver decomp_size

Experiment 4: inputs

Fixed parameters: 2-output, reference set decomposition 2^7, no tx verification batching.

  • For squashed tx: batching shown in legend.

These results are normalized to 1 input for each protocol type.

inputs_ver inputs_size

Experiment 5: 16 in/out batching

Fixed parameters: 16 inputs, 16 outputs, reference set decomposition 2^7.

16inout_ver

Discussion

  • Partial batching of grootle proofs also means combining the verification of many proofs in one multiexponentiation, which provides a speedup just on its own (a schematic sketch follows this list). It is the combination of these factors (partial batching, larger multiexponentiations) that accounts for the unexpectedly significant speedups seen here.
  • The grootle proof verification optimization means Sp-Squashed at 128 ring size can be batch-verified faster than CLSAG with 16 ring size.
  • The 'plain' grootle style used in Spark is barely faster than Sp-Concise, at the cost of noticeably larger transactions.
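
As referenced in the first point above, here is a schematic sketch of random-weight batch verification, i.e. flattening many proofs' verification equations into one multiexponentiation. The 'group' is just integers mod a small prime (a toy stand-in for curve points), so the sketch shows the structure of the technique, not real cryptography.

```cpp
// Toy random-weight batch verification: instead of checking each proof's relation
// sum_j(s_ij * P_ij) == identity separately, scale each proof's terms by a random
// weight and check one combined (flattened) multiexponentiation.
#include <cstdint>
#include <random>
#include <vector>

constexpr std::uint64_t Q = 2147483647ull;  // toy prime modulus (2^31 - 1)

using Scalar = std::uint64_t;  // scalars reduced mod Q
using Point  = std::uint64_t;  // element of the toy additive group Z_Q

struct Term { Scalar s; Point P; };  // one (scalar, point) term of a verification equation

// acc + s*P in Z_Q (operands kept below Q, so 64-bit products cannot overflow)
Point mul_add(Point acc, Scalar s, Point P)
{
    return (acc + (s % Q) * (P % Q) % Q) % Q;
}

// Verify several proofs at once: sum over proofs i of w_i * (sum_j s_ij * P_ij) == 0,
// evaluated as one flattened multiexponentiation over all (weighted scalar, point) pairs.
bool batch_verify(const std::vector<std::vector<Term>> &proofs, std::mt19937_64 &rng)
{
    Point acc = 0;
    for (const auto &proof : proofs)
    {
        const Scalar w = rng() % Q;  // random per-proof weight
        for (const Term &t : proof)
            acc = mul_add(acc, (w * (t.s % Q)) % Q, t.P);
    }
    return acc == 0;  // all proofs pass (except with negligible probability) iff this is zero
}
```

In a real verifier the flattened terms can also share base points (generators) across proofs, which is part of why larger multiexponentiations verify faster per term.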

UkoeHB avatar Feb 21 '22 20:02 UkoeHB