
[IBD] specialize CheckBlock's input & coinbase checks

Open l0rinc opened this issue 10 months ago • 21 comments

This change is part of [IBD] - Tracking PR for speeding up Initial Block Download

Summary

CheckBlock's latency is critical for efficient block validation, and its input and coinbase checks also run during transaction validation, including mempool acceptance and new block creation.

This PR improves performance and maintainability by introducing the following changes:

  • Simplified checks for the most common cases (1 or 2 inputs - 70-90% of transactions have a single input).
  • Optimized the general case by replacing std::set with a sorted std::vector for improved locality.
  • Simplified null-prevout checks from linear to constant time (a rough sketch of the core idea follows below).
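
A minimal, self-contained sketch of the container-level idea behind the second bullet, using plain integers instead of the PR's actual COutPoint values (illustrative only, not the PR's diff):

#include <algorithm>
#include <set>
#include <vector>

// Before (conceptually): insert each value into a std::set and fail on a
// non-insertion. Every insert allocates a tree node and chases pointers.
bool HasDuplicateWithSet(const std::vector<int>& values)
{
    std::set<int> seen;
    for (const int v : values) {
        if (!seen.insert(v).second) return true;
    }
    return false;
}

// After (conceptually): sort a contiguous copy and scan adjacent elements,
// trading per-element node allocations for cache-friendly sequential access.
bool HasDuplicateWithSortedVector(std::vector<int> values)
{
    std::sort(values.begin(), values.end());
    return std::adjacent_find(values.begin(), values.end()) != values.end();
}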

Context, concerns

As noted by @sipa in https://github.com/bitcoin/bitcoin/pull/31682#pullrequestreview-2570273162, this is consensus-critical code (especially since we've had an actual inflation bug related to duplicate checking). However, the changes here are confined to a single function and easy to review, and I have a general preference for avoiding std::set when a sorted std::vector works just as well.

The goal of this change is to make block validation faster via a low-risk alternative.

Measurements

C++ compiler .......................... AppleClang 16.0.0.16000026

Before:

ns/block block/s err% total benchmark
372,743.63 2,682.81 1.1% 10.99 CheckBlockBench
ns/op op/s err% total benchmark
3,304,694.54 302.60 0.5% 11.05 DuplicateInputs
9,585.10 104,328.55 0.1% 11.03 ProcessTransactionBench

After:

ns/block block/s err% total benchmark
179,971.00 5,556.45 0.3% 11.02 CheckBlockBench
ns/op op/s err% total benchmark
963,177.98 1,038.23 1.7% 10.92 DuplicateInputs
9,410.90 106,259.75 0.3% 11.01 ProcessTransactionBench
  • 2.07x faster CheckBlockBench
  • 3.4x faster DuplicateInputs
  • 1.8% faster ProcessTransactionBench

C++ compiler .......................... GNU 13.3.0

Before:

ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
1,096,261.84 912.19 0.1% 7,963,390.88 3,487,375.26 2.283 1,266,941.00 1.8% 11.03 CheckBlockBench
ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
8,366,309.48 119.53 0.0% 23,865,177.67 26,620,160.23 0.897 5,972,887.41 4.0% 10.78 DuplicateInputs
56,199.57 17,793.73 0.1% 229,263.01 178,766.31 1.282 15,509.97 0.5% 10.91 ProcessTransactionBench

After:

ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
834,855.94 1,197.81 0.0% 6,518,548.86 2,656,039.78 2.454 919,160.84 1.5% 10.78 CheckBlockBench
ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
4,261,492.75 234.66 0.0% 17,379,823.40 13,559,793.33 1.282 4,265,714.28 3.4% 11.00 DuplicateInputs
55,819.53 17,914.88 0.1% 227,828.15 177,520.09 1.283 15,184.36 0.4% 10.91 ProcessTransactionBench
  • 31% faster CheckBlockBench
  • 2x faster DuplicateInputs
  • 0.5% faster ProcessTransactionBench

While the point of the change isn't necessarily to speed up IBD, you can see the IBD measurements at https://github.com/bitcoin/bitcoin/pull/31682#issuecomment-2606765689


Related PRs:

  • https://github.com/bitcoin/bitcoin/pull/14397 - a very similar solution, which I only found after opening this PR (it was closed for inactivity)
  • https://github.com/bitcoin/bitcoin/pull/14837 - a faster but considerably more complex alternative (it was closed for complexity and inactivity).

l0rinc avatar Jan 18 '25 15:01 l0rinc

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31682.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept NACK 1440000bytes
Concept ACK sipa
Approach ACK Raimo33

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #33629 (Cluster mempool by sdaftuar)
  • #32729 (test,refactor: extract script template helpers & widen sigop count coverage by l0rinc)
  • #32554 (bench: replace embedded raw block with configurable block generator by l0rinc)
  • #31868 ([IBD] specialize block serialization by l0rinc)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot avatar Jan 18 '25 15:01 DrahtBot

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/35818754067

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

DrahtBot avatar Jan 18 '25 15:01 DrahtBot

How much of an improvement does this translate into for a user? The bar should be pretty high for touching such critical code.

darosior avatar Jan 19 '25 01:01 darosior

The bar should be pretty high for touching such critical code.

Absolutely, every change should have an extremely high bar here. For context, compared to the changes described in https://bitcoincore.org/en/2018/09/20/notice, this update doesn’t involve caching or reusing previously calculated values elsewhere. Instead, it operates entirely within the method, maintaining a black-box approach that ensures the exact same behavior. To validate this, I’ve added an extensive suite of tests that thoroughly cover these cases.

How much of an improvement does this translate into for a user?

The primary goal of this change is to improve block validation performance while maintaining a low risk profile. The PR description includes detailed measurements of the improvements. If there are additional benchmarks or scenarios you’d like me to measure, please let me know, and I’ll be happy to provide them.

l0rinc avatar Jan 19 '25 18:01 l0rinc

it is called whenever a tx is validated (so also during mempool acceptance)

Updated the description, thanks.

how much this improves transaction validation as a whole

I can only optimize small parts of these, step-by-step. This was the next part I found that could be cheaply sped up (it already had tests and benchmarks, signaling that others were also interested in it).

I have created a chainman.ProcessTransaction benchmark locally (will commit separately, not strictly related to this PR) to check the overall effect, but as you've stated, this is a small part of the whole process, so I regularly got only a roughly 1% speedup. Not huge, but we have to decide whether it's proportional to the risk or not.

ProcessTransaction results
git checkout bf090135d01076aa6d4fbfe7a185682dd6f9f489 >/dev/null 2>&1 \
&& cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release >/dev/null 2>&1 \
&& cmake --build build -j$(nproc) >/dev/null 2>&1 \
&& build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000 \
&& git checkout f25b91cd5a50e1035bbdb884ab11858702fadf33 >/dev/null 2>&1 \
&& cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release >/dev/null 2>&1 \
&& cmake --build build -j$(nproc) >/dev/null 2>&1 \
&& build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000
ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
56,570.90 17,676.93 0.1% 229,118.71 179,945.97 1.273 15,499.48 0.5% 11.03 ProcessTransactionBench
ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
56,152.78 17,808.56 0.0% 228,009.18 178,596.02 1.277 15,224.76 0.4% 10.92 ProcessTransactionBench

Similarly, I ran a few reindex-chainstates (as proxies for a stable IBD) and, as posted in the description, got a similar ~1% speedup (2 runs per bench, 19923 & 19933 seconds before, and 19681 and 19737 seconds after). Not earth-shattering globally, but I think the risk (given all the tests and benchmarks) is relatively low as well.

l0rinc avatar Jan 20 '25 18:01 l0rinc

I regularly got roughly ~1% speedup only. Not huge, but we have to decide if it's proportional to the risk or not.

got a similar ~1% speedup (2 runs per bench, 19923 & 19933 seconds before, and 19681 and 19737 seconds after).

NACK

1440000bytes avatar Jan 20 '25 20:01 1440000bytes

@1440000bytes, this behavior isn't helpful - you're just reiterating what I've already explained, without suggesting workable solutions.

Anyway, I've created an alternative PR that focuses solely on the tests and benchmarks, excluding the controversial optimizations. It validates the affected code and removes microbenchmarks that overemphasize this segment's significance, replacing them with a higher-level one, as suggested by @mzumsande.

l0rinc avatar Jan 21 '25 14:01 l0rinc

@1440000bytes, this behavior isn't helpful - you're just reiterating what I've already explained, without suggesting workable solutions.

I don't think optimization is worth changing consensus code.

1440000bytes avatar Jan 21 '25 14:01 1440000bytes

~I pushed a new version that is based on https://github.com/bitcoin/bitcoin/pull/31699 - to separate the testing/benching from the consensus change. This will remain in draft to gather comments.~


I've also retriggered an IBD (simulated via multiple -reindex-chainstate runs for stability). Here I've compared the performance of this PR after applying my other IBD optimizations (https://github.com/bitcoin/bitcoin/pull/31144 and https://github.com/bitcoin/bitcoin/pull/31645) to measure the future potential of this change. I ran it up to block 800k this time to avoid the validation overhead.

Benchmark results
hyperfine --runs 2 --parameter-list COMMIT dcc3fe7f9c4b68a4b4622ef958a3262cd8a4b08a,053ccbefa3563e776932bb847aae9c302d07de61 --prepare 'rm -rfd /mnt/my_storage/BitcoinData/debug.log build/ && git checkout {COMMIT} && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' --cleanup 'mv /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s.log)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0'
Benchmark 1: COMMIT=dcc3fe7f9c4b68a4b4622ef958a3262cd8a4b08a ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0
  Time (mean ± σ):     12537.220 s ±  7.591 s    [User: 13362.903 s, System: 523.533 s]
  Range (min … max):   12531.852 s … 12542.588 s    2 runs
 
Benchmark 2: COMMIT=053ccbefa3563e776932bb847aae9c302d07de61 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0
  Time (mean ± σ):     12251.387 s ± 52.389 s    [User: 13169.814 s, System: 543.757 s]
  Range (min … max):   12214.343 s … 12288.432 s    2 runs
 
Summary
  COMMIT=053ccbefa3563e776932bb847aae9c302d07de61 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0 ran
    1.02 ± 0.00 times faster than COMMIT=dcc3fe7f9c4b68a4b4622ef958a3262cd8a4b08a ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0

This reveals a ~2.5% speedup (taking the best runs out of 2).

l0rinc avatar Jan 22 '25 09:01 l0rinc

NACK https://github.com/bitcoin/bitcoin/commit/0a40b8c8a09344c1ef9167be2f357da7a551d65d

Reason: https://github.com/bitcoin/bitcoin/pull/31682#issuecomment-2604902068

1440000bytes avatar Jan 22 '25 20:01 1440000bytes

I don't like your behavior of continually pushing minimal changes that affect consensus code

1440000bytes avatar Jan 22 '25 20:01 1440000bytes

@1440000bytes keep your comments on topic. the topic is technical. comments about people are not relevant.

pinheadmz avatar Jan 22 '25 20:01 pinheadmz

@1440000bytes keep your comments on topic. the topic is technical. comments about people are not relevant.

It is on topic and a response to https://github.com/bitcoin/bitcoin/pull/31682#issuecomment-2604837850

I reviewed it technically before others did.

1440000bytes avatar Jan 22 '25 20:01 1440000bytes

Everyone needs to cool down on this pull request. 24-hour bans are next for users who mention people and not code. After that it's 7-day bans.

pinheadmz avatar Jan 22 '25 20:01 pinheadmz

Merge this pull request if my comments are so off topic.

1440000bytes avatar Jan 22 '25 21:01 1440000bytes

Interesting how std::vector is almost always the best performing container, even when uniqueness or sorting is required.

jmoik avatar Jan 23 '25 11:01 jmoik

std::vector is probably less likely to have bugs than std::set, so there's an improvement in that regard as well.

I feel like we shouldn't need to fully sort, though. Perhaps a custom check would improve performance more significantly? (OTOH, the "normal flow" scenario might end up being a full sort anyway, unsure)

luke-jr avatar Jan 23 '25 15:01 luke-jr

(Side-note: In theory std::flat_set constructor (4) (https://en.cppreference.com/w/cpp/container/flat_set/flat_set) could be used as a drop-in replacement to do the sort+uniq, but this requires C++23 and isn't even implemented in any released standard lib and would only turn two calls into one.)
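
For reference, that idea might look roughly like the following sketch once library support exists (C++23 only; as mentioned, no released standard library implements std::flat_set yet, so this is untested and purely illustrative):

#include <cstddef>
#include <flat_set>
#include <vector>

// The container-taking std::flat_set constructor sorts its input and removes
// duplicates in one step; a size change afterwards reveals that a duplicate existed.
bool HasDuplicate(std::vector<int> values)
{
    const std::size_t n = values.size();
    const std::flat_set<int> unique(std::move(values));
    return unique.size() != n;
}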

maflcko avatar Jan 24 '25 10:01 maflcko

I feel like we shouldn't need to fully sort, though

We likely wouldn't need that just for duplicate checking, but this seems "good enough", since it's a built-in function - especially now that I have integrated it with null checks as well.

The code now separates (in many tiny commits to ease review) validations by input sizes:

  • 1 input - it's either a coinbase, in which case we need neither the duplicate check nor the null-prevout check, or it isn't, and we're done with validation. Given that this is currently the most common case, it seemed important to treat it separately.
  • 2 inputs - we check that the two prevouts are not equal and that neither is null.
  • otherwise we sort the prevouts, check duplicates via a built-in (possibly SIMD-optimized) method, and check for nulls at the head of the sorted values in constant time (see the sketch below; we could also check for null during sortedPrevouts collection, but that would change the validation order for duplicates & nulls).
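
A compressed, hypothetical sketch of that dispatch (a stand-in Outpoint type instead of COutPoint, and a free function instead of the actual CheckTransaction changes - illustrative only):

#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

struct Outpoint {
    std::array<uint8_t, 32> txid{}; // all-zero for the null/coinbase marker
    uint32_t n{0};                  // output index, 0xFFFFFFFF for the null marker
    bool IsNull() const { return txid == std::array<uint8_t, 32>{} && n == 0xFFFFFFFF; }
    friend auto operator<=>(const Outpoint&, const Outpoint&) = default;
};

// Returns true if the prevouts contain a duplicate or a null entry, roughly the
// two per-input conditions the real check rejects for non-coinbase transactions.
bool HasDuplicateOrNullPrevout(std::vector<Outpoint> prevouts)
{
    if (prevouts.size() == 1) {   // most common case: nothing to deduplicate
        return prevouts[0].IsNull();
    }
    if (prevouts.size() == 2) {   // direct comparisons, no sorting needed
        return prevouts[0] == prevouts[1] || prevouts[0].IsNull() || prevouts[1].IsNull();
    }
    std::sort(prevouts.begin(), prevouts.end());
    if (std::adjacent_find(prevouts.begin(), prevouts.end()) != prevouts.end()) return true;
    // The all-zero txid sorts first, so any null prevout must be among the
    // leading entries that share that txid - a bounded check after sorting.
    for (auto it = prevouts.begin(); it != prevouts.end() && it->txid == std::array<uint8_t, 32>{}; ++it) {
        if (it->IsNull()) return true;
    }
    return false;
}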

I've also extended the tests to cover random null check validations.

l0rinc avatar Jan 25 '25 14:01 l0rinc

I experimented a bit more and it looks like I got another 17% speed improvement (on block413567 on my Linux machine) over your changes @l0rinc, by using a faster comparison operator and doing direct comparisons for relatively small input sizes. Can I create a new pull request for that?

jmoik avatar Jan 30 '25 14:01 jmoik

by using a faster comparison operator and doing direct comparisons for relatively small input sizes.

Quadratic comparison would likely be faster than sorting for very small input sizes (e.g. <10, I guess), but it would complicate the code and testing considerably for a case that is likely not representative. I also thought of that but I want the code to be faster while not being more difficult to comprehend. But feel free to challenge my bias by pushing a different PR with those changes and let the best approach win! You can cherry-pick it on top of these changes if you want.
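
For context, the quadratic variant being discussed might look something like this sketch (hypothetical, simplified to plain integers, and not part of this PR):

#include <cstddef>
#include <vector>

// O(n^2) pairwise comparison: for a handful of inputs this avoids sorting
// entirely, at the cost of an extra code path to review, test and maintain.
bool HasDuplicateQuadratic(const std::vector<int>& values)
{
    for (std::size_t i = 0; i + 1 < values.size(); ++i) {
        for (std::size_t j = i + 1; j < values.size(); ++j) {
            if (values[i] == values[j]) return true;
        }
    }
    return false;
}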

l0rinc avatar Jan 30 '25 15:01 l0rinc

Rebased to fix conflicts with https://github.com/bitcoin/bitcoin/pull/33116

l0rinc avatar Aug 13 '25 23:08 l0rinc

Concept ACK.

will review soon. I'm a fan of the simpler std::vector approach.

Raimo33 avatar Sep 08 '25 04:09 Raimo33

The bar should be pretty high for touching such critical code.

I agree, but I argue the changes proposed are small enough that this should be seriously considered for merge. The diff makes it hard to grasp, but few lines actually changed, and for the better. I'm torn on cae4bb877f9ef1350825ed42a9b97090d5920d39, but for the rest, Concept ACK, Approach ACK, ACK d28302fa2c09c2eac3daad7116dc9812fe01928b.

I've run the tests and double-checked the logic.

taskset -c 1 ./bench_bitcoin --filter="(ProcessTransactionBench)" --min-time=10000

Before:

ns/op op/s err% total benchmark
52,941.54 18,888.76 0.1% 10.79 ProcessTransactionBench

After:

ns/op op/s err% total benchmark
52,110.83 19,189.87 0.1% 10.78 ProcessTransactionBench

Without cae4bb877f9ef1350825ed42a9b97090d5920d39:

- else if (tx.vin.size() == 2) {
-       if (tx.vin[0].prevout == tx.vin[1].prevout) {
-           return state.Invalid(TxValidationResult::TX_CONSENSUS, "bad-txns-inputs-duplicate");
-       }
-       if (tx.vin[0].prevout.IsNull() || tx.vin[1].prevout.IsNull()) {
-           return state.Invalid(TxValidationResult::TX_CONSENSUS, "bad-txns-prevout-null");
-       }
-}
ns/op op/s err% total benchmark
52,778.63 18,947.06 0.1% 10.77 ProcessTransactionBench

+1.6% faster overall; only +0.3% faster without cae4bb877f9ef1350825ed42a9b97090d5920d39.

Looks like the gain comes from handling txs with 1-2 inputs separately.

Raimo33 avatar Sep 08 '25 06:09 Raimo33