[IBD] specialize CheckBlock's input & coinbase checks
This change is part of [IBD] - Tracking PR for speeding up Initial Block Download
Summary
CheckBlock's latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.
This PR improves performance and maintainability by introducing the following changes:
- Simplified checks for the most common cases (1 or 2 inputs; 70-90% of transactions have a single input).
- Optimized the general case by replacing `std::set` with a sorted `std::vector` for improved locality.
- Simplified null prevout checks from linear to constant time.
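To make the container swap in the second bullet concrete, here is a minimal sketch (plain ints instead of `COutPoint`s, hypothetical function names, not the PR diff): both functions detect duplicates, but the sorted-vector version avoids per-element node allocations.

```cpp
// Sketch only: illustrates std::set vs sorted std::vector duplicate detection.
#include <algorithm>
#include <set>
#include <vector>

bool HasDuplicateViaSet(const std::vector<int>& values)
{
    std::set<int> seen; // node-based: one heap allocation per insert, scattered memory
    for (const int v : values) {
        if (!seen.insert(v).second) return true; // insert fails on a duplicate
    }
    return false;
}

bool HasDuplicateViaSortedVector(std::vector<int> values) // taken by value: sorting mutates
{
    std::sort(values.begin(), values.end()); // contiguous memory, cache-friendly
    return std::adjacent_find(values.begin(), values.end()) != values.end();
}
```

For the handful of inputs a typical transaction has, the sort tends to be cheaper than repeated tree inserts, and the data stays in one contiguous allocation.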
Context, concerns
As noted by @sipa in https://github.com/bitcoin/bitcoin/pull/31682#pullrequestreview-2570273162, this is consensus-critical code (especially since we've had an actual inflation bug related to duplicate checking).
However, the changes here are confined to a single function and easily reviewed, and I have a general preference for avoiding std::set when a sorted std::vector works just as well.
The goal of this change is to make block validation faster via a low-risk alternative.
Measurements
C++ compiler .......................... AppleClang 16.0.0.16000026
Before:
| ns/block | block/s | err% | total | benchmark |
|---|---|---|---|---|
| 372,743.63 | 2,682.81 | 1.1% | 10.99 | CheckBlockBench |
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 3,304,694.54 | 302.60 | 0.5% | 11.05 | DuplicateInputs |
| 9,585.10 | 104,328.55 | 0.1% | 11.03 | ProcessTransactionBench |
After:
| ns/block | block/s | err% | total | benchmark |
|---|---|---|---|---|
| 179,971.00 | 5,556.45 | 0.3% | 11.02 | CheckBlockBench |
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 963,177.98 | 1,038.23 | 1.7% | 10.92 | DuplicateInputs |
| 9,410.90 | 106,259.75 | 0.3% | 11.01 | ProcessTransactionBench |
- CheckBlockBench: 2.07x faster
- DuplicateInputs: 3.4x faster
- ProcessTransactionBench: 1.8% faster
C++ compiler .......................... GNU 13.3.0
Before:
| ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 1,096,261.84 | 912.19 | 0.1% | 7,963,390.88 | 3,487,375.26 | 2.283 | 1,266,941.00 | 1.8% | 11.03 | CheckBlockBench |
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 8,366,309.48 | 119.53 | 0.0% | 23,865,177.67 | 26,620,160.23 | 0.897 | 5,972,887.41 | 4.0% | 10.78 | DuplicateInputs |
| 56,199.57 | 17,793.73 | 0.1% | 229,263.01 | 178,766.31 | 1.282 | 15,509.97 | 0.5% | 10.91 | ProcessTransactionBench |
After:
| ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 834,855.94 | 1,197.81 | 0.0% | 6,518,548.86 | 2,656,039.78 | 2.454 | 919,160.84 | 1.5% | 10.78 | CheckBlockBench |
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 4,261,492.75 | 234.66 | 0.0% | 17,379,823.40 | 13,559,793.33 | 1.282 | 4,265,714.28 | 3.4% | 11.00 | DuplicateInputs |
| 55,819.53 | 17,914.88 | 0.1% | 227,828.15 | 177,520.09 | 1.283 | 15,184.36 | 0.4% | 10.91 | ProcessTransactionBench |
- CheckBlockBench: 31% faster
- DuplicateInputs: 2x faster
- ProcessTransactionBench: 0.5% faster
While the point of the change isn't necessarily to speed up IBD, you can see the IBD measurements at https://github.com/bitcoin/bitcoin/pull/31682#issuecomment-2606765689
Related PRs:
- https://github.com/bitcoin/bitcoin/pull/14397 - a very similar solution, which I only found after opening this PR (closed for inactivity)
- https://github.com/bitcoin/bitcoin/pull/14837 - a faster but considerably more complex alternative (closed for complexity and inactivity)
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Code Coverage & Benchmarks
For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31682.
Reviews
See the guideline for information on the review process.
| Type | Reviewers |
|---|---|
| Concept NACK | 1440000bytes |
| Concept ACK | sipa |
| Approach ACK | Raimo33 |
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Conflicts
Reviewers, this pull request conflicts with the following ones:
- #33629 (Cluster mempool by sdaftuar)
- #32729 (test,refactor: extract script template helpers & widen sigop count coverage by l0rinc)
- #32554 (bench: replace embedded raw block with configurable block generator by l0rinc)
- #31868 ([IBD] specialize block serialization by l0rinc)
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/35818754067
Hints
Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:
- Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
- A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
- An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
How much of an improvement does this translate into for a user? The bar should be pretty high for touching such critical code.
> The bar should be pretty high for touching such critical code.
Absolutely, every change should have an extremely high bar here. For context, compared to the changes described in https://bitcoincore.org/en/2018/09/20/notice, this update doesn’t involve caching or reusing previously calculated values elsewhere. Instead, it operates entirely within the method, maintaining a black-box approach that ensures the exact same behavior. To validate this, I’ve added an extensive suite of tests that thoroughly cover these cases.
> How much of an improvement does this translate into for a user?
The primary goal of this change is to improve block validation performance while maintaining a low risk profile. The PR description includes detailed measurements of the improvements. If there are additional benchmarks or scenarios you’d like me to measure, please let me know, and I’ll be happy to provide them.
> it is called whenever a tx is validated (so also during mempool acceptance)
Updated the description, thanks.
> how much this improves transaction validation as a whole
I can only optimize small parts of these, step by step. This was the next part I found that could be cheaply sped up (it already had tests and benchmarks, signaling that others were also interested in it).
I have created a chainman.ProcessTransaction benchmark locally (will commit separately, not strictly related to this PR) to check the overall effect, but as you've stated, this is a small part of the whole process, so I regularly got only a roughly 1% speedup. Not huge, but we have to decide whether it's proportional to the risk or not.
ProcessTransaction results
```bash
git checkout bf090135d01076aa6d4fbfe7a185682dd6f9f489 >/dev/null 2>&1 \
&& cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release >/dev/null 2>&1 \
&& cmake --build build -j$(nproc) >/dev/null 2>&1 \
&& build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000 \
&& git checkout f25b91cd5a50e1035bbdb884ab11858702fadf33 >/dev/null 2>&1 \
&& cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release >/dev/null 2>&1 \
&& cmake --build build -j$(nproc) >/dev/null 2>&1 \
&& build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000
```
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 56,570.90 | 17,676.93 | 0.1% | 229,118.71 | 179,945.97 | 1.273 | 15,499.48 | 0.5% | 11.03 | ProcessTransactionBench |
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 56,152.78 | 17,808.56 | 0.0% | 228,009.18 | 178,596.02 | 1.277 | 15,224.76 | 0.4% | 10.92 | ProcessTransactionBench |
Similarly, I ran a few reindex-chainstates (as proxies for a stable IBD) and, as posted in the description, got a similar ~1% speedup (2 runs per bench: 19923 & 19933 seconds before, and 19681 and 19737 seconds after).
Not earth-shattering globally, but I think the risk (given all the tests and benchmarks) is relatively low as well.
> I regularly got only a roughly 1% speedup. Not huge, but we have to decide whether it's proportional to the risk or not.

> got a similar ~1% speedup (2 runs per bench: 19923 & 19933 seconds before, and 19681 and 19737 seconds after).
NACK
@1440000bytes, this behavior isn't helpful - you're just reiterating what I've already explained, without suggesting workable solutions.
Anyway, I've created an alternative PR that focuses solely on the tests and benchmarks, excluding the controversial optimizations. It validates the affected code and removes the microbenchmarks that overemphasize this segment's significance, replacing them with a higher-level one, as suggested by @mzumsande.
> @1440000bytes, this behavior isn't helpful - you're just reiterating what I've already explained, without suggesting workable solutions.
I don't think optimization is worth changing consensus code.
~~I pushed a new version based on https://github.com/bitcoin/bitcoin/pull/31699, to separate the testing/benching from the consensus change. This will remain in draft to gather comments.~~
I've also retriggered an IBD (simulated via multiple -reindex-chainstate runs for stability).
Here I've compared the performance of this PR after applying my other IBD optimizations (https://github.com/bitcoin/bitcoin/pull/31144 and https://github.com/bitcoin/bitcoin/pull/31645) to measure the future potential of this change. I ran it up to block 800k this time to avoid the validation overhead.
Benchmark results
```bash
hyperfine --runs 2 --parameter-list COMMIT dcc3fe7f9c4b68a4b4622ef958a3262cd8a4b08a,053ccbefa3563e776932bb847aae9c302d07de61 --prepare 'rm -rfd /mnt/my_storage/BitcoinData/debug.log build/ && git checkout {COMMIT} && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' --cleanup 'mv /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s.log)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0'
```

```
Benchmark 1: COMMIT=dcc3fe7f9c4b68a4b4622ef958a3262cd8a4b08a ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0
  Time (mean ± σ):     12537.220 s ± 7.591 s    [User: 13362.903 s, System: 523.533 s]
  Range (min … max):   12531.852 s … 12542.588 s    2 runs

Benchmark 2: COMMIT=053ccbefa3563e776932bb847aae9c302d07de61 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0
  Time (mean ± σ):     12251.387 s ± 52.389 s    [User: 13169.814 s, System: 543.757 s]
  Range (min … max):   12214.343 s … 12288.432 s    2 runs

Summary
  COMMIT=053ccbefa3563e776932bb847aae9c302d07de61 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0 ran
    1.02 ± 0.00 times faster than COMMIT=dcc3fe7f9c4b68a4b4622ef958a3262cd8a4b08a ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=30000 -reindex-chainstate -connect=0 -printtoconsole=0
```
This reveals a ~2.5% speedup (taking the best runs out of 2).
NACK https://github.com/bitcoin/bitcoin/commit/0a40b8c8a09344c1ef9167be2f357da7a551d65d
Reason: https://github.com/bitcoin/bitcoin/pull/31682#issuecomment-2604902068
I don't like your behavior of repeatedly pushing minimal changes that affect consensus code
@1440000bytes keep your comments on topic. the topic is technical. comments about people are not relevant.
> @1440000bytes keep your comments on topic. the topic is technical. comments about people are not relevant.
It is on topic and a response to https://github.com/bitcoin/bitcoin/pull/31682#issuecomment-2604837850.
I reviewed the technical aspects before others did.
Everyone needs to cool down on this pull request. 24-hour bans are next for users that mention people and not code. After that, it's 7-day bans.
Merge this pull request if my comments are so off topic.
Interesting how std::vector is almost always the best-performing container, even when uniqueness or sorting is required.
std::vector is probably less likely to have bugs than std::set, so there's an improvement in that regard as well.
I feel like we shouldn't need to fully sort, though. Perhaps a custom check would improve performance more significantly? (OTOH, the "normal flow" scenario might end up being a full sort anyway, unsure)
(Side-note: In theory std::flat_set constructor (4) (https://en.cppreference.com/w/cpp/container/flat_set/flat_set) could be used as a drop-in replacement to do the sort+uniq, but this requires C++23 and isn't even implemented in any released standard lib and would only turn two calls into one.)
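For illustration, a sketch of that drop-in, assuming a C++23 standard library that ships `std::flat_set` (none of the ones we target does yet, as the side-note says); constructor (4) adopts the container, then sorts and deduplicates it in one step:

```cpp
// Illustrative only: hypothetical duplicate check via C++23 std::flat_set.
#include <flat_set>
#include <vector>

bool HasDuplicate(std::vector<int> values)
{
    const auto original_size = values.size();
    std::flat_set<int> unique(std::move(values)); // constructor (4): sort + unique in one call
    return unique.size() != original_size;        // a shrunken size means a duplicate was dropped
}
```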
> I feel like we shouldn't need to fully sort, though
We likely wouldn't need that just for duplicate checking, but this seems "good enough", since it's a built-in function - especially now that I have integrated it with null checks as well.
The code now separates (in many tiny commits to ease review) validation by input count:
- 1 input - it's either a coinbase, in which case we need neither the duplicate check nor the null-prevout check, or it isn't, and a single null check completes validation. Given that this is currently the most common case, it seemed important to treat it separately.
- 2 inputs - we check that the two prevouts are not equal and that neither is null.
- otherwise - we sort the prevouts, check duplicates via a built-in (possibly SIMD-optimized) method, and check for nulls at the head of the sorted values in constant time (we could also check for null during `sortedPrevouts` collection, but that would change the validation order for duplicates & nulls).
I've also extended the tests to cover random null check validations.
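Putting the three branches together, a rough sketch of the resulting shape, assuming Bitcoin Core's types (the helper name is made up; the real logic lives inside `CheckTransaction()`, and the error strings match the diff quoted later in this thread). Unlike the PR, the general case below scans for nulls linearly instead of using the constant-time check at the head of the sorted range:

```cpp
// Sketch only, not the PR diff.
#include <consensus/validation.h>
#include <primitives/transaction.h>

#include <algorithm>
#include <vector>

bool CheckInputsDuplicatesAndNulls(const CTransaction& tx, TxValidationState& state)
{
    if (tx.IsCoinBase()) return true; // no prevouts to deduplicate or null-check
    if (tx.vin.size() == 1) {
        // Most common case (70-90% of transactions): a single input can't be a duplicate.
        if (tx.vin[0].prevout.IsNull()) return state.Invalid(TxValidationResult::TX_CONSENSUS, "bad-txns-prevout-null");
        return true;
    }
    if (tx.vin.size() == 2) {
        // Second most common case: one equality check and two null checks suffice.
        if (tx.vin[0].prevout == tx.vin[1].prevout) return state.Invalid(TxValidationResult::TX_CONSENSUS, "bad-txns-inputs-duplicate");
        if (tx.vin[0].prevout.IsNull() || tx.vin[1].prevout.IsNull()) return state.Invalid(TxValidationResult::TX_CONSENSUS, "bad-txns-prevout-null");
        return true;
    }
    // General case: sort a contiguous copy of the prevouts for cache-friendly duplicate detection.
    std::vector<COutPoint> sorted_prevouts;
    sorted_prevouts.reserve(tx.vin.size());
    for (const auto& in : tx.vin) sorted_prevouts.push_back(in.prevout);
    std::sort(sorted_prevouts.begin(), sorted_prevouts.end());
    if (std::adjacent_find(sorted_prevouts.begin(), sorted_prevouts.end()) != sorted_prevouts.end()) {
        return state.Invalid(TxValidationResult::TX_CONSENSUS, "bad-txns-inputs-duplicate");
    }
    for (const auto& prevout : sorted_prevouts) {
        if (prevout.IsNull()) return state.Invalid(TxValidationResult::TX_CONSENSUS, "bad-txns-prevout-null");
    }
    return true;
}
```

Note the duplicate check precedes the null check in every branch, keeping the error order stable, which is why the null check isn't folded into the collection loop.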
I experimented a bit more and it looks like I got another 17% speed improvement (on block413567 on my linux machine) over your changes @l0rinc , by using a faster comparison operator and doing direct comparisons for relatively small input sizes. Can I create a new pull request for that?
> by using a faster comparison operator and doing direct comparisons for relatively small input sizes.
Quadratic comparison would likely be faster than sorting for very small input sizes (e.g. <10, I guess), but it would complicate the code and testing considerably for a case that is likely not representative. I also thought of that but I want the code to be faster while not being more difficult to comprehend. But feel free to challenge my bias by pushing a different PR with those changes and let the best approach win! You can cherry-pick it on top of these changes if you want.
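For reference, the quadratic variant under discussion might look like the following sketch; the 17%-faster branch itself isn't shown in this thread, so the shape (and the fact that a small-size cutoff would guard it) are assumptions:

```cpp
// Hypothetical sketch of the O(n^2) small-n duplicate check.
#include <primitives/transaction.h> // COutPoint, from Bitcoin Core

#include <vector>

bool HasDuplicateQuadratic(const std::vector<COutPoint>& prevouts)
{
    // Pairwise comparisons: no allocation, no sort, and branch-friendly
    // for the tiny input counts where it would be enabled.
    for (size_t i = 0; i + 1 < prevouts.size(); ++i) {
        for (size_t j = i + 1; j < prevouts.size(); ++j) {
            if (prevouts[i] == prevouts[j]) return true;
        }
    }
    return false;
}
```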
Rebased to fix conflicts with https://github.com/bitcoin/bitcoin/pull/33116
Concept ACK.
will review soon. I'm a fan of the simpler std::vector approach.
> The bar should be pretty high for touching such critical code.
I agree, but I'd argue the changes proposed are small enough that this should be seriously considered for merge. The diff makes it hard to grasp, but few lines actually changed, and for the best. I'm torn on cae4bb877f9ef1350825ed42a9b97090d5920d39, but for the rest: Concept ACK, Approach ACK, ACK d28302fa2c09c2eac3daad7116dc9812fe01928b.
I've run the tests and double-checked the logic.
```bash
taskset -c 1 ./bench_bitcoin --filter="(ProcessTransactionBench)" --min-time=10000
```
Before:
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 52,941.54 | 18,888.76 | 0.1% | 10.79 | ProcessTransactionBench |
After:
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 52,110.83 | 19,189.87 | 0.1% | 10.78 | ProcessTransactionBench |
Without cae4bb877f9ef1350825ed42a9b97090d5920d39:
```diff
-    } else if (tx.vin.size() == 2) {
-        if (tx.vin[0].prevout == tx.vin[1].prevout) {
-            return state.Invalid(TxValidationResult::TX_CONSENSUS, "bad-txns-inputs-duplicate");
-        }
-        if (tx.vin[0].prevout.IsNull() || tx.vin[1].prevout.IsNull()) {
-            return state.Invalid(TxValidationResult::TX_CONSENSUS, "bad-txns-prevout-null");
-        }
-    }
```
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 52,778.63 | 18,947.06 | 0.1% | 10.77 | ProcessTransactionBench |
1.6% faster overall; only 0.3% faster without cae4bb877f9ef1350825ed42a9b97090d5920d39.
Looks like the gain comes from handling txs with 1-2 inputs separately.