bitcoin
bitcoin copied to clipboard
optimization: Reduce cache lookups in CCoinsViewCache::FetchCoin
Enhanced efficiency and readability of CCoinsViewCache::FetchCoin by replacing separate find() and emplace() calls with a single try_emplace(), reducing map lookups and potential insertions.
AssembleBlock shows FetchCoin as one of its bottlenecks:
These changes result in a modest performance improvement:
./src/bench/bench_bitcoin --filter='AssembleBlock' --min-time=10000
before:
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 156,160.70 | 6,403.66 | 0.6% | 10.91 | AssembleBlock |
after:
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 152,971.97 | 6,537.15 | 0.2% | 10.95 | AssembleBlock |
Further benchmarks: https://github.com/bitcoin/bitcoin/pull/30326#issuecomment-2188378721
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Code Coverage
For detailed information about the code coverage, see the test coverage report.
Reviews
See the guideline for information on the review process.
| Type | Reviewers |
|---|---|
| ACK | andrewtoth, sipa, achow101 |
| Stale ACK | luke-jr |
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Conflicts
No conflicts as of last run.
These changes result in a modest performance improvement:
I ran the benchmark on an aarch64 machine, and saw mostly worse performance when using the latest GCC, or Clang, with libstdc++. There was a small improvement when using Clang and libc++.
GCC 14.1.1 20240620 (Red Hat 14.1.1-6)
Master
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 298,908.49 | 3,345.51 | 0.1% | 2,175,461.97 | 883,257.19 | 2.463 | 314,883.84 | 0.5% | 10.99 | `AssembleBlock`
PR
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 317,848.00 | 3,146.16 | 0.5% | 2,243,862.65 | 918,949.53 | 2.442 | 316,609.60 | 0.5% | 11.00 | `AssembleBlock`
Clang 18.1.6 (Fedora 18.1.6-4.fc41)
Master
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 290,702.08 | 3,439.95 | 0.2% | 2,047,925.12 | 858,324.72 | 2.386 | 290,750.85 | 0.6% | 10.99 | `AssembleBlock`
PR
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 301,625.77 | 3,315.37 | 0.2% | 2,138,789.21 | 891,494.06 | 2.399 | 294,669.20 | 0.5% | 10.96 | `AssembleBlock`
Clang libc++
Master
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 285,978.35 | 3,496.77 | 0.2% | 1,960,603.15 | 845,960.63 | 2.318 | 294,904.09 | 0.6% | 10.98 | `AssembleBlock`
PR
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 279,828.66 | 3,573.62 | 0.1% | 1,944,499.08 | 825,776.61 | 2.355 | 296,404.49 | 0.6% | 11.01 | `AssembleBlock`
Thanks for checking @fanquake!
As the profiler showed, this is only 5% of AssembleBlock, and the benchmark varied a lot on my machine as well, it might be why you also see the discrepancy.
Do you think I should add a dedicated benchmark for this only?
I think that rather than trying to micro-optimise these somewhat sensitive functions, if you're interested in benchmarking / performance related changes, your time might be better spent reviewing Pull Requests like #28280 (which your changes conflict with).
I'd also note that all of your benchmarking / profiling seems to be being done on an arm64 macOS machine, using (I assume) Apple Clang as the compiler, and libc++ as the standard library. So when you create changes like these, even if you see improvments locally, they aren't necessarily going to be improvements for the compiler/std lib we primarily care about (because it's used to build the majority of our release binaries), which is GCC and libstdc++.
If you can benchmark pre-assumevalid IBD I think you'll see FetchCoins have a bigger impact than in AssembleBlock. It may need an actual -reindex though, I don't know if we have a benchmark with realistic workload.
Thanks for the compiler hints @fanquake, I had a lot of problems with MacOSX14.sdk, so was using Apple clang version 15.0.0 (clang-1500.3.9.4). Might give GCC and libstdc++ another try if I can find a readme of how to do that on a mac properly, everything just failed catastrophically so far.
And thanks for the IBD hint @sipa, by limiting `-stopatheight=200000` and disabling unrelated validations, the target method was indeed the main bottleneck (~20% of time spent):
Index: src/validation.cpp
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/validation.cpp b/src/validation.cpp
--- a/src/validation.cpp (revision a114ced7192204963f19abb09db05d6be14cbf61)
+++ b/src/validation.cpp (date 1719320825375)
@@ -2133,7 +2133,7 @@
bool cacheFullScriptStore, PrecomputedTransactionData& txdata,
std::vector<CScriptCheck>* pvChecks)
{
- if (tx.IsCoinBase()) return true;
+ return true;
if (pvChecks) {
pvChecks->reserve(tx.vin.size());
I ran bitcoind -reindex-chainstate -assumevalid=0 -par=1 -stopatheight=200000 for 20 iterations (skipped the first in both as a warmup) as such:
git checkout master && make -j10 >/dev/null 2>&1 && for i in {1..20}; do (time (./src/bitcoind -reindex-chainstate -assumevalid=0 -par=1 -stopatheight=200000 >/dev/null 2>&1)) 2>&1 | awk '{print $(NF-1)}'; done; git checkout paplorinc/FetchCoin && make -j10 >/dev/null 2>&1 && for i in {1..20}; do (time (./src/bitcoind -reindex-chainstate -assumevalid=0 -par=1 -stopatheight=200000 >/dev/null 2>&1)) 2>&1 | awk '{print $(NF-1)}'; done
which reveals a small but measurable difference between the two runs (~4%):
Is there any other test or benchmark or change you'd like me to do?
your time might be better spent reviewing Pull Requests like https://github.com/bitcoin/bitcoin/pull/28280
Thanks for the hint @fanquake, it's a tough PR, I have added my two sats to it: https://github.com/bitcoin/bitcoin/pull/28280#pullrequestreview-2138727793
Just a style nit, but I think if you return early on if (!inserted) it would make a cleaner diff and make it easier to see the benefit of this change.
diff --git a/src/coins.cpp b/src/coins.cpp
index a4e4d4ad32..9e8072ba02 100644
--- a/src/coins.cpp
+++ b/src/coins.cpp
@@ -41,20 +41,20 @@ size_t CCoinsViewCache::DynamicMemoryUsage() const {
}
CCoinsMap::iterator CCoinsViewCache::FetchCoin(const COutPoint &outpoint) const {
- CCoinsMap::iterator it = cacheCoins.find(outpoint);
- if (it != cacheCoins.end())
+ const auto [it, inserted] = cacheCoins.try_emplace(outpoint);
+ if (!inserted)
return it;
- Coin tmp;
- if (!base->GetCoin(outpoint, tmp))
+ if (!base->GetCoin(outpoint, it->second.coin)) {
+ cacheCoins.erase(it);
return cacheCoins.end();
- CCoinsMap::iterator ret = cacheCoins.emplace(std::piecewise_construct, std::forward_as_tuple(outpoint), std::forward_as_tuple(std::move(tmp))).first;
- if (ret->second.coin.IsSpent()) {
+ }
+ if (it->second.coin.IsSpent()) {
// The parent only has an empty entry for this outpoint; we can consider our
// version as fresh.
- ret->second.flags = CCoinsCacheEntry::FRESH;
+ it->second.flags = CCoinsCacheEntry::FRESH;
}
- cachedCoinsUsage += ret->second.coin.DynamicMemoryUsage();
- return ret;
+ cachedCoinsUsage += it->second.coin.DynamicMemoryUsage();
+ return it;
}
bool CCoinsViewCache::GetCoin(const COutPoint &outpoint, Coin &coin) const {
Thanks for the review @andrewtoth.
I'm not sure I see the advantage of inverting the if (inserted) { call, just to have two return it; calls after.
But I did rename it to ret, since we're getting the most likely return value immediately now.
disabling unrelated validations
I don't think @sipa said to disable validations. "benchmark pre-assumevalid IBD" means testing IBD (or reindex) before the assumevalid block, which right now is block 824,000. By setting -assumevalid=0 that means you disabled using assumevalid entirely, so you will do signature checks from genesis.
limiting
-stopatheight=200000
With this limit you should not see any benefit from this change. Using default dbcache settings the cache is not flushed until around block 250000 for the first time. This means that all cache entries will exist in the coinsCache map when calling FetchCoin so it will never enter the if (inserted) { block.
I think you should benchmark bitcoind -reindex-chainstate -stopatheight=820000 or so. And you should not change anything in validation.cpp, that could skew the benchmark.
I'm not sure I see the advantage of inverting the if (inserted) { call, just to have two return it; calls after.
It was because the diff is then +9/-9 and no lines change indentation, and the current diff is +12/-13 with indentation changing. So just a style nit, no advantage to the code. Also, after your change of renaming it to ret, the change I suggest would have a diff of only +6/-6 and would no longer conflict with #28280
Thanks for the review @andrewtoth and @luke-jr!
I think you should benchmark
bitcoind -reindex-chainstate -stopatheight=820000
I don't have that much space yet, but ran it with:
sudo hyperfine \
--runs 3 \
--show-output \
--export-markdown bench.md \
--parameter-list commit bd5d16,00052ec \
--prepare "sudo purge; sync; git checkout {commit}; git reset --hard; make -j10" \
"./src/bitcoind -reindex-chainstate -stopatheight=400000 -printtoconsole=0"
resulting in
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
./src/bitcoind -reindex-chainstate -stopatheight=400000 -printtoconsole=0 (commit = bd5d16) |
1012.260 ± 1.206 | 1011.416 | 1013.641 | 1.02 ± 0.00 |
./src/bitcoind -reindex-chainstate -stopatheight=400000 -printtoconsole=0 (commit = 00052ec) |
994.387 ± 3.077 | 991.018 | 997.050 | 1.00 |
i.e. a modest 2% speedup.
would no longer conflict with https://github.com/bitcoin/bitcoin/pull/28280
good point, if more changes will be needed, I'll do that!
positive lookup path is much more likely, especially during IBD.
Thanks Luke, that was my thinking as well. I'll do a full IBD this week to see how it behaves (once I free up some extra space locally).
Thanks a lot for checking, @andrewtoth!
🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/27072041125
Hints
Make sure to run all tests locally, according to the documentation.
The failure may happen due to a number of reasons, for example:
-
Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
-
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
-
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
Rebased after https://github.com/bitcoin/bitcoin/pull/28280
utACK 204ca67bba263018374fe86d7a6867362d09536f
ACK 204ca67bba263018374fe86d7a6867362d09536f