bitcoin icon indicating copy to clipboard operation
bitcoin copied to clipboard

optimization: Reduce cache lookups in CCoinsViewCache::FetchCoin

Open l0rinc opened this issue 1 year ago • 7 comments

Enhanced efficiency and readability of CCoinsViewCache::FetchCoin by replacing separate find() and emplace() calls with a single try_emplace(), reducing map lookups and potential insertions.

AssembleBlock shows FetchCoin as one of its bottlenecks:

These changes result in a modest performance improvement:

./src/bench/bench_bitcoin --filter='AssembleBlock' --min-time=10000

before:

ns/op op/s err% total benchmark
156,160.70 6,403.66 0.6% 10.91 AssembleBlock

after:

ns/op op/s err% total benchmark
152,971.97 6,537.15 0.2% 10.95 AssembleBlock

Further benchmarks: https://github.com/bitcoin/bitcoin/pull/30326#issuecomment-2188378721

l0rinc avatar Jun 23 '24 22:06 l0rinc

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK andrewtoth, sipa, achow101
Stale ACK luke-jr

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

DrahtBot avatar Jun 23 '24 22:06 DrahtBot

These changes result in a modest performance improvement:

I ran the benchmark on an aarch64 machine, and saw mostly worse performance when using the latest GCC, or Clang, with libstdc++. There was a small improvement when using Clang and libc++.

GCC 14.1.1 20240620 (Red Hat 14.1.1-6)

Master

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          298,908.49 |            3,345.51 |    0.1% |    2,175,461.97 |      883,257.19 |  2.463 |     314,883.84 |    0.5% |     10.99 | `AssembleBlock`

PR

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          317,848.00 |            3,146.16 |    0.5% |    2,243,862.65 |      918,949.53 |  2.442 |     316,609.60 |    0.5% |     11.00 | `AssembleBlock`

Clang 18.1.6 (Fedora 18.1.6-4.fc41)

Master

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          290,702.08 |            3,439.95 |    0.2% |    2,047,925.12 |      858,324.72 |  2.386 |     290,750.85 |    0.6% |     10.99 | `AssembleBlock`

PR

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          301,625.77 |            3,315.37 |    0.2% |    2,138,789.21 |      891,494.06 |  2.399 |     294,669.20 |    0.5% |     10.96 | `AssembleBlock`

Clang libc++

Master

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          285,978.35 |            3,496.77 |    0.2% |    1,960,603.15 |      845,960.63 |  2.318 |     294,904.09 |    0.6% |     10.98 | `AssembleBlock`

PR

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          279,828.66 |            3,573.62 |    0.1% |    1,944,499.08 |      825,776.61 |  2.355 |     296,404.49 |    0.6% |     11.01 | `AssembleBlock`

fanquake avatar Jun 24 '24 14:06 fanquake

Thanks for checking @fanquake! As the profiler showed, this is only 5% of AssembleBlock, and the benchmark varied a lot on my machine as well, it might be why you also see the discrepancy. Do you think I should add a dedicated benchmark for this only?

l0rinc avatar Jun 24 '24 14:06 l0rinc

I think that rather than trying to micro-optimise these somewhat sensitive functions, if you're interested in benchmarking / performance related changes, your time might be better spent reviewing Pull Requests like #28280 (which your changes conflict with).

I'd also note that all of your benchmarking / profiling seems to be being done on an arm64 macOS machine, using (I assume) Apple Clang as the compiler, and libc++ as the standard library. So when you create changes like these, even if you see improvments locally, they aren't necessarily going to be improvements for the compiler/std lib we primarily care about (because it's used to build the majority of our release binaries), which is GCC and libstdc++.

fanquake avatar Jun 24 '24 14:06 fanquake

If you can benchmark pre-assumevalid IBD I think you'll see FetchCoins have a bigger impact than in AssembleBlock. It may need an actual -reindex though, I don't know if we have a benchmark with realistic workload.

sipa avatar Jun 24 '24 17:06 sipa

Thanks for the compiler hints @fanquake, I had a lot of problems with MacOSX14.sdk, so was using Apple clang version 15.0.0 (clang-1500.3.9.4). Might give GCC and libstdc++ another try if I can find a readme of how to do that on a mac properly, everything just failed catastrophically so far.

And thanks for the IBD hint @sipa, by limiting `-stopatheight=200000` and disabling unrelated validations, the target method was indeed the main bottleneck (~20% of time spent):
Index: src/validation.cpp
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/validation.cpp b/src/validation.cpp
--- a/src/validation.cpp	(revision a114ced7192204963f19abb09db05d6be14cbf61)
+++ b/src/validation.cpp	(date 1719320825375)
@@ -2133,7 +2133,7 @@
                        bool cacheFullScriptStore, PrecomputedTransactionData& txdata,
                        std::vector<CScriptCheck>* pvChecks)
 {
-    if (tx.IsCoinBase()) return true;
+    return true;
 
     if (pvChecks) {
         pvChecks->reserve(tx.vin.size());

I ran bitcoind -reindex-chainstate -assumevalid=0 -par=1 -stopatheight=200000 for 20 iterations (skipped the first in both as a warmup) as such:

git checkout master && make -j10 >/dev/null 2>&1 && for i in {1..20}; do (time (./src/bitcoind -reindex-chainstate -assumevalid=0 -par=1 -stopatheight=200000 >/dev/null 2>&1)) 2>&1 | awk '{print $(NF-1)}'; done; git checkout paplorinc/FetchCoin && make -j10 >/dev/null 2>&1 && for i in {1..20}; do (time (./src/bitcoind -reindex-chainstate -assumevalid=0 -par=1 -stopatheight=200000 >/dev/null 2>&1)) 2>&1 | awk '{print $(NF-1)}'; done

which reveals a small but measurable difference between the two runs (~4%):

Is there any other test or benchmark or change you'd like me to do?

l0rinc avatar Jun 25 '24 09:06 l0rinc

your time might be better spent reviewing Pull Requests like https://github.com/bitcoin/bitcoin/pull/28280

Thanks for the hint @fanquake, it's a tough PR, I have added my two sats to it: https://github.com/bitcoin/bitcoin/pull/28280#pullrequestreview-2138727793

l0rinc avatar Jun 28 '24 16:06 l0rinc

Just a style nit, but I think if you return early on if (!inserted) it would make a cleaner diff and make it easier to see the benefit of this change.

diff --git a/src/coins.cpp b/src/coins.cpp
index a4e4d4ad32..9e8072ba02 100644
--- a/src/coins.cpp
+++ b/src/coins.cpp
@@ -41,20 +41,20 @@ size_t CCoinsViewCache::DynamicMemoryUsage() const {
 }
 
 CCoinsMap::iterator CCoinsViewCache::FetchCoin(const COutPoint &outpoint) const {
-    CCoinsMap::iterator it = cacheCoins.find(outpoint);
-    if (it != cacheCoins.end())
+    const auto [it, inserted] = cacheCoins.try_emplace(outpoint);
+    if (!inserted)
         return it;
-    Coin tmp;
-    if (!base->GetCoin(outpoint, tmp))
+    if (!base->GetCoin(outpoint, it->second.coin)) {
+        cacheCoins.erase(it);
         return cacheCoins.end();
-    CCoinsMap::iterator ret = cacheCoins.emplace(std::piecewise_construct, std::forward_as_tuple(outpoint), std::forward_as_tuple(std::move(tmp))).first;
-    if (ret->second.coin.IsSpent()) {
+    }
+    if (it->second.coin.IsSpent()) {
         // The parent only has an empty entry for this outpoint; we can consider our
         // version as fresh.
-        ret->second.flags = CCoinsCacheEntry::FRESH;
+        it->second.flags = CCoinsCacheEntry::FRESH;
     }
-    cachedCoinsUsage += ret->second.coin.DynamicMemoryUsage();
-    return ret;
+    cachedCoinsUsage += it->second.coin.DynamicMemoryUsage();
+    return it;
 }
 
 bool CCoinsViewCache::GetCoin(const COutPoint &outpoint, Coin &coin) const {

andrewtoth avatar Jul 04 '24 22:07 andrewtoth

Thanks for the review @andrewtoth. I'm not sure I see the advantage of inverting the if (inserted) { call, just to have two return it; calls after. But I did rename it to ret, since we're getting the most likely return value immediately now.

l0rinc avatar Jul 05 '24 06:07 l0rinc

disabling unrelated validations

I don't think @sipa said to disable validations. "benchmark pre-assumevalid IBD" means testing IBD (or reindex) before the assumevalid block, which right now is block 824,000. By setting -assumevalid=0 that means you disabled using assumevalid entirely, so you will do signature checks from genesis.

limiting -stopatheight=200000

With this limit you should not see any benefit from this change. Using default dbcache settings the cache is not flushed until around block 250000 for the first time. This means that all cache entries will exist in the coinsCache map when calling FetchCoin so it will never enter the if (inserted) { block.

I think you should benchmark bitcoind -reindex-chainstate -stopatheight=820000 or so. And you should not change anything in validation.cpp, that could skew the benchmark.

I'm not sure I see the advantage of inverting the if (inserted) { call, just to have two return it; calls after.

It was because the diff is then +9/-9 and no lines change indentation, and the current diff is +12/-13 with indentation changing. So just a style nit, no advantage to the code. Also, after your change of renaming it to ret, the change I suggest would have a diff of only +6/-6 and would no longer conflict with #28280

andrewtoth avatar Jul 05 '24 13:07 andrewtoth

Thanks for the review @andrewtoth and @luke-jr!

I think you should benchmark bitcoind -reindex-chainstate -stopatheight=820000

I don't have that much space yet, but ran it with:

sudo hyperfine \
--runs 3 \
--show-output \
--export-markdown bench.md \
--parameter-list commit bd5d16,00052ec \
--prepare "sudo purge; sync; git checkout {commit}; git reset --hard; make -j10" \
"./src/bitcoind -reindex-chainstate -stopatheight=400000 -printtoconsole=0"

resulting in

Command Mean [s] Min [s] Max [s] Relative
./src/bitcoind -reindex-chainstate -stopatheight=400000 -printtoconsole=0 (commit = bd5d16) 1012.260 ± 1.206 1011.416 1013.641 1.02 ± 0.00
./src/bitcoind -reindex-chainstate -stopatheight=400000 -printtoconsole=0 (commit = 00052ec) 994.387 ± 3.077 991.018 997.050 1.00

i.e. a modest 2% speedup.

would no longer conflict with https://github.com/bitcoin/bitcoin/pull/28280

good point, if more changes will be needed, I'll do that!

positive lookup path is much more likely, especially during IBD.

Thanks Luke, that was my thinking as well. I'll do a full IBD this week to see how it behaves (once I free up some extra space locally).

l0rinc avatar Jul 07 '24 11:07 l0rinc

Thanks a lot for checking, @andrewtoth!

l0rinc avatar Jul 07 '24 19:07 l0rinc

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/27072041125

Hints

Make sure to run all tests locally, according to the documentation.

The failure may happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

DrahtBot avatar Jul 22 '24 11:07 DrahtBot

Rebased after https://github.com/bitcoin/bitcoin/pull/28280

l0rinc avatar Aug 08 '24 20:08 l0rinc

utACK 204ca67bba263018374fe86d7a6867362d09536f

sipa avatar Aug 09 '24 20:08 sipa

ACK 204ca67bba263018374fe86d7a6867362d09536f

achow101 avatar Aug 12 '24 20:08 achow101