cosmos-sdk
cosmos-sdk copied to clipboard
perf: speed up crisis invariant check (juno genesis)
Description
Looking at the time it takes to handle the crisis invariant check when starting a node, the vast majority of time taken by the invariant check is due to the can-withdraw
function, which takes ages because it sequentially loops every validator and every delegator. I switched the looping to be concurrent instead, resulting in a pretty huge speedup.
This check isn't part of consensus or anything, so non-deterministic concurrency should be fair game here seeing as how rewards are independent between validators.
I'm not super familiar with the internals of x/crisis, but it doesn't seem to me that I broke anything in doing this - behaviour may differ since the code will panic on the first invariant error it encounters (which may be different between executions if there are multiple), but I feel this is a compromise that we should be willing to make. Still, I might be missing something here, so feel free to correct me if that isn't the case.
When testing using Pheonix-2 genesis file
On my 24-core AMD Ryzen 9 3900x: Before:
TBD - will update when it finishes
After:
12:27PM INF starting node with ABCI Tendermint in-process
12:28PM INF Starting multiAppConn service impl=multiAppConn module=proxy
12:28PM INF Starting localClient service connection=query impl=localClient module=abci-client
12:28PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
12:28PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
12:28PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
12:28PM INF Starting EventBus service impl=EventBus module=events
12:28PM INF Starting PubSub service impl=PubSub module=pubsub
12:28PM INF Starting IndexerService service impl=IndexerService module=txindex
12:28PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0
12:28PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
12:28PM INF asserting crisis invariants inv=1/11 module=x/crisis name=distribution/nonnegative-outstanding
12:28PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/can-withdraw
12:29PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/reference-count
12:29PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/module-account
12:29PM INF asserting crisis invariants inv=5/11 module=x/crisis name=bank/nonnegative-outstanding
12:29PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/total-supply
12:29PM INF asserting crisis invariants inv=7/11 module=x/crisis name=gov/module-account
12:29PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
12:29PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
12:29PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
12:29PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
12:29PM INF asserted all invariants duration=49032.084869 height=4136532 module=x/crisis
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.
I have...
- [x] included the correct type prefix in the PR title
- [ ] added
!
to the type prefix if API or client breaking change - [x] targeted the correct branch (see PR Targeting)
- [ ] provided a link to the relevant issue or specification
- [ ] followed the guidelines for building modules
- [ ] included the necessary unit and integration tests
- [ ] added a changelog entry to
CHANGELOG.md
- [x] included comments for documenting Go code
- [ ] updated the relevant documentation or specification
- [ ] reviewed "Files changed" and left comments if necessary
- [ ] confirmed all CI checks have passed
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.
I have...
- [ ] confirmed the correct type prefix in the PR title
- [ ] confirmed
!
in the type prefix if API or client breaking change - [ ] confirmed all author checklist items have been addressed
- [ ] reviewed state machine logic
- [ ] reviewed API design and naming
- [ ] reviewed documentation is accurate
- [ ] reviewed tests and test coverage
- [ ] manually tested (if applicable)
This is a substantially different approach taken from #12885 (which I don't really understand and fear is too complicated).
I feel like running these calls in parallel should be OK. Can you please post benchmarks pre/post?
Yeah, it's actually pretty cool to see how different the approaches were.
I posted the bench of the after in the PR (the entire crisis check took about a minute). I gave up on running the before check after about an hour, but I'll let it run for a bit and update the PR later on when it's over
@Cashmaney I meant native go benchmarks that should you the ns/ms delta where you just run the invariant function against some state, preferably Juno's :)
@alexanderbez You mean like BenchmarkInvariants? The problem is taking a real-world state and creating such a complex context in a test. Is there an easy way to pull a genesis file to simapp and run a test on that? I went with a simpler approach - debug prints inside the function😅
Post:
2022/08/10 17:00:11 CanWithdrawInvariant check took 53.20056586s
Pre:
Now on 1 hour+ ...
I also have the two binaries here - https://github.com/Cashmaney/cosmos-sdk/releases/tag/concurrent-invariant-check
@Cashmaney yes, take the genesis state and load the file into memory and populate the distribution and staking keepers. Then you can run benchmarks off of that.
thank you
Couldn't figure out how to get the simapp working properly with the Juno state, so I just ran another real-world bench on gaia with the cosmoshub-4 genesis state:
Post fix:
2022/08/10 22:24:55 CanWithdrawInvariant took: 10.956378769s
Pre fix:
2022/08/10 23:47:04 CanWithdrawInvariant check took 1h17m46.518379309s
hey @Cashmaney, nice solution!
From my testing, the primary bottleneck in the original code is creating the iterator in IterateValidatorSlashEventsBetween
. It takes about 180ms in the original, single-threaded code. My solution focused on caching those results.
This solution doesn't directly address that problem, but seems to have fixed it anyway. I checked how long those calls take in this PR and its well under 1ms.
I don't know why the call behavior would change so drastically by wrapping parent calls in go routines. I wonder if anyone can shed some light on what's going on here.
hey @Cashmaney, nice solution!
From my testing, the primary bottleneck in the original code is creating the iterator in
IterateValidatorSlashEventsBetween
. It takes about 180ms in the original, single-threaded code. My solution focused on caching those results.This solution doesn't directly address that problem, but seems to have fixed it anyway. I checked how long those calls take in this PR and its well under 1ms.
I don't know why the call behavior would change so drastically by wrapping parent calls in go routines. I wonder if anyone can shed some light on what's going on here.
Hey. Thanks - it was really funny to see two radically different solutions posted basically at the same time.
You seemed to have dived a bit deeper than I did. I'm not even sure how and where that function is called from, so I'll leave it to someone more knowledgeable in both the sdk and Go to weigh in on that
@Cashmaney there's no need to create a SimApp
, you just need to benchmark CanWithdrawInvariant
and to do that you only need a Keeper
. I would just create a keeper and call InitGenesis
on it perhaps?
I'm in favor of https://github.com/cosmos/cosmos-sdk/pull/12885 since it's a general case solution. @alexanderbez I don't think we should be afraid to improve old code, I made an analysis of the cache iterator and it seems OK to me.
@kocubinski to be fair, @alexanderbez saw a much different version of the code that was merely a workaround for the actual problem. The assessment of it being too complicated was probably correct.
a much different version of the code that was merely a workaround for the actual problem
Ah, I see there was a force push. This is my first look at the PR, nice simplification.
This solution doesn't directly address that problem, but seems to have fixed it anyway. I checked how long those calls take in this PR and its well under 1ms.
I'd like to uncover why we see these perf improvements here too, but in end I don't see why we can't merge both.
@kocubinski Based on my testing, it seems the goroutines help the code find data that will shrink/clear the unsorted cache, allowing the rest of the processing to go much more quickly. How fast that happens depends on the number of cpu cores, and varies slightly from run to run.
@kocubinski @alexanderbez do we still want this pr as well?
@marbar3778 yeah I don't see why not actually. Let me give this one quick look again, and we can merge 👍
Any objections @kocubinski?
LGTM! Just a few minor nit comments
Fixed:)
Would like another approval prior to merge :)
Oh and teeny nit... I will soon post a PR that fumpts this :)
https://github.com/Cashmaney/cosmos-sdk/pull/1
@alexanderbez hey, I checked on codeql.
Answer is no.
https://github.com/github/codeql/issues/9298
Codecov Report
Merging #12886 (bad3bab) into main (7728516) will decrease coverage by
0.93%
. The diff coverage is0.00%
.
@@ Coverage Diff @@
## main #12886 +/- ##
==========================================
- Coverage 54.10% 53.16% -0.94%
==========================================
Files 662 641 -21
Lines 56426 54885 -1541
==========================================
- Hits 30528 29181 -1347
+ Misses 23515 23378 -137
+ Partials 2383 2326 -57
@alexanderbez are we merging this?
We already merged https://github.com/cosmos/cosmos-sdk/pull/12885 -- we can close this PR.