perf: Juno genesis
Description
This PR fixes a bounty by the Juno team. Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 1 minute. See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.
The root problem is deep in the "can-withraw" invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337. Each call takes about 180ms on my machine. It's called once per delegation, which for Juno's genesis file means millions of calls.
The solution presented here is to query these slash events up front, and pass the data down the call stack. Doing so results in the function only being called once per validator, which is where the 10 hour to 1 minute improvement comes from.
This is very much a draft PR, and I welcome feedback on how you would like to see this implemented. For this PoC, I simply copoied the functions with the appropriate changes - I'm not sure if you'd prefer it as an optional parameter to existing functions or some other strategy.
If you're interested in seeing the results in action, you can use https://github.com/blazeroni/juno-bounty/tree/v9.0.0-invariants.
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.
I have...
- [ ] included the correct type prefix in the PR title
- [ ] added
!to the type prefix if API or client breaking change - [ ] targeted the correct branch (see PR Targeting)
- [ ] provided a link to the relevant issue or specification
- [ ] followed the guidelines for building modules
- [ ] included the necessary unit and integration tests
- [ ] added a changelog entry to
CHANGELOG.md - [ ] included comments for documenting Go code
- [ ] updated the relevant documentation or specification
- [ ] reviewed "Files changed" and left comments if necessary
- [ ] confirmed all CI checks have passed
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.
I have...
- [ ] confirmed the correct type prefix in the PR title
- [ ] confirmed
!in the type prefix if API or client breaking change - [ ] confirmed all author checklist items have been addressed
- [ ] reviewed state machine logic
- [ ] reviewed API design and naming
- [ ] reviewed documentation is accurate
- [ ] reviewed tests and test coverage
- [ ] manually tested (if applicable)
I dug deeper and found the root problem. Replaced all of the previous workaround with a real fix and should also help in other situations besides just the Juno genesis issue
@odeke-em not sure if you've started looking at this, but the latest code is much different than the previous version. Just a heads up.
Added a benchmark which illustrates the problem. Finishes in a few seconds with the changes in the PR. Takes minutes without it.
@odeke-em thanks for the feedback! I've updated the benchmark and added it as a comment. I had to reduce the size of the original benchmark (from 200k to 10k unsorted values) since the before benchmark was crashing with values much higher than what I used. Regardless, the difference between before/after is still extreme.
Impressive results. I've measured similar duration for crisis invariants on similar hardware. Moreover since this PR changes CacheKV performance, the whole startup from a large genesis.json improves.
Without this change the invariants took roughly 14 hours to complete, and there is a stage of replaying blocks and other initialization that took an additional two hours. Validators were reporting 16+ hours to begin dialing peers.
With this change the entire initialization process completes in around 5 minutes.
@kocubinski yes, your analysis matches my own.
To test/verify this, I used a modified version of the latest Juno production release, available here: https://github.com/blazeroni/juno-bounty/commits/v9.0.0-invariants. That version depends on a modified v0.45.6 Cosmos SDK, with the only change being the changes in this PR. Instructions to obtain the Juno genesis file are here: https://github.com/CosmosContracts/incident-response/blob/main/28-July-22/genesis.md. You'll just need to install, init, replace genesis, and start.
Cheers!
Codecov Report
Merging #12885 (e2acc3f) into main (8e9e330) will decrease coverage by
1.00%. The diff coverage is100.00%.
@@ Coverage Diff @@
## main #12885 +/- ##
==========================================
- Coverage 56.58% 55.58% -1.01%
==========================================
Files 697 647 -50
Lines 59285 54900 -4385
==========================================
- Hits 33545 30514 -3031
+ Misses 22909 21921 -988
+ Partials 2831 2465 -366
@marbar3778 do we want this backported?
@marbar3778 do we want this backported?
would be nice. I cant see why its not backwards compatible