cosmos-sdk icon indicating copy to clipboard operation
cosmos-sdk copied to clipboard

perf: Juno genesis

Open blazeroni opened this issue 2 years ago • 1 comments

Description

This PR fixes a bounty by the Juno team. Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 1 minute. See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the "can-withraw" invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337. Each call takes about 180ms on my machine. It's called once per delegation, which for Juno's genesis file means millions of calls.

The solution presented here is to query these slash events up front, and pass the data down the call stack. Doing so results in the function only being called once per validator, which is where the 10 hour to 1 minute improvement comes from.

This is very much a draft PR, and I welcome feedback on how you would like to see this implemented. For this PoC, I simply copoied the functions with the appropriate changes - I'm not sure if you'd prefer it as an optional parameter to existing functions or some other strategy.

If you're interested in seeing the results in action, you can use https://github.com/blazeroni/juno-bounty/tree/v9.0.0-invariants.


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.

I have...

  • [ ] included the correct type prefix in the PR title
  • [ ] added ! to the type prefix if API or client breaking change
  • [ ] targeted the correct branch (see PR Targeting)
  • [ ] provided a link to the relevant issue or specification
  • [ ] followed the guidelines for building modules
  • [ ] included the necessary unit and integration tests
  • [ ] added a changelog entry to CHANGELOG.md
  • [ ] included comments for documenting Go code
  • [ ] updated the relevant documentation or specification
  • [ ] reviewed "Files changed" and left comments if necessary
  • [ ] confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.

I have...

  • [ ] confirmed the correct type prefix in the PR title
  • [ ] confirmed ! in the type prefix if API or client breaking change
  • [ ] confirmed all author checklist items have been addressed
  • [ ] reviewed state machine logic
  • [ ] reviewed API design and naming
  • [ ] reviewed documentation is accurate
  • [ ] reviewed tests and test coverage
  • [ ] manually tested (if applicable)

blazeroni avatar Aug 10 '22 10:08 blazeroni

I dug deeper and found the root problem. Replaced all of the previous workaround with a real fix and should also help in other situations besides just the Juno genesis issue

blazeroni avatar Aug 11 '22 23:08 blazeroni

@odeke-em not sure if you've started looking at this, but the latest code is much different than the previous version. Just a heads up.

blazeroni avatar Aug 11 '22 23:08 blazeroni

Added a benchmark which illustrates the problem. Finishes in a few seconds with the changes in the PR. Takes minutes without it.

blazeroni avatar Aug 12 '22 01:08 blazeroni

@odeke-em thanks for the feedback! I've updated the benchmark and added it as a comment. I had to reduce the size of the original benchmark (from 200k to 10k unsorted values) since the before benchmark was crashing with values much higher than what I used. Regardless, the difference between before/after is still extreme.

blazeroni avatar Aug 16 '22 03:08 blazeroni

Impressive results. I've measured similar duration for crisis invariants on similar hardware. Moreover since this PR changes CacheKV performance, the whole startup from a large genesis.json improves.

Without this change the invariants took roughly 14 hours to complete, and there is a stage of replaying blocks and other initialization that took an additional two hours. Validators were reporting 16+ hours to begin dialing peers.

With this change the entire initialization process completes in around 5 minutes.

joeabbey avatar Aug 16 '22 15:08 joeabbey

@kocubinski yes, your analysis matches my own.

To test/verify this, I used a modified version of the latest Juno production release, available here: https://github.com/blazeroni/juno-bounty/commits/v9.0.0-invariants. That version depends on a modified v0.45.6 Cosmos SDK, with the only change being the changes in this PR. Instructions to obtain the Juno genesis file are here: https://github.com/CosmosContracts/incident-response/blob/main/28-July-22/genesis.md. You'll just need to install, init, replace genesis, and start.

Cheers!

blazeroni avatar Aug 16 '22 21:08 blazeroni

Codecov Report

Merging #12885 (e2acc3f) into main (8e9e330) will decrease coverage by 1.00%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #12885      +/-   ##
==========================================
- Coverage   56.58%   55.58%   -1.01%     
==========================================
  Files         697      647      -50     
  Lines       59285    54900    -4385     
==========================================
- Hits        33545    30514    -3031     
+ Misses      22909    21921     -988     
+ Partials     2831     2465     -366     
Impacted Files Coverage Δ
store/cachekv/store.go 86.87% <100.00%> (+0.25%) :arrow_up:
orm/model/ormdb/file.go
orm/model/ormtable/index_impl.go
orm/model/ormdb/json.go
db/memdb/db.go
orm/model/ormtable/singleton.go
orm/model/ormtable/filter.go
orm/model/ormtable/table_impl.go
db/memdb/iterator.go
orm/encoding/ormfield/enum.go
... and 41 more

codecov[bot] avatar Aug 16 '22 23:08 codecov[bot]

@marbar3778 do we want this backported?

alexanderbez avatar Aug 18 '22 13:08 alexanderbez

@marbar3778 do we want this backported?

would be nice. I cant see why its not backwards compatible

tac0turtle avatar Aug 18 '22 13:08 tac0turtle