cosmos-sdk icon indicating copy to clipboard operation
cosmos-sdk copied to clipboard

perf: speed up crisis invariant check (juno genesis)

Open Cashmaney opened this issue 2 years ago • 8 comments

Description

Looking at the time it takes to handle the crisis invariant check when starting a node, the vast majority of time taken by the invariant check is due to the can-withdraw function, which takes ages because it sequentially loops every validator and every delegator. I switched the looping to be concurrent instead, resulting in a pretty huge speedup.

This check isn't part of consensus or anything, so non-deterministic concurrency should be fair game here seeing as how rewards are independent between validators.

I'm not super familiar with the internals of x/crisis, but it doesn't seem to me that I broke anything in doing this - behaviour may differ since the code will panic on the first invariant error it encounters (which may be different between executions if there are multiple), but I feel this is a compromise that we should be willing to make. Still, I might be missing something here, so feel free to correct me if that isn't the case.

When testing using Pheonix-2 genesis file

On my 24-core AMD Ryzen 9 3900x: Before:

TBD - will update when it finishes

After:

12:27PM INF starting node with ABCI Tendermint in-process
12:28PM INF Starting multiAppConn service impl=multiAppConn module=proxy
12:28PM INF Starting localClient service connection=query impl=localClient module=abci-client
12:28PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
12:28PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
12:28PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
12:28PM INF Starting EventBus service impl=EventBus module=events
12:28PM INF Starting PubSub service impl=PubSub module=pubsub
12:28PM INF Starting IndexerService service impl=IndexerService module=txindex
12:28PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0
12:28PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
12:28PM INF asserting crisis invariants inv=1/11 module=x/crisis name=distribution/nonnegative-outstanding
12:28PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/can-withdraw
12:29PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/reference-count
12:29PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/module-account
12:29PM INF asserting crisis invariants inv=5/11 module=x/crisis name=bank/nonnegative-outstanding
12:29PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/total-supply
12:29PM INF asserting crisis invariants inv=7/11 module=x/crisis name=gov/module-account
12:29PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
12:29PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
12:29PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
12:29PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
12:29PM INF asserted all invariants duration=49032.084869 height=4136532 module=x/crisis

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.

I have...

  • [x] included the correct type prefix in the PR title
  • [ ] added ! to the type prefix if API or client breaking change
  • [x] targeted the correct branch (see PR Targeting)
  • [ ] provided a link to the relevant issue or specification
  • [ ] followed the guidelines for building modules
  • [ ] included the necessary unit and integration tests
  • [ ] added a changelog entry to CHANGELOG.md
  • [x] included comments for documenting Go code
  • [ ] updated the relevant documentation or specification
  • [ ] reviewed "Files changed" and left comments if necessary
  • [ ] confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.

I have...

  • [ ] confirmed the correct type prefix in the PR title
  • [ ] confirmed ! in the type prefix if API or client breaking change
  • [ ] confirmed all author checklist items have been addressed
  • [ ] reviewed state machine logic
  • [ ] reviewed API design and naming
  • [ ] reviewed documentation is accurate
  • [ ] reviewed tests and test coverage
  • [ ] manually tested (if applicable)

Cashmaney avatar Aug 10 '22 10:08 Cashmaney

This is a substantially different approach taken from #12885 (which I don't really understand and fear is too complicated).

I feel like running these calls in parallel should be OK. Can you please post benchmarks pre/post?

alexanderbez avatar Aug 10 '22 13:08 alexanderbez

Yeah, it's actually pretty cool to see how different the approaches were.

I posted the bench of the after in the PR (the entire crisis check took about a minute). I gave up on running the before check after about an hour, but I'll let it run for a bit and update the PR later on when it's over

Cashmaney avatar Aug 10 '22 13:08 Cashmaney

@Cashmaney I meant native go benchmarks that should you the ns/ms delta where you just run the invariant function against some state, preferably Juno's :)

alexanderbez avatar Aug 10 '22 13:08 alexanderbez

@alexanderbez You mean like BenchmarkInvariants? The problem is taking a real-world state and creating such a complex context in a test. Is there an easy way to pull a genesis file to simapp and run a test on that? I went with a simpler approach - debug prints inside the function😅

Post:

2022/08/10 17:00:11 CanWithdrawInvariant check took 53.20056586s

Pre:

Now on 1 hour+ ...

I also have the two binaries here - https://github.com/Cashmaney/cosmos-sdk/releases/tag/concurrent-invariant-check

Cashmaney avatar Aug 10 '22 14:08 Cashmaney

@Cashmaney yes, take the genesis state and load the file into memory and populate the distribution and staking keepers. Then you can run benchmarks off of that.

alexanderbez avatar Aug 10 '22 14:08 alexanderbez

thank you

faddat avatar Aug 10 '22 15:08 faddat

Couldn't figure out how to get the simapp working properly with the Juno state, so I just ran another real-world bench on gaia with the cosmoshub-4 genesis state:

Post fix:

2022/08/10 22:24:55 CanWithdrawInvariant took: 10.956378769s

Pre fix:

2022/08/10 23:47:04 CanWithdrawInvariant check took 1h17m46.518379309s

Cashmaney avatar Aug 10 '22 20:08 Cashmaney

hey @Cashmaney, nice solution!

From my testing, the primary bottleneck in the original code is creating the iterator in IterateValidatorSlashEventsBetween. It takes about 180ms in the original, single-threaded code. My solution focused on caching those results.

This solution doesn't directly address that problem, but seems to have fixed it anyway. I checked how long those calls take in this PR and its well under 1ms.

I don't know why the call behavior would change so drastically by wrapping parent calls in go routines. I wonder if anyone can shed some light on what's going on here.

blazeroni avatar Aug 10 '22 22:08 blazeroni

hey @Cashmaney, nice solution!

From my testing, the primary bottleneck in the original code is creating the iterator in IterateValidatorSlashEventsBetween. It takes about 180ms in the original, single-threaded code. My solution focused on caching those results.

This solution doesn't directly address that problem, but seems to have fixed it anyway. I checked how long those calls take in this PR and its well under 1ms.

I don't know why the call behavior would change so drastically by wrapping parent calls in go routines. I wonder if anyone can shed some light on what's going on here.

Hey. Thanks - it was really funny to see two radically different solutions posted basically at the same time.

You seemed to have dived a bit deeper than I did. I'm not even sure how and where that function is called from, so I'll leave it to someone more knowledgeable in both the sdk and Go to weigh in on that

Cashmaney avatar Aug 11 '22 11:08 Cashmaney

@Cashmaney there's no need to create a SimApp, you just need to benchmark CanWithdrawInvariant and to do that you only need a Keeper. I would just create a keeper and call InitGenesis on it perhaps?

alexanderbez avatar Aug 15 '22 01:08 alexanderbez

I'm in favor of https://github.com/cosmos/cosmos-sdk/pull/12885 since it's a general case solution. @alexanderbez I don't think we should be afraid to improve old code, I made an analysis of the cache iterator and it seems OK to me.

kocubinski avatar Aug 16 '22 20:08 kocubinski

@kocubinski to be fair, @alexanderbez saw a much different version of the code that was merely a workaround for the actual problem. The assessment of it being too complicated was probably correct.

blazeroni avatar Aug 16 '22 21:08 blazeroni

a much different version of the code that was merely a workaround for the actual problem

Ah, I see there was a force push. This is my first look at the PR, nice simplification.

This solution doesn't directly address that problem, but seems to have fixed it anyway. I checked how long those calls take in this PR and its well under 1ms.

I'd like to uncover why we see these perf improvements here too, but in end I don't see why we can't merge both.

kocubinski avatar Aug 16 '22 22:08 kocubinski

@kocubinski Based on my testing, it seems the goroutines help the code find data that will shrink/clear the unsorted cache, allowing the rest of the processing to go much more quickly. How fast that happens depends on the number of cpu cores, and varies slightly from run to run.

blazeroni avatar Aug 16 '22 23:08 blazeroni

@kocubinski @alexanderbez do we still want this pr as well?

tac0turtle avatar Aug 19 '22 16:08 tac0turtle

@marbar3778 yeah I don't see why not actually. Let me give this one quick look again, and we can merge 👍

Any objections @kocubinski?

alexanderbez avatar Aug 19 '22 20:08 alexanderbez

LGTM! Just a few minor nit comments

Fixed:)

Cashmaney avatar Aug 20 '22 20:08 Cashmaney

Would like another approval prior to merge :)

alexanderbez avatar Aug 21 '22 02:08 alexanderbez

Oh and teeny nit... I will soon post a PR that fumpts this :)

https://github.com/Cashmaney/cosmos-sdk/pull/1

faddat avatar Aug 21 '22 12:08 faddat

@alexanderbez hey, I checked on codeql.

Answer is no.

https://github.com/github/codeql/issues/9298

faddat avatar Aug 21 '22 16:08 faddat

Codecov Report

Merging #12886 (bad3bab) into main (7728516) will decrease coverage by 0.93%. The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #12886      +/-   ##
==========================================
- Coverage   54.10%   53.16%   -0.94%     
==========================================
  Files         662      641      -21     
  Lines       56426    54885    -1541     
==========================================
- Hits        30528    29181    -1347     
+ Misses      23515    23378     -137     
+ Partials     2383     2326      -57     
Impacted Files Coverage Δ
x/distribution/keeper/invariants.go 0.00% <0.00%> (ø)
depinject/one_per_module.go
depinject/internal/codegen/import.go
depinject/internal/codegen/file.go
depinject/simple.go
depinject/supply.go
depinject/module_dep.go
depinject/module_key.go
depinject/internal/codegen/ident.go
... and 15 more

codecov[bot] avatar Aug 31 '22 17:08 codecov[bot]

@alexanderbez are we merging this?

tac0turtle avatar Sep 16 '22 07:09 tac0turtle

We already merged https://github.com/cosmos/cosmos-sdk/pull/12885 -- we can close this PR.

alexanderbez avatar Sep 16 '22 15:09 alexanderbez