cosmos-sdk icon indicating copy to clipboard operation
cosmos-sdk copied to clipboard

feat: Optimistic Execution

Open facundomedica opened this issue 2 years ago • 1 comments

Description

RFC: #16499 Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.

I have...

  • [ ] included the correct type prefix in the PR title
  • [ ] added ! to the type prefix if API or client breaking change
  • [ ] targeted the correct branch (see PR Targeting)
  • [ ] provided a link to the relevant issue or specification
  • [ ] followed the guidelines for building modules
  • [ ] included the necessary unit and integration tests
  • [ ] added a changelog entry to CHANGELOG.md
  • [ ] included comments for documenting Go code
  • [ ] updated the relevant documentation or specification
  • [ ] reviewed "Files changed" and left comments if necessary
  • [ ] confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.

I have...

  • [ ] confirmed the correct type prefix in the PR title
  • [ ] confirmed ! in the type prefix if API or client breaking change
  • [ ] confirmed all author checklist items have been addressed
  • [ ] reviewed state machine logic
  • [ ] reviewed API design and naming
  • [ ] reviewed documentation is accurate
  • [ ] reviewed tests and test coverage
  • [ ] manually tested (if applicable)

facundomedica avatar Jun 15 '23 19:06 facundomedica

I think we need to be especially careful when OE is aborted, because finalizeBlockState could be left in a bad state.

  1. InitChain happens, finalizeBlockState is not nil
  2. During FinalizeBlock we check if finalizeBlockState == nil, otherwise we re-use it (because we want to be able to access any state written during InitChain)
  3. If we abort, we need to be able to reset to the state that we had right after InitChain.

Maybe to keep it simple we can execute OE only if height > initialHeight. That way we can just clear up the state 100% of the times

facundomedica avatar Jun 30 '23 10:06 facundomedica

@facundomedica your pull request is missing a changelog!

github-actions[bot] avatar Aug 02 '23 15:08 github-actions[bot]

Still need to figure out if/why make this optional, but it's R4R

facundomedica avatar Aug 02 '23 15:08 facundomedica

Still need to figure out if/why make this optional, but it's R4R

I propose that the optionality of this roll up under the runtime WG discussions in that two separate implementations of Consensus could provide either synchronous or async optimistic execution. Perhaps they share some code. It'd be up to the app to pass in one or the other implementation.

For now we're stuck with a feature flag, is that pretty ugly to weave in?

kocubinski avatar Aug 03 '23 14:08 kocubinski

The thing is, why would someone turn off this feature? Are there any benefits in doing that? Do chains care when the block execution happens?

facundomedica avatar Aug 07 '23 09:08 facundomedica

Overall I am not in favor of the usage of mutex. I feel we're violating the golang principle of "don't communicate by sharing memory, share memory by communicating". Aligning with that principal here I think looks like using channels as the primary communication mechanism.

I think internalFinalizeBlock should take a cancellable context.Context replacing the need to call ShouldAbort at intervals. In general this looks like:

for {
  select {
  case <-ctx.Done():
    return ctx.Err()
  default:
     // do a piece of work
     }
 }    

So internalFinalizeBlock needs to iterate through its chunks of work a bit differently to accommodate the select multiplexer. I find code written this way much easier to reason about, find bugs in, and extend.

Did you consider an approach like this during implementation already?

kocubinski avatar Aug 17 '23 15:08 kocubinski

Results of local testing with 4 nodes with and without emulated network latency:

AVG DELTA:  2572.1724137931033 without OE, no delay
AVG DELTA:  4400.122448979592 without OE, with delay
AVG DELTA:  4607.166666666667 without OE, with delay, with jitter

AVG DELTA:  2245.1176470588234 with OE, no delay
AVG DELTA:  3073.6326530612246 with OE, with delay
AVG DELTA:  3116.1136363636365 with OE, with delay, with jitter

Used pumba:

./pumba_darwin_arm64 netem --duration 240s delay --time 500
./pumba_darwin_arm64 netem --duration 240s delay --time 500 --jitter 100

A fixed 2s sleep was put in place in BeginBlock to get consistent results, meaning that the actual block execution always took 2s

facundomedica avatar Aug 21 '23 13:08 facundomedica

I tried cleaning up the concurrency model with little success, now I'm passing a context to internalFinalizeBlock which makes it unaware of the whole OE thing, it uses this context just to know if it needs to return early.

Using channels is a cleaner way to do this, but the problem is that the function we are passing as our FinalizeBlockFn has access to baseapp and to its properties, so we have partial control over what's going on with the state. This ended up in getting wrong app hashes from time to time.

A great improvement would be to have FinalizeBlock to use only whatever is passed to it and with little to no access to baseapp. This way we could run the entire block completely concurrently by passing only cached copies of the contexts and values; and by the time we abort we just discard all of these. On the other hand, this could complicated this a bunch given that we'd need to make sure no other process modifies the base contexts and values being passed. Also I don't know if it's possible at all without a major refactor on baseapp

facundomedica avatar Aug 27 '23 11:08 facundomedica

I've added a check so we don't start optimistic execution when ProcessProposal returns a "reject"

facundomedica avatar Sep 09 '23 05:09 facundomedica

@mergifyio backport release/v0.50.x

tac0turtle avatar Oct 23 '23 10:10 tac0turtle

backport release/v0.50.x

✅ Backports have been created

mergify[bot] avatar Oct 23 '23 10:10 mergify[bot]

Description

RFC: #16499

Closes: #XXXX


Author Checklist

*All items are required. Please add a note to the item if the item is not applicable and

please add links to any relevant follow up issues.*

I have...

  • [ ] included the correct type prefix in the PR title

  • [ ] added ! to the type prefix if API or client breaking change

  • [ ] targeted the correct branch (see PR Targeting)

  • [ ] provided a link to the relevant issue or specification

  • [ ] followed the guidelines for building modules

  • [ ] included the necessary unit and integration tests

  • [ ] added a changelog entry to CHANGELOG.md

  • [ ] included comments for documenting Go code

  • [ ] updated the relevant documentation or specification

  • [ ] reviewed "Files changed" and left comments if necessary

  • [ ] confirmed all CI checks have passed

Reviewers Checklist

*All items are required. Please add a note if the item is not applicable and please add

your handle next to the items reviewed if you only reviewed selected items.*

I have...

  • [ ] confirmed the correct type prefix in the PR title

  • [ ] confirmed ! in the type prefix if API or client breaking change

  • [ ] confirmed all author checklist items have been addressed

  • [ ] reviewed state machine logic

  • [ ] reviewed API design and naming

  • [ ] reviewed documentation is accurate

  • [ ] reviewed tests and test coverage

  • [ ] manually tested (if applicable)

CASABECI avatar Nov 06 '23 15:11 CASABECI

Great work @facundomedica 👏

alexanderbez avatar Nov 11 '23 22:11 alexanderbez