cosmos-sdk
cosmos-sdk copied to clipboard
feat: Optimistic Execution
Description
RFC: #16499 Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.
I have...
- [ ] included the correct type prefix in the PR title
- [ ] added
!to the type prefix if API or client breaking change - [ ] targeted the correct branch (see PR Targeting)
- [ ] provided a link to the relevant issue or specification
- [ ] followed the guidelines for building modules
- [ ] included the necessary unit and integration tests
- [ ] added a changelog entry to
CHANGELOG.md - [ ] included comments for documenting Go code
- [ ] updated the relevant documentation or specification
- [ ] reviewed "Files changed" and left comments if necessary
- [ ] confirmed all CI checks have passed
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.
I have...
- [ ] confirmed the correct type prefix in the PR title
- [ ] confirmed
!in the type prefix if API or client breaking change - [ ] confirmed all author checklist items have been addressed
- [ ] reviewed state machine logic
- [ ] reviewed API design and naming
- [ ] reviewed documentation is accurate
- [ ] reviewed tests and test coverage
- [ ] manually tested (if applicable)
I think we need to be especially careful when OE is aborted, because finalizeBlockState could be left in a bad state.
InitChainhappens,finalizeBlockStateis not nil- During FinalizeBlock we check if
finalizeBlockState == nil, otherwise we re-use it (because we want to be able to access any state written duringInitChain) - If we abort, we need to be able to reset to the state that we had right after
InitChain.
Maybe to keep it simple we can execute OE only if height > initialHeight. That way we can just clear up the state 100% of the times
@facundomedica your pull request is missing a changelog!
Still need to figure out if/why make this optional, but it's R4R
Still need to figure out if/why make this optional, but it's R4R
I propose that the optionality of this roll up under the runtime WG discussions in that two separate implementations of Consensus could provide either synchronous or async optimistic execution. Perhaps they share some code. It'd be up to the app to pass in one or the other implementation.
For now we're stuck with a feature flag, is that pretty ugly to weave in?
The thing is, why would someone turn off this feature? Are there any benefits in doing that? Do chains care when the block execution happens?
Overall I am not in favor of the usage of mutex. I feel we're violating the golang principle of "don't communicate by sharing memory, share memory by communicating". Aligning with that principal here I think looks like using channels as the primary communication mechanism.
I think internalFinalizeBlock should take a cancellable context.Context replacing the need to call ShouldAbort at intervals. In general this looks like:
for {
select {
case <-ctx.Done():
return ctx.Err()
default:
// do a piece of work
}
}
So internalFinalizeBlock needs to iterate through its chunks of work a bit differently to accommodate the select multiplexer. I find code written this way much easier to reason about, find bugs in, and extend.
Did you consider an approach like this during implementation already?
Results of local testing with 4 nodes with and without emulated network latency:
AVG DELTA: 2572.1724137931033 without OE, no delay
AVG DELTA: 4400.122448979592 without OE, with delay
AVG DELTA: 4607.166666666667 without OE, with delay, with jitter
AVG DELTA: 2245.1176470588234 with OE, no delay
AVG DELTA: 3073.6326530612246 with OE, with delay
AVG DELTA: 3116.1136363636365 with OE, with delay, with jitter
Used pumba:
./pumba_darwin_arm64 netem --duration 240s delay --time 500
./pumba_darwin_arm64 netem --duration 240s delay --time 500 --jitter 100
A fixed 2s sleep was put in place in BeginBlock to get consistent results, meaning that the actual block execution always took 2s
I tried cleaning up the concurrency model with little success, now I'm passing a context to internalFinalizeBlock which makes it unaware of the whole OE thing, it uses this context just to know if it needs to return early.
Using channels is a cleaner way to do this, but the problem is that the function we are passing as our FinalizeBlockFn has access to baseapp and to its properties, so we have partial control over what's going on with the state. This ended up in getting wrong app hashes from time to time.
A great improvement would be to have FinalizeBlock to use only whatever is passed to it and with little to no access to baseapp. This way we could run the entire block completely concurrently by passing only cached copies of the contexts and values; and by the time we abort we just discard all of these. On the other hand, this could complicated this a bunch given that we'd need to make sure no other process modifies the base contexts and values being passed. Also I don't know if it's possible at all without a major refactor on baseapp
I've added a check so we don't start optimistic execution when ProcessProposal returns a "reject"
@mergifyio backport release/v0.50.x
backport release/v0.50.x
✅ Backports have been created
- #18205 feat: Optimistic Execution (backport #16581) has been created for branch
release/v0.50.xbut encountered conflicts
Description
RFC: #16499
Closes: #XXXX
Author Checklist
*All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.*
I have...
[ ] included the correct type prefix in the PR title
[ ] added
!to the type prefix if API or client breaking change[ ] targeted the correct branch (see PR Targeting)
[ ] provided a link to the relevant issue or specification
[ ] followed the guidelines for building modules
[ ] included the necessary unit and integration tests
[ ] added a changelog entry to
CHANGELOG.md[ ] included comments for documenting Go code
[ ] updated the relevant documentation or specification
[ ] reviewed "Files changed" and left comments if necessary
[ ] confirmed all CI checks have passed
Reviewers Checklist
*All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.*
I have...
[ ] confirmed the correct type prefix in the PR title
[ ] confirmed
!in the type prefix if API or client breaking change[ ] confirmed all author checklist items have been addressed
[ ] reviewed state machine logic
[ ] reviewed API design and naming
[ ] reviewed documentation is accurate
[ ] reviewed tests and test coverage
[ ] manually tested (if applicable)
Great work @facundomedica 👏