cosmos-sdk
cosmos-sdk copied to clipboard
[Feature]: save iavl tree asynchronously to improve throughput
Summary
In recent development of memiavl, we find that the "async commit" mode increase the block replaying throughput tremendously(40%), I think it'll have even bigger impact in normal iavl, because the db overhead is larger here.
The idea is in abci commit, we only need to calculate the root hash using the in memory nodes (working nodes) immediately, but the db saving part can run concurrently without blocking the consensus state machine. because of the nature of the blockchain architecture, we can lose a few blocks state changes, cometbft will just replay them on startup, as long as the atomicity is handled right.
One can also consider this as an upgrade of the previous no-db-sync proposal: https://github.com/cosmos/cosmos-sdk/issues/14966, they share similar side-effects, this proposal wants to run the whole SaveVersion
at background.
Problem Definition
The IAVL tree commit contributes a lot to the total block execution time (TODO profiling numbers), we can achieve much higher throughput by commit the iavl tree to db in background, it's especially important when the node is catching up with the chain, or the chain want to push the TPS to the limit.
Proposal
- In iavl store's
Commit
, return theWorkingHash
instead as theCommitID
, but do theSaveVersion
at background. - In upgrade module, before the panic of the upgrade msg, call a new API to wait for the asynchronous commit to finish, so the upgrade can success after restart. (do similar thing in other places that need graceful restart if there's any)
- make sure the iavl implementation is ok to have
SaveVersion
runs concurrently.
This SGTM! There me be a bit more complexity in replay, and knowing which block to start from on restart! (And tracking which blocks are fully committed -- this becomes more important / harder when we make the write size smaller)
(And we may need to guarantee that you've succesfully committed to a disk a block at least k heights back for TM. Should generally be fine to just store the state diffs quickly, and re-merkelize on the fly for resync though)
This feels like a great idea to me!
There me be a bit more complexity in replay, and knowing which block to start from on restart! (And tracking which blocks are fully committed -- this becomes more important / harder when we make the write size smaller)
we can rely on the status of latest commit info of the rootmulti for the last fully committed block, which is always written after all the stores.
~~The main concern for now is if the current IAVL implementation concurrency safe for this use case.~~
Thread-safty is not the concern, but the mutex is, it won't be able to achieve better throughput if the reading is blocked by the writing. I think we can make the changes in sdk first, @cool-develope can optimize the iavl later, memiavl can take full advantage of this.
I feel this could be similar to what we'll do in the optimistic execution, where the commit is not "actual" commit.
Love this idea @yihuang -- thanks for opening a PR too!
Unforturntely it don't work with current IAVL I think, the main issue is the WorkingHash
don't modify current version, so if the next Commit
happens before the background SaveVersion
, it'll use the old version.
We'll need some refactoring in iavl to take advantage of this. @cool-develope
while we cant do multiple iavl tree commits, we could do 1, where we return commit to comet right away and save asynchronous. This would help in a single case and would still have performance benefits
while we cant do multiple iavl tree commits, we could do 1, where we return commit to comet right away and save asynchronous. This would help in a single case and would still have performance benefits
yeah, that is possible, something like this?
// both channels are unbuffered
commitChan := make(chan int64)
commitResultChan := make(chan error)
func Commit() {
// make sure the last async commit finished
err := <- commitResultChan
// calculate root hash for the new commit
commitID := store.WorkingCommitID()
// submit new commit request
commitChan <- commitID.Version
return commitID
}
// async commit worker
go func() {
for v := range commitChan {
_, _, err := store.SaveVersion()
commitResultChan <- err
}
}()
yea that seems right, it would be interesting to see what sort of speed up that would give. I dont forsee this helping too much on sync but at the head of the chain it could help
I think on return to comet, we need to be sure we could replay to this block on our own. Which means we need to synchronously commit to the state diff, but not the merkelization
@tac0turtle are you imagining that during sync, your not committing to disk at all? (Or only committing at "wide" intervals, e.g. snapshots)
otherwise not sure I follow how this wouldn't help? Your letting there be multiple db txs in parallel for writes to disk, so committing is happening in parallel to CPU processing of nodes (whereas today you don't use CPU & commit at once)
I think on return to comet, we need to be sure we could replay to this block on our own
I think we don't need to make sure that, as long as the cometbft can replay the missing blocks from block store on abci handshake.
not committing to disk at all? (Or only committing at "wide" intervals, e.g. snapshots
this won't be an archive node though, some node just need to provide full service at every historical versions, I guess.
Unforturntely it don't work with current IAVL I think, the main issue is the
WorkingHash
don't modify current version, so if the nextCommit
happens before the backgroundSaveVersion
, it'll use the old version. We'll need some refactoring in iavl to take advantage of this. @cool-develope
yeah, IMO we should separate the iavl into three parts, WorkingHash
, Commit
, SaveStorage
in Commit
stage, just finalize the iavl version in memory and nodes will be stored background in SaveStorage
.
it won't break any APIs of SDK
this is very similar to the work aditya did on iavl a few years ago which was reverted, we should look at that work to see what we can reuse. A good time to dive into that is after we get a decent portion of the way through store v2 rewrite
not sure that prior IAVL work is helpful, this all becomes much cleaner write-side with the new IAVL storage format. (Every version is independent writes!)