cosmos-sdk icon indicating copy to clipboard operation
cosmos-sdk copied to clipboard

[Feature]: save iavl tree asynchronously to improve throughput

Open yihuang opened this issue 1 year ago • 15 comments

Summary

In recent development of memiavl, we find that the "async commit" mode increase the block replaying throughput tremendously(40%), I think it'll have even bigger impact in normal iavl, because the db overhead is larger here.

The idea is in abci commit, we only need to calculate the root hash using the in memory nodes (working nodes) immediately, but the db saving part can run concurrently without blocking the consensus state machine. because of the nature of the blockchain architecture, we can lose a few blocks state changes, cometbft will just replay them on startup, as long as the atomicity is handled right.

One can also consider this as an upgrade of the previous no-db-sync proposal: https://github.com/cosmos/cosmos-sdk/issues/14966, they share similar side-effects, this proposal wants to run the whole SaveVersion at background.

Problem Definition

The IAVL tree commit contributes a lot to the total block execution time (TODO profiling numbers), we can achieve much higher throughput by commit the iavl tree to db in background, it's especially important when the node is catching up with the chain, or the chain want to push the TPS to the limit.

Proposal

  • In iavl store's Commit, return the WorkingHash instead as the CommitID, but do the SaveVersion at background.
  • In upgrade module, before the panic of the upgrade msg, call a new API to wait for the asynchronous commit to finish, so the upgrade can success after restart. (do similar thing in other places that need graceful restart if there's any)
  • make sure the iavl implementation is ok to have SaveVersion runs concurrently.

yihuang avatar May 16 '23 07:05 yihuang

This SGTM! There me be a bit more complexity in replay, and knowing which block to start from on restart! (And tracking which blocks are fully committed -- this becomes more important / harder when we make the write size smaller)

(And we may need to guarantee that you've succesfully committed to a disk a block at least k heights back for TM. Should generally be fine to just store the state diffs quickly, and re-merkelize on the fly for resync though)

This feels like a great idea to me!

ValarDragon avatar May 16 '23 07:05 ValarDragon

There me be a bit more complexity in replay, and knowing which block to start from on restart! (And tracking which blocks are fully committed -- this becomes more important / harder when we make the write size smaller)

we can rely on the status of latest commit info of the rootmulti for the last fully committed block, which is always written after all the stores.

yihuang avatar May 16 '23 07:05 yihuang

~~The main concern for now is if the current IAVL implementation concurrency safe for this use case.~~

Thread-safty is not the concern, but the mutex is, it won't be able to achieve better throughput if the reading is blocked by the writing. I think we can make the changes in sdk first, @cool-develope can optimize the iavl later, memiavl can take full advantage of this.

yihuang avatar May 16 '23 07:05 yihuang

I feel this could be similar to what we'll do in the optimistic execution, where the commit is not "actual" commit.

yihuang avatar May 16 '23 08:05 yihuang

Love this idea @yihuang -- thanks for opening a PR too!

alexanderbez avatar May 16 '23 13:05 alexanderbez

Unforturntely it don't work with current IAVL I think, the main issue is the WorkingHash don't modify current version, so if the next Commit happens before the background SaveVersion, it'll use the old version. We'll need some refactoring in iavl to take advantage of this. @cool-develope

yihuang avatar May 17 '23 03:05 yihuang

while we cant do multiple iavl tree commits, we could do 1, where we return commit to comet right away and save asynchronous. This would help in a single case and would still have performance benefits

tac0turtle avatar May 17 '23 07:05 tac0turtle

while we cant do multiple iavl tree commits, we could do 1, where we return commit to comet right away and save asynchronous. This would help in a single case and would still have performance benefits

yeah, that is possible, something like this?

// both channels are unbuffered
commitChan := make(chan int64)
commitResultChan := make(chan error)

func Commit() {
  // make sure the last async commit finished
  err := <- commitResultChan
  // calculate root hash for the new commit
  commitID := store.WorkingCommitID()
  // submit new commit request
  commitChan <- commitID.Version
  return commitID
}

// async commit worker
go func() {
  for v := range commitChan {
    _, _, err := store.SaveVersion()
    commitResultChan <- err
  }
}()

yihuang avatar May 17 '23 08:05 yihuang

yea that seems right, it would be interesting to see what sort of speed up that would give. I dont forsee this helping too much on sync but at the head of the chain it could help

tac0turtle avatar May 17 '23 08:05 tac0turtle

I think on return to comet, we need to be sure we could replay to this block on our own. Which means we need to synchronously commit to the state diff, but not the merkelization

ValarDragon avatar May 17 '23 09:05 ValarDragon

@tac0turtle are you imagining that during sync, your not committing to disk at all? (Or only committing at "wide" intervals, e.g. snapshots)

otherwise not sure I follow how this wouldn't help? Your letting there be multiple db txs in parallel for writes to disk, so committing is happening in parallel to CPU processing of nodes (whereas today you don't use CPU & commit at once)

ValarDragon avatar May 17 '23 09:05 ValarDragon

I think on return to comet, we need to be sure we could replay to this block on our own

I think we don't need to make sure that, as long as the cometbft can replay the missing blocks from block store on abci handshake.

not committing to disk at all? (Or only committing at "wide" intervals, e.g. snapshots

this won't be an archive node though, some node just need to provide full service at every historical versions, I guess.

yihuang avatar May 17 '23 09:05 yihuang

Unforturntely it don't work with current IAVL I think, the main issue is the WorkingHash don't modify current version, so if the next Commit happens before the background SaveVersion, it'll use the old version. We'll need some refactoring in iavl to take advantage of this. @cool-develope

yeah, IMO we should separate the iavl into three parts, WorkingHash, Commit , SaveStorage in Commit stage, just finalize the iavl version in memory and nodes will be stored background in SaveStorage. it won't break any APIs of SDK

cool-develope avatar May 17 '23 14:05 cool-develope

this is very similar to the work aditya did on iavl a few years ago which was reverted, we should look at that work to see what we can reuse. A good time to dive into that is after we get a decent portion of the way through store v2 rewrite

tac0turtle avatar May 17 '23 20:05 tac0turtle

not sure that prior IAVL work is helpful, this all becomes much cleaner write-side with the new IAVL storage format. (Every version is independent writes!)

ValarDragon avatar May 18 '23 10:05 ValarDragon