
Intermittent deadlock when Closing a tree

Open mark-rushakoff opened this issue 1 year ago • 1 comments

I am frequently encountering this in tests involving multiple SDK apps in the same process, on iavl tag v1.3.2. I am working with the latest cosmos-sdk commit, where async pruning is fixed to true.

The relevant code snippets are:

https://github.com/cosmos/iavl/blob/d89d5d22030e8fea42ce5406743ce45b75ad86ea/nodedb.go#L1122-L1129

and

https://github.com/cosmos/iavl/blob/d89d5d22030e8fea42ce5406743ce45b75ad86ea/nodedb.go#L599-L608

(*nodeDB).startPruning runs in its own goroutine, created during newNodeDB. (*nodeDB).Close is called on a separate goroutine, e.g. from closing an SDK commitment store. The deadlock unfolds as follows:

  1. The Close goroutine acquires the lock on ndb.mtx
  2. Concurrently, the startPruning goroutine enters the default case and attempts to call ndb.mtx.Lock(), but it cannot acquire the lock until the Close goroutine releases it
  3. The Close goroutine, still holding ndb.mtx, then blocks reading from ndb.done, which is never signaled because the startPruning goroutine cannot advance past acquiring the lock

mark-rushakoff avatar Dec 13 '24 21:12 mark-rushakoff

@julienrbrt @alpe why is this issue still open if https://github.com/cosmos/iavl/pull/1023 resolved it?

Maybe it is accidentally still open because that PR merged to a release branch and not to main, which would have auto-closed this issue.

rootulp avatar Jul 23 '25 14:07 rootulp