Intermittent deadlock when Closing a tree
I frequently hit this in tests that run multiple SDK apps in the same process, on iavl tag v1.3.2. I am working with the latest cosmos-sdk commit, where async pruning is hard-coded to true.
The relevant code snippets are:
https://github.com/cosmos/iavl/blob/d89d5d22030e8fea42ce5406743ce45b75ad86ea/nodedb.go#L1122-L1129
and
https://github.com/cosmos/iavl/blob/d89d5d22030e8fea42ce5406743ce45b75ad86ea/nodedb.go#L599-L608
`(*nodeDB).startPruning` runs in its own goroutine, created during `newNodeDB`. `(*nodeDB).Close` is called on a separate goroutine, e.g. when an SDK commitment store is closed. The deadlock unfolds as follows:
- The `Close` goroutine acquires the lock on `ndb.mtx`
- Concurrently, the `startPruning` goroutine enters the default case and attempts to call `ndb.mtx.Lock()`, but it cannot acquire the lock until the `Close` goroutine releases it
- Therefore, the `Close` goroutine is blocked reading from `ndb.done` because the `startPruning` goroutine cannot advance past acquiring the lock
@julienrbrt @alpe why is this issue still open if https://github.com/cosmos/iavl/pull/1023 resolved it?
Maybe it was left open by accident: that PR was merged into a release branch rather than main, and only a merge into main would have auto-closed this issue.