cosmos-sdk
cosmos-sdk copied to clipboard
Store: Concurrent iterators cause panic with iavl 0.19
Summary of Bug
Injective reported a concurrent issue with iterators and fast node.
https://gist.github.com/achilleas-kal/1751d83e22b313a7f759add9ff21998f
We should write a couple tests in the store to test concurrency of trees being used in the sdk.
- [ ] Write test for concurrency in the store package
- [ ] fix iavl concurrent issue
We've had to rollback the cachekv iterator and store code to the version from v0.44.5, as the vanilla SDK has always resulted in panics for us. This rolled back code has worked fine for us https://github.com/InjectiveLabs/cosmos-sdk/commit/7c184a917557ca806293b8b507d7378e558d6d4e
Here's some more context from a Hackerone bug report we submitted many months ago.
And then we run a long fuzz test, so I cannot tell you exactly how to reproduce it, since it's a fuzz test. But I can tell you the reason for the deadlock. When running the tests with -race and some additional logging, you can see the output here https://gist.github.com/gorgos/1e9b4e194b67c9c5c8e4aed064a0e774. The line prefixStore.ReverseIterator(nil, nil) calls cachekv/memiterator.go which calls newMemDBIterator. (see first stack trace) This acquires a db readlock https://github.com/tendermint/tm-db/blob/6f9a08cd4522fb13d8df69b1694587d5015114fc/memdb_iterator.go#L39 which is closed in the inner function below that runs in parallel. This function blocks in line https://github.com/tendermint/tm-db/blob/6f9a08cd4522fb13d8df69b1694587d5015114fc/memdb_iterator.go#L57. (see last stack trace) So far this works as expected I assume, with mem db waiting for iterator to either read or close. Unfortunately, the second call prefixStore.Iterator(nil, nil) goes into dirtyItems (see second stack trace) which then calls clearUnsortedCacheSubset and then tries to set a value inside the mem db here https://github.com/cosmos/cosmos-sdk/blob/a5128e76185c054ccfe2402fa6593a23be7f8be3/store/cachekv/store.go#L357. But this doesn't work since the previous process still has a mem db readlock. We have tracked down the issue that introduced this behavior to this one https://github.com/cosmos/cosmos-sdk/pull/10026/commits/b9e6389276010ddb17d29358d91e2aee39747de5.
The hacked code in CacheKV and tm-db continuously seem to bite us in the ass...
At first, I thought this issue was caused by the update on the SDK (https://github.com/cosmos/cosmos-sdk/pull/12626) which now iterates through the tree without checking if the tree is mutable or not. This update looks wrong given that the MutableTree Iterator has this contract: no updates are made to the tree while an iterator is active.
.
But even then, trying to iterate through the ImmutableTree gives similar results which leads me to think that the issue is in the fast cache/node/iterator implemented in IAVL v0.18.
I was able to replicate this issue with these:
func TestIterateConcurrency(t *testing.T) {
tree, err := getTestTree(0)
require.NoError(t, err)
for i := 0; i < 100; i++ {
go func() {
for j := 0; j < 1000000; j++ {
tree.Set([]byte(fmt.Sprintf("%d%d", i, j)), rand.Bytes(1))
}
}()
tree.Iterate(func(key []byte, value []byte) bool {
return false
})
}
}
// TestConcurrency throws "fatal error: concurrent map iteration and map write" and
// also sometimes "fatal error: concurrent map writes"
func TestIteratorConcurrency(t *testing.T) {
tree := setupMutableTree(t)
for i := 0; i < 1000; i++ {
go func() {
for j := 0; j < 1000000; j++ {
tree.Set([]byte(fmt.Sprintf("%d%d", i, j)), rand.Bytes(1))
}
}()
itr, _ := tree.Iterator(nil, nil, true)
for ; itr.Valid(); itr.Next() {
}
}
}
These tests don't fail on v0.17.3 or v0.18.0 but they do fail on v0.19.0 (after fast cache was added). There are also some new unanswered comments there that we might want to take a look at https://github.com/cosmos/iavl/pull/468
cc: @p0mvn I would appreciate your thoughts on this one 🙏
Hi @facundomedica
When does this pattern of concurrently Set()
ing and Iterate()
ing occur in the SDK?
From my understanding, all data is kept in cachekv until Commit()
, at which point we Write()
into IAVL. I don't think there are any iterations during Commit()
so this problem shouldn't happen.
On a quick look, this comment might be related to the issue: https://github.com/cosmos/iavl/pull/468#discussion_r959060203
From my memory, we did not synchronize these maps to avoid performance overhead based on the knowledge that this read/write pattern should not happen. If the read/write patterns have changed, the maps in the IAVL discussion probably need to be synchronized
When does this pattern of concurrently Set()ing and Iterate()ing occur in the SDK?
In the SDK I'm not sure we have that happening, but for sure it's possible to do so; which is what we are seeing in the stack trace shared by Injective.
@marbar3778 what do you think of: "this read/write pattern should not happen"?
@marbar3778 what do you think of: "this read/write pattern should not happen"?
This is the main problem. Behaviour changed because it wasn't known and still isn't known. I think we may want to add a mutex to the value to replicate previous behaviour and then define the behaviour we should aim to move towards and refactor accordingly.
When does this pattern of concurrently
Set()
ing andIterate()
ing occur in the SDK?
Looking in the sdk is not best here. Users define how they interact with the store because its not well defined in the documentation. SDK is a poor example of complex usage of the store package imo
Yeah, but also keep in mind the IAVL implementation states various invariants/contracts that must be held. So either we improve docs around this and state what is and is not allowed, or we look into supporting this behavior. Even in the later case, we should improve expected usage/contract docs.
@facundomedica could we document expected behaviour of iavl in the sdk docs so users understand what is possible and what is not. Since you added the test in iavl I think that's sufficient
@facundomedica could we document expected behaviour of iavl in the sdk docs so users understand what is possible and what is not. Since you added the test in iavl I think that's sufficient
Here: https://github.com/cosmos/cosmos-sdk/pull/13386