[Bug]: baseapp create query context has mutex contention on IAVL versioning
Is there an existing issue for this?
- [x] I have searched the existing issues
What happened?
We currently see nodes crashing from too many threads when they receive many queries. Over 20% of the goroutines are blocked on mutexes in IAVL just to check whether we have the relevant version. (2878 goroutines stuck here)
But these are queries that don't specify a height, so this contention is unnecessary in the first place. Furthermore, many of the queries themselves are blocked on IAVL reads, which significantly exacerbates the problem, leading to crashes and to all queries being processed too slowly.
Queries that fetch the IAVL version should use a lock-free mechanism. E.g. a CAS-based scheme: queries atomically read a "supported versions" value within IAVL, and we update it with a CAS op on new block/prune. Or maybe just an atomic op that handles getting the latest version.
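To make the proposal above concrete, here is a minimal sketch of what such a lock-free version check could look like. Everything in it is hypothetical: `versionTracker`, `versionRange`, `Commit`, and `PruneTo` are illustrative names, not part of IAVL or the SDK, and it assumes the supported versions form one contiguous range, which may not hold for every pruning setup. It needs Go 1.19+ for `atomic.Pointer`.

```go
package main

import "sync/atomic"

// versionRange is a hypothetical immutable snapshot of the versions
// a store can serve. Assumption: versions form one contiguous range.
type versionRange struct {
	first  int64 // oldest version that has not been pruned
	latest int64 // most recently committed version
}

// versionTracker is an illustrative helper, not an IAVL type.
// Readers never take a lock; writers publish a new snapshot via CAS.
type versionTracker struct {
	current atomic.Pointer[versionRange]
}

func newVersionTracker(first, latest int64) *versionTracker {
	t := &versionTracker{}
	t.current.Store(&versionRange{first: first, latest: latest})
	return t
}

// Latest returns the latest committed version with a single atomic load.
func (t *versionTracker) Latest() int64 {
	return t.current.Load().latest
}

// Has reports whether version lies inside the current snapshot.
func (t *versionTracker) Has(version int64) bool {
	r := t.current.Load()
	return version >= r.first && version <= r.latest
}

// update copies the current snapshot, applies fn to the copy, and
// publishes it with compare-and-swap, retrying if another writer won.
func (t *versionTracker) update(fn func(versionRange) versionRange) {
	for {
		old := t.current.Load()
		next := fn(*old)
		if t.current.CompareAndSwap(old, &next) {
			return
		}
	}
}

// Commit would be called when a new block is committed.
func (t *versionTracker) Commit(version int64) {
	t.update(func(r versionRange) versionRange {
		if version > r.latest {
			r.latest = version
		}
		return r
	})
}

// PruneTo would be called after pruning everything below first.
func (t *versionTracker) PruneTo(first int64) {
	t.update(func(r versionRange) versionRange {
		if first > r.first {
			r.first = first
		}
		return r
	})
}
```

With something like this, a query without a height would only call `Latest()`, and a query with a height would call `Has(height)`; neither touches a mutex. Commit and prune happen rarely compared with queries, so the CAS retry loop should almost never spin.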
Cosmos SDK Version
0.50
How to reproduce?
Run many slightly slow queries, e.g. cosmwasm queries. The node is then liable to spawn too many threads and crash. If you profile via pprof to see where the threads are, you get graphs as above.
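For anyone who wants to check where goroutines pile up without a full pprof graph, the following is a small sketch using only the Go standard library. `goroutineDump` is a hypothetical helper name and would have to be called from inside the node process; how you hook it in is left open.

```go
package main

import (
	"runtime/pprof"
	"strings"
)

// goroutineDump is a hypothetical helper that returns the goroutine
// profile as text. With debug=1, goroutines that share a stack are
// grouped and prefixed with their count, so a large pile-up on one
// mutex shows up as a single entry with a large number.
func goroutineDump() (string, error) {
	var sb strings.Builder
	if err := pprof.Lookup("goroutine").WriteTo(&sb, 1); err != nil {
		return "", err
	}
	return sb.String(), nil
}
```

Sorting the entries by count should make the IAVL version check stand out if the node is hitting this issue.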
Is this on IAVL or the underlying DB? goleveldb is not well optimised for single-writer, multiple-reader workloads. Have you tried testing this with pebbledb?
Hey I can add something to this.
Recently I made a trading bot using osmosis and sqs.
Actually maybe I'm not adding much but giving a very strong "me too".
What Dev is describing here is exactly what I experienced.