[Bug]: baseapp create query context has mutex contention on IAVL versioning

Open ValarDragon opened this issue 10 months ago • 2 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

What happened?

[Image: pprof graph of blocked goroutines]

We currently see nodes crashing from too many threads when they receive many queries. Over 20% of the goroutines are blocked on mutexes in IAVL just to check whether we have the relevant version (2878 goroutines stuck here).

But these are queries that don't specify a height, so this contention is unnecessary in the first place. Furthermore, many of the queries themselves are blocked on IAVL reads, which significantly exacerbates the problem, leading to crashes and to all queries being processed too slowly.

We should make the IAVL version lookup on the query path lock-free. E.g. an atomic load to fetch a "supported versions" set within IAVL, which we update with a CAS op on new block/prune. Or maybe just an atomic op to handle getting the latest version.

Cosmos SDK Version

0.50

How to reproduce?

Run many slightly slow queries, e.g. cosmwasm queries. The node is then liable to crash from too many threads. If you profile via pprof to see where the threads are, you get graphs like the one above.
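For anyone who wants to see where goroutines pile up in their own process, a standalone sketch using Go's standard `runtime/pprof` package is below. It is not how the node exposes profiling; `goroutineDump` is a hypothetical helper, and the contended mutex in `main` is only a stand-in for the IAVL lock.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"sync"
	"time"
)

// goroutineDump returns a text dump of all goroutine stacks.
// Debug level 1 groups goroutines that share an identical stack,
// which makes a pile-up on one mutex easy to spot.
func goroutineDump() string {
	var buf bytes.Buffer
	_ = pprof.Lookup("goroutine").WriteTo(&buf, 1)
	return buf.String()
}

func main() {
	var mu sync.Mutex
	mu.Lock() // hold the lock so every worker below blocks on it

	for i := 0; i < 100; i++ {
		go func() {
			mu.Lock()
			mu.Unlock()
		}()
	}

	// Give the workers time to reach the lock before sampling.
	time.Sleep(100 * time.Millisecond)
	fmt.Println(goroutineDump())

	mu.Unlock()
}
```

In the dump, the blocked workers should show up as one large group whose stack ends in the mutex lock call, which is the same shape as the graph in this issue.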

ValarDragon avatar Feb 13 '25 16:02 ValarDragon

Is this in IAVL or the underlying DB? goleveldb is not well optimised for single-writer, multiple-reader workloads. Have you tried testing this with pebbledb?

tac0turtle avatar Feb 18 '25 10:02 tac0turtle

Hey I can add something to this.

Recently I made a trading bot using osmosis and sqs.

Actually, maybe I'm not adding much beyond a very strong "me too".

What Dev is describing here is exactly what I experienced.

faddat avatar Jun 19 '25 11:06 faddat