Deadlock in BeginRo
System information
Erigon version (./erigon --version): 2022.99.99-dev-1303023c
OS & Version: Ubuntu 20.04
Commit hash: 1303023cb8949050250384a2713c6f65896dd27e
Expected behaviour
No deadlock
Actual behaviour
Deadlock :( It gets stuck in the semaphore.Acquire call.
Steps to reproduce the behaviour
Send about 5k eth_call requests simultaneously via 2 websocket connections.
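For context, here is a minimal Go sketch of the kind of burst that reproduces this; it is not the actual script used in the report, and the endpoint URL, the call parameters, and the use of go-ethereum's rpc client are assumptions:

```go
// loadtest.go: fire a burst of eth_call requests over a small number of
// websocket connections, roughly matching the reproduction described above.
package main

import (
	"context"
	"log"
	"sync"

	"github.com/ethereum/go-ethereum/rpc" // assumption: go-ethereum JSON-RPC client
)

func main() {
	const (
		conns    = 2    // websocket connections, as in the report
		requests = 5000 // total eth_call requests
	)

	// Open the websocket connections up front (endpoint is hypothetical).
	clients := make([]*rpc.Client, conns)
	for i := range clients {
		c, err := rpc.Dial("ws://127.0.0.1:8545")
		if err != nil {
			log.Fatal(err)
		}
		clients[i] = c
	}

	// Hypothetical call arguments; any cheap read-only call works here.
	callArgs := map[string]string{
		"to":   "0x0000000000000000000000000000000000000000",
		"data": "0x",
	}

	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func(c *rpc.Client) {
			defer wg.Done()
			var result string
			if err := c.CallContext(context.Background(), &result, "eth_call", callArgs, "latest"); err != nil {
				log.Printf("eth_call failed: %v", err)
			}
		}(clients[i%conns])
	}
	wg.Wait()
}
```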
Logs
pkill -SIGUSR1 erigon
goroutine 56608 [select, 10 minutes]:
golang.org/x/sync/semaphore.(*Weighted).Acquire(0xc0001500f0, {0x200d3b0, 0xc0cb6bbe90}, 0x1)
golang.org/x/[email protected]/semaphore/semaphore.go:60 +0x345
github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc0002791e0, {0x200d3b0?, 0xc0cb6bbe90?})
github.com/ledgerwatch/[email protected]/kv/mdbx/kv_mdbx.go:425 +0x85
github.com/ledgerwatch/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc0a8768800, {0x2017408, 0xc0cb6a1390})
github.com/ledgerwatch/[email protected]/kv/remotedbserver/server.go:84 +0x77
github.com/ledgerwatch/erigon-lib/gointerfaces/remote._KV_Tx_Handler({0x181dda0?, 0xc0a8768800}, {0x20128b8?, 0xc0cb6c24e0})
github.com/ledgerwatch/[email protected]/gointerfaces/remote/kv_grpc.pb.go:195 +0x9f
github.com/grpc-ecosystem/go-grpc-middleware/recovery.StreamServerInterceptor.func1({0x181dda0?, 0xc0a8768800?}, {0x20128b8?, 0xc0cb6c24e0?}, 0x77ce0c149e98?, 0xc0cb69d4e8?)
github.com/grpc-ecosystem/[email protected]/recovery/interceptors.go:51 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1({0x181dda0?, 0xc0a8768800?}, {0x20128b8?, 0xc0cb6c24e0?})
github.com/grpc-ecosystem/[email protected]/chain.go:49 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1({0x181dda0, 0xc0a8768800}, {0x20128b8, 0xc0cb6c24e0}, 0x1689740?, 0xc0cb6a1380?)
github.com/grpc-ecosystem/[email protected]/chain.go:58 +0xbf
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000000780, {0x2018460, 0xc0a89a1040}, 0xc0cb6beb40, 0xc0b2165f20, 0x2b4e0e0, 0x0)
google.golang.org/[email protected]/server.go:1565 +0xf26
google.golang.org/grpc.(*Server).handleStream(0xc000000780, {0x2018460, 0xc0a89a1040}, 0xc0cb6beb40, 0x0)
google.golang.org/[email protected]/server.go:1640 +0x9d6
google.golang.org/grpc.(*Server).serveStreams.func1.2()
google.golang.org/[email protected]/server.go:932 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
google.golang.org/[email protected]/server.go:930 +0x28a
Is it the same if you increase --db.read.concurrency?
Tried to increase from 32 to 200, didn't help. Can you reproduce it on your side?
Tried to run with --db.read.concurrency=5000 and it seems to be working. Is it OK to run the node with this setting?
But still, it's not quite correct behavior for the program to deadlock during a burst of load.
Eventually it got stuck :( Even with this setting. The first time I send ~5k requests it responds successfully, then it gets stuck the second time I send them. So I think it doesn't release the semaphore the same number of times it acquires it, does it?
Do you run a separate RPCDaemon? If yes: with --datadir? If not, try adding --datadir.
"Can you reproduce it on your side?" - I didn't try yet
Yes, I'm running a separate RPCDaemon; they are on the same machine, but in different containers.
Tried to run with --datadir and it seems to be working, even with db.read.concurrency=32.
So it has to be a deadlock when a transaction is created via remotedb?
OK, then it's a somewhat known issue with the remotedb gRPC server architecture. I advise you to work with --datadir for now. Don't know when we will be able to fix it.
I see, thanks. What is known about this bug at the moment? Maybe a little more context, so I can try to check this out myself?
What we know:
- Unclear whether it's really a deadlock, or the --db.read.concurrency limit is simply reached and all subsequent readers are just waiting for the previous ones.
- Golang is not very friendly to embedded C databases: when you hit IO inside C code, Go sees that the whole thread is blocked (the runtime can't schedule any goroutine onto that thread) and excludes it from its thread pool. As a result, with many parallel read transactions (each doing some IO) you end up with tons of threads (which is in general bad for performance), eventually hit Go's limit of 10K threads, and the app crashes. This is the reason we have the --db.read.concurrency semaphore (a minimal sketch of this pattern follows below).
- Network-based transactions (remotedb) are slower, so you reach the limit faster.
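A minimal sketch of that limiting pattern, with hypothetical names rather than Erigon's actual BeginRo code: a weighted semaphore caps how many read transactions, and therefore how many OS threads blocked in cgo IO, can be in flight at once.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"

	"golang.org/x/sync/semaphore"
)

// roTxLimiter caps how many read-only transactions may be open at once, so
// that at most N OS threads can be parked inside the embedded C database
// doing blocking IO. N plays the role of --db.read.concurrency above.
var roTxLimiter = semaphore.NewWeighted(int64(runtime.GOMAXPROCS(-1)))

// withReadTx runs fn inside a rate-limited read view. The deferred Release
// guarantees every successful Acquire is paired with exactly one Release,
// even if fn returns an error or panics.
func withReadTx(ctx context.Context, fn func() error) error {
	if err := roTxLimiter.Acquire(ctx, 1); err != nil {
		return err // context cancelled or deadline exceeded while waiting
	}
	defer roTxLimiter.Release(1)

	// ... a real implementation would open the DB read transaction here ...
	return fn()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			_ = withReadTx(context.Background(), func() error {
				fmt.Println("read tx", n) // stands in for actual DB work
				return nil
			})
		}(i)
	}
	wg.Wait()
}
```

If any code path returns after Acquire without the matching Release, the pool of permits shrinks permanently and later callers block forever in Acquire, which is exactly the symptom in the stack trace above.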
--datadir does not solve this for local erigon + local rpcdaemon; under high load we see what appears to be a lockup on the rpcdaemon every <20 mins.
I think I now have something here - will push some code this afternoon so we can try to prove it:
- roTxsLimiter is a semaphore which defaults to runtime.GOMAXPROCS(-1), which for most machines will be c. 8 (correct me if wrong!)
- the db.read.concurrency flag sets this roTxsLimiter, and the repros where issues are seen pass this value as c. 200-10k (as seen on GH)
Given point 2 in @AskAlexSharov's comment above, it seems plausible this value is just set too high?
I guess in the meantime, if there's a chance to try again: remove the db.read.concurrency flag altogether, let it default to GOMAXPROCS(-1), and see if the repro produces the same exhaustion/locking? (Rough sketch of that sizing below.)
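As a rough illustration of the sizing being discussed (hypothetical flag handling and fallback, not Erigon's actual code): the limiter's capacity either comes from the flag or falls back to a GOMAXPROCS-derived default, so a value in the hundreds or thousands means that many readers can pile into the database at once.

```go
// Hypothetical sketch of how a --db.read.concurrency style flag could size
// the read-transaction limiter; the names and the fallback are assumptions.
package main

import (
	"flag"
	"fmt"
	"runtime"

	"golang.org/x/sync/semaphore"
)

func main() {
	readConcurrency := flag.Int64("db.read.concurrency", 0, "max concurrent read transactions (0 = default)")
	flag.Parse()

	limit := *readConcurrency
	if limit <= 0 {
		// Fall back to a CPU-derived default, as described in the comments above.
		limit = int64(runtime.GOMAXPROCS(-1))
	}

	limiter := semaphore.NewWeighted(limit)
	fmt.Printf("read-transaction limit: %d (limiter: %T)\n", limit, limiter)
}
```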
Could this be due to the fact that all Go runtime threads are excluded? Then there would be no free threads for goroutines to process the results from the database. I can try to test it with a lower db.read.concurrency. But this hypothesis is refuted by the fact that there are still some working goroutines which print GoodPeers?
Yeah, good point on the GoodPeers. I think I need to instrument some code so we can see more clearly what's going on here. I do think the first part of what you said 'could' be correct; for me to go prove!
Tried to run with the default value (max(10, GOMAXPROCS*2)), with db.read.concurrency=2 and with db.read.concurrency=1 -- it still gets stuck.
@kaikash you seem to be able to replicate quickly - do you have an example of the calls you are making?
@revitteth here is a test script https://github.com/kaikash/eth-call-test
It behaves very unpredictably: sometimes it gets stuck, sometimes it doesn't. Try increasing the connection pool size (for me it gets stuck with 10 connections and 10k requests). Maybe this script is not the best, because it makes the same requests and erigon caches them?
@kaikash sorry for delay, still working on this one!
So... we had a thought about this, having just run into a similar issue in some code we are building around the same concept. We discovered that if the application is receiving a huge amount of requests asynchronously, it never has time for GC (GC is generally a blocking activity, and blocking activities are deferred while a thread is in use), so what may be happening to Erigon is simply that a thread's lifetime is essentially 'forever' while it is in constant use, and no GC can occur. I can't say for sure from the code, but perhaps setting a max lifetime on the threads may help release memory and avoid this deadlock.
Thanks for the input! I'm balancing this one with a few other issues so the insight is super appreciated. Will be picking this up next week for sure.
I removed rpcDaemon and am running in single-process mode. It still gets stuck. FWIW, I am running a BSC node on erigon:alpha.
Thanks, useful info, currently trying to finish up on another bug and should get to this soon.
Now it's not getting stuck, but it's hitting an IOPS issue at 50K; I am getting throughput of 1000 RPS. I now suspect the issue was with rpcDaemon as a separate process; once I moved to single-process mode, my main issue is IOPS.
I can reliably fix the problem by running in single-process mode on an Ethereum node too (before this I was running rpcDaemon as well).
@AskAlexSharov I've opened a PR over in erigon-lib which looks promising for helping here https://github.com/ledgerwatch/erigon-lib/pull/639
No longer able to repro after the above fix! Will close, and we can reopen if there are any further problems.