
Deadlock in BeginRo

kaikash opened this issue 2 years ago · 19 comments

System information

Erigon version: 2022.99.99-dev-1303023c

OS & Version: Ubuntu 20.04

Commit hash: 1303023cb8949050250384a2713c6f65896dd27e

Expected behaviour

No deadlock

Actual behaviour

Deadlock :( It gets stuck in the semaphore.Acquire call.

Steps to reproduce the behaviour

Send about 5k eth_call requests simultaneously via 2 websocket connections.
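
For reference, a rough repro sketch along these lines might look like the following. This is not the reporter's actual script (that is linked later in the thread); it assumes a node serving websocket JSON-RPC at ws://localhost:8545, uses github.com/gorilla/websocket, and the eth_call target and data are placeholders.

```go
// Rough repro sketch: flood the node with eth_call requests over a couple
// of websocket connections. Endpoint, target address, and call data are
// placeholders, not taken from the original report.
package main

import (
	"log"
	"sync"

	"github.com/gorilla/websocket"
)

type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int           `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

func main() {
	const conns, perConn = 2, 2500 // ~5k requests spread over 2 connections
	var wg sync.WaitGroup
	for c := 0; c < conns; c++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			ws, _, err := websocket.DefaultDialer.Dial("ws://localhost:8545", nil)
			if err != nil {
				log.Printf("conn %d: dial: %v", id, err)
				return
			}
			defer ws.Close()
			// One goroutine floods requests while this one drains responses,
			// so many calls are in flight at once on each connection.
			go func() {
				call := map[string]string{
					"to":   "0x0000000000000000000000000000000000000000",
					"data": "0x",
				}
				for i := 0; i < perConn; i++ {
					req := rpcRequest{JSONRPC: "2.0", ID: i, Method: "eth_call",
						Params: []interface{}{call, "latest"}}
					if err := ws.WriteJSON(req); err != nil {
						log.Printf("conn %d: write: %v", id, err)
						return
					}
				}
			}()
			for i := 0; i < perConn; i++ {
				var resp map[string]interface{}
				if err := ws.ReadJSON(&resp); err != nil {
					log.Printf("conn %d: read: %v", id, err)
					return
				}
			}
			log.Printf("conn %d: received %d responses", id, perConn)
		}(c)
	}
	wg.Wait()
}
```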

Logs

pkill -SIGUSR1 erigon

goroutine 56608 [select, 10 minutes]:
golang.org/x/sync/semaphore.(*Weighted).Acquire(0xc0001500f0, {0x200d3b0, 0xc0cb6bbe90}, 0x1)
	golang.org/x/[email protected]/semaphore/semaphore.go:60 +0x345
github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc0002791e0, {0x200d3b0?, 0xc0cb6bbe90?})
	github.com/ledgerwatch/[email protected]/kv/mdbx/kv_mdbx.go:425 +0x85
github.com/ledgerwatch/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc0a8768800, {0x2017408, 0xc0cb6a1390})
	github.com/ledgerwatch/[email protected]/kv/remotedbserver/server.go:84 +0x77
github.com/ledgerwatch/erigon-lib/gointerfaces/remote._KV_Tx_Handler({0x181dda0?, 0xc0a8768800}, {0x20128b8?, 0xc0cb6c24e0})
	github.com/ledgerwatch/[email protected]/gointerfaces/remote/kv_grpc.pb.go:195 +0x9f
github.com/grpc-ecosystem/go-grpc-middleware/recovery.StreamServerInterceptor.func1({0x181dda0?, 0xc0a8768800?}, {0x20128b8?, 0xc0cb6c24e0?}, 0x77ce0c149e98?, 0xc0cb69d4e8?)
	github.com/grpc-ecosystem/[email protected]/recovery/interceptors.go:51 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1({0x181dda0?, 0xc0a8768800?}, {0x20128b8?, 0xc0cb6c24e0?})
	github.com/grpc-ecosystem/[email protected]/chain.go:49 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1({0x181dda0, 0xc0a8768800}, {0x20128b8, 0xc0cb6c24e0}, 0x1689740?, 0xc0cb6a1380?)
	github.com/grpc-ecosystem/[email protected]/chain.go:58 +0xbf
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000000780, {0x2018460, 0xc0a89a1040}, 0xc0cb6beb40, 0xc0b2165f20, 0x2b4e0e0, 0x0)
	google.golang.org/[email protected]/server.go:1565 +0xf26
google.golang.org/grpc.(*Server).handleStream(0xc000000780, {0x2018460, 0xc0a89a1040}, 0xc0cb6beb40, 0x0)
	google.golang.org/[email protected]/server.go:1640 +0x9d6
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	google.golang.org/[email protected]/server.go:932 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	google.golang.org/[email protected]/server.go:930 +0x28a

pkill.log

kaikash · Aug 20 '22 16:08

Same if you increase --db.read.concurrency?

AskAlexSharov · Aug 21 '22 03:08

Tried to increase from 32 to 200, didn't help. Can you reproduce it on your side?

kaikash · Aug 21 '22 09:08

Tried to run with --db.read.concurrency=5000; it seems to be working. Is it OK to run the node with this setting?

But this is still not quite correct behavior: the program shouldn't deadlock during a burst of load.

kaikash · Aug 21 '22 10:08

Eventually it got stuck :( even with this setting. The first time I send ~5k requests it responds successfully, then it gets stuck the second time I send them. So I think it doesn't release the semaphore the same number of times it acquires it, does it?

kaikash · Aug 21 '22 10:08

Do you run a separate RPCDaemon? If yes: with --datadir? If not, try adding --datadir.

AskAlexSharov · Aug 21 '22 14:08

"Can you reproduce it on your side?" - I didn't try yet

AskAlexSharov · Aug 21 '22 14:08

Yes, I'm running a separate RPCDaemon; they are on the same machine, but in different containers.

Tried to run with --datadir and it seems to be working, even with db.read.concurrency=32.

So it has to be a deadlock when the transaction is created via remotedb?

kaikash · Aug 21 '22 16:08

OK, then it's a somewhat known issue with the remotedb gRPC server architecture. I advise you to work with --datadir for now; I don't know when we will be able to fix it.

AskAlexSharov · Aug 22 '22 06:08

I see, thanks. What is known about this bug at the moment? Maybe a little more context, so I can try to check this out myself?

kaikash · Aug 22 '22 09:08

What we know:

  1. Unclear whether it is really a deadlock, or whether the --db.read.concurrency limit has just been reached and all subsequent readers are simply waiting for the previous ones.
  2. It seems Go is not very friendly to embedded C databases: when IO happens inside C code, the Go runtime sees the whole thread as locked (it can't schedule any other goroutine onto that thread) and excludes the current thread from its thread pool. As a result, with many parallel read transactions (which do some IO) you end up with tons of threads (which is in general bad for performance), eventually hit Go's limit of 10K threads, and the app crashes. This is the reason we have the --db.read.concurrency semaphore (sketched below).
  3. Network-based transactions (remotedb) are slower, so you reach the limit faster.
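
For illustration, a minimal sketch of the read-concurrency gate described in point 2, assuming golang.org/x/sync/semaphore; readTxLimiter and beginRo are illustrative names, not Erigon's actual identifiers.

```go
// Minimal sketch of a read-concurrency gate in front of read transactions.
// Names are illustrative, not Erigon's real code.
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

// readTxLimiter caps how many read transactions may be open at once,
// analogous to the --db.read.concurrency semaphore.
var readTxLimiter = semaphore.NewWeighted(32)

// beginRo blocks until a slot is free. The caller must invoke the returned
// release func when the transaction finishes; if a slot is acquired without
// ever being released, every later caller blocks in Acquire forever, which
// matches the symptom in the stack trace above.
func beginRo(ctx context.Context) (release func(), err error) {
	if err := readTxLimiter.Acquire(ctx, 1); err != nil {
		return nil, err
	}
	return func() { readTxLimiter.Release(1) }, nil
}

func main() {
	release, err := beginRo(context.Background())
	if err != nil {
		fmt.Println("acquire failed:", err)
		return
	}
	defer release() // leaking this slot is how the pool exhausts
	fmt.Println("read tx slot acquired")
}
```

The key invariant is that every Acquire(1) is paired with exactly one Release(1); a leaked read transaction permanently consumes a slot.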

AskAlexSharov · Aug 22 '22 11:08

--datadir does not solve this for local erigon + local rpcdaemon; under high load we see what appears to be a lockup on the rpcdaemon every <20 mins.

PeaStew · Aug 22 '22 13:08

I think I now have something here - will push some code this afternoon so we can try to prove it:

  • roTxsLimiter is a semaphore which defaults to runtime.GOMAXPROCS(-1), which for most machines will be around 8 (correct me if wrong!)
  • the db.read.concurrency flag sets this roTxsLimiter, and the repros where issues are seen pass values of around 200-10k (as seen on GitHub)

Given point 2 in @AskAlexSharov's comment above, it seems plausible this value is just set too high?
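
For illustration, a hedged sketch of that default/override wiring; the flag name matches --db.read.concurrency, but the function and variable names here are invented, not taken from erigon-lib.

```go
// Sketch of how the limiter size might be chosen: default to the number of
// usable CPU threads, let a flag override it. Names are illustrative.
package main

import (
	"flag"
	"fmt"
	"runtime"

	"golang.org/x/sync/semaphore"
)

var readConcurrency = flag.Int("db.read.concurrency",
	runtime.GOMAXPROCS(-1), "max number of concurrent read transactions")

// newRoTxsLimiter builds the semaphore that every BeginRo would Acquire(1)
// on and every Rollback/Commit would Release(1) on. Sizing it far above the
// usable thread count mostly pushes the pressure into the OS thread pool
// (see point 2 in the comment above).
func newRoTxsLimiter() *semaphore.Weighted {
	return semaphore.NewWeighted(int64(*readConcurrency))
}

func main() {
	flag.Parse()
	limiter := newRoTxsLimiter()
	fmt.Printf("read limiter sized to %d\n", *readConcurrency)
	_ = limiter
}
```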

revitteth · Aug 22 '22 13:08

I guess in the meantime, if there's a chance to try again: remove the db.read.concurrency flag altogether, let it default to GOMAXPROCS(-1), and see if the repro produces the same exhaustion/locking?

revitteth · Aug 22 '22 14:08

Could this be due to the fact that all Go runtime threads are excluded, so there are no free threads left for goroutines to process the results from the database? I can try to test it with a lower db.read.concurrency.

But this hypothesis is refuted by the fact that there are still some working goroutines which print GoodPeers?

kaikash · Aug 22 '22 14:08

Yeah, good point on the GoodPeers. I think I need to instrument some code so we can see more clearly what's going on here. I do think the first part of what you said 'could' be correct, and it's on me to go prove it!

revitteth · Aug 22 '22 14:08

Tried to run with the default value (max(10, GOMAXPROCS*2)), db.read.concurrency=2, and db.read.concurrency=1; it still gets stuck.

kaikash · Aug 22 '22 14:08

@kaikash you seem to be able to replicate quickly - do you have an example of the calls you are making?

revitteth · Aug 22 '22 17:08

@revitteth here is a test script https://github.com/kaikash/eth-call-test

kaikash · Aug 22 '22 22:08

It behaves very unpredictably: sometimes it gets stuck, sometimes it doesn't. Try increasing the connection pool size (for me it gets stuck with 10 connections and 10k requests). Maybe this script is not the best because it makes the same requests and erigon caches them?

kaikash · Aug 22 '22 22:08

@kaikash sorry for delay, still working on this one!

revitteth · Aug 25 '22 16:08

So... we had a thought about this, having just run into a similar issue in some code we are building around the same concept. We discovered that if the application is receiving a huge number of requests asynchronously, it never actually has time for GC (as GC is generally a blocking activity, and blocking activities are deferred while a thread is in use). So what may be happening to Erigon too is simply that the thread lifetime is essentially 'forever' while it is in constant use, and no GC can occur. I can't say for sure from the code, but perhaps setting a max lifetime on the threads may help release memory and avoid this deadlock.

PeaStew · Aug 26 '22 22:08

Thanks for the input! I'm balancing this one with a few other issues so the insight is super appreciated. Will be picking this up next week for sure.

revitteth · Aug 26 '22 22:08

I removed rpcDaemon and am running in single-process mode. It still gets stuck. FWIW, I am running a BSC node on erigon:alpha.

zimbabao · Sep 10 '22 17:09

Thanks, useful info, currently trying to finish up on another bug and should get to this soon.

revitteth · Sep 10 '22 18:09

Now it's not getting stuck, but it's hitting an IOPS limit of 50K and I am getting a throughput of 1000 RPS. I now suspect the issue is with rpcDaemon as a separate process; since I moved to a single process, my main issue is IOPS.

I can reliably fix the problem by running in single-process mode on the ethereum node too (before this I was running rpcDaemon as well).

zimbabao · Sep 10 '22 22:09

@AskAlexSharov I've opened a PR over in erigon-lib which looks promising for helping here: https://github.com/ledgerwatch/erigon-lib/pull/639

hexoscott · Sep 14 '22 12:09

No longer able to repro after the above fix! Will close, and we can reopen if there are any further problems.

revitteth · Sep 19 '22 04:09