[BUG] KeyDB deadlock

Open swdev128 opened this issue 1 year ago • 1 comments

Describe the bug

I'm running two instances of KeyDB in replication. Each of them occasionally enters a total deadlock, after which neither the application I'm developing nor the keydb-cli binary can connect to keydb-server.

GDB attached to keydb-server shows that all threads are waiting on each other: on futexes in readWriteLock and on the mutex AsyncWorkQueue::m_mutex.

My observations from the GDB investigation:

- bgsaveCommand is attempting to acquire the global WRITE lock with aeAcquireForkLock (g_forkLock::m_readCount is typically 1-3 at this point, and new global READ locks are blocked)
- AsyncWorkQueue::WorkerThreadMain (1) is stuck trying to acquire the global READ lock with aeProcessOnline while holding AsyncWorkQueue::m_mutex
- AsyncWorkQueue::WorkerThreadMain (2) holds the global READ lock after calling aeProcessOnline and is stuck trying to lock AsyncWorkQueue::m_mutex (held by (1))
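To make the circular wait explicit, here is a minimal sketch that models the three threads above as a wait-for graph and searches it for a cycle. It is only a model of the observations, not KeyDB code: the node names (`bgsave`, `worker1`, `worker2`) and the helper `find_cycle` are made up for illustration, and the `worker1 -> bgsave` edge assumes the lock gives a pending writer priority over new readers, which is what the blocked READ lock suggests.

```python
# Wait-for graph: an edge A -> B means "A is blocked until B releases something".
# Node names are illustrative labels for the threads seen in GDB.
WAITS_FOR = {
    # bgsaveCommand wants the global WRITE lock; worker2 still holds a READ lock.
    "bgsave": ["worker2"],
    # worker1 wants a READ lock but is queued behind the pending writer
    # (assumption: the read-write lock is writer-preferring).
    "worker1": ["bgsave"],
    # worker2 wants AsyncWorkQueue::m_mutex, which worker1 holds.
    "worker2": ["worker1"],
}


def find_cycle(graph):
    """Return the nodes of one cycle (first node repeated at the end), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so the path closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in graph:
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


cycle = find_cycle(WAITS_FOR)
print(" -> ".join(cycle) if cycle else "no cycle")
```

In this model, removing any one edge breaks the cycle. For example, if worker1 did not hold m_mutex while waiting for the READ lock, the `worker2 -> worker1` edge would disappear and `find_cycle` would return `None`. That is a property of the model only, not a tested fix.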

To reproduce

Run two KeyDB instances in replication mode. The binary was compiled from the "RELEASE_6_3_4" branch on GitHub.

Expected behavior

No deadlocks while running under low to moderate load.

Additional information

After the deadlock, the CPU usage reported by 'top' is 0% and the CPU time of the process does not change. While KeyDB is working, the transaction load is constant (about 500 TPS) with about 300 keys in the DB; the DB size is about 2 MB. KeyDB runs in a Docker image (managed by Kubernetes) with up to 4 GB of RAM and 3 CPUs. The 'top' command shows that these resources are more than sufficient.

I tried enabling, disabling, and reconfiguring features to narrow down the root cause and the scenario where the deadlock shows up most frequently, without much luck. I tried turning background save off and on, switching to AOF, and tweaking these server settings: repl-ping-replica-period, repl-backlog-size, repl-timeout, server-threads, min-clients-per-thread, active-client-balancing, timeout. None of these changes fixed the issue.

Please advise on the possible root cause, a workaround, or the best code fix.

swdev128, Nov 29 '24 15:11