sofa-jraft

Deadlock on configuration application in NodeImpl when disruptors are full

Open alievmirza opened this issue 2 months ago • 9 comments

Describe the bug

There is a deadlock in NodeImpl involving NodeImpl#writeLock that occurs when both LogManagerImpl#diskQueue and FSMCallerImpl#taskQueue are full.

  1. NodeImpl#executeApplyingTasks() takes NodeImpl.writeLock and calls LogManager.appendEntries()
  2. LogManager tries to enqueue a task to diskQueue which is full, hence it blocks until a task gets consumed from diskQueue
  3. diskQueue is consumed by StableClosureEventHandler
  4. StableClosureEventHandler tries to enqueue a task to FSMCallerImpl#taskQueue, which is also full, so this also blocks until a task gets consumed from FSMCallerImpl#taskQueue
  5. FSMCallerImpl#taskQueue is consumed by ApplyTaskHandler
  6. ApplyTaskHandler calls NodeImpl#onConfigurationChangeDone(), which tries to take NodeImpl#writeLock

As a result, there is a deadlock caused by a circular wait: NodeImpl#writeLock -> LogManagerImpl#diskQueue -> FSMCallerImpl#taskQueue -> NodeImpl#writeLock. Disruptors are used as blocking queues in JRaft, so a full disruptor behaves like a held lock.
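The cycle above can be modeled outside JRaft with plain JDK primitives. The following is only a minimal sketch, not JRaft code: `ArrayBlockingQueue` stands in for the disruptor ring buffers, `ReentrantLock` stands in for NodeImpl#writeLock, and the class name, thread names, and timeout are hypothetical. Timeouts are used so the sketch terminates; the real code paths block without a timeout, which is what turns this circular wait into a deadlock.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the reported wait cycle; none of these names come from JRaft.
class DeadlockCycleSketch {

    /** Returns {diskEnqueued, taskEnqueued, lockAcquired}. All false means every step was stuck. */
    static boolean[] run(long timeoutMs) {
        ReentrantLock writeLock = new ReentrantLock();                 // stand-in for NodeImpl#writeLock
        BlockingQueue<String> diskQueue = new ArrayBlockingQueue<>(2); // stand-in for diskQueue
        BlockingQueue<String> taskQueue = new ArrayBlockingQueue<>(2); // stand-in for taskQueue

        // Both queues start full, as in the overload test with 2-slot disruptors.
        for (int i = 0; i < 2; i++) {
            diskQueue.add("entry-" + i);
            taskQueue.add("task-" + i);
        }

        AtomicBoolean taskEnqueued = new AtomicBoolean();
        AtomicBoolean lockAcquired = new AtomicBoolean();

        // Steps 3-4: the diskQueue consumer must publish to taskQueue before it can drain more.
        Thread diskConsumer = new Thread(() -> {
            try {
                taskEnqueued.set(taskQueue.offer("committed", timeoutMs, TimeUnit.MILLISECONDS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "disk-consumer");

        // Steps 5-6: the taskQueue consumer needs the write lock before it can drain more.
        Thread applyConsumer = new Thread(() -> {
            try {
                if (writeLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    lockAcquired.set(true);
                    writeLock.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "apply-consumer");

        boolean diskEnqueued = false;
        // Steps 1-2: the caller holds the write lock while publishing to diskQueue.
        writeLock.lock();
        try {
            diskConsumer.start();
            applyConsumer.start();
            diskEnqueued = diskQueue.offer("new-entry", timeoutMs, TimeUnit.MILLISECONDS);
            // The lock stays held until both consumers give up, mirroring an enqueue that never returns.
            diskConsumer.join();
            applyConsumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            writeLock.unlock();
        }
        return new boolean[] { diskEnqueued, taskEnqueued.get(), lockAcquired.get() };
    }

    public static void main(String[] args) {
        boolean[] r = run(200);
        System.out.println("diskQueue enqueue succeeded: " + r[0]);
        System.out.println("taskQueue enqueue succeeded: " + r[1]);
        System.out.println("writeLock acquired by consumer: " + r[2]);
    }
}
```

In this model none of the three operations can succeed, because each one waits on a resource that only the next participant in the cycle could free. Any fix has to break one edge of the cycle, for example by not holding the write lock across a blocking enqueue; which edge is appropriate for JRaft is left open here.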

This was caught by com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload, which uses extremely small disruptors (2 items max each).

Steps to reproduce

Run com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload repeatedly in a loop. On my local machine the deadlock reproduces within 50-100 runs.

Environment

  • SOFAJRaft version: v1.3.14 (latest commit 890033a64d8ed5c8838463f278b940355553e413)
  • JVM version (e.g. java -version): openjdk version "11.0.23"
  • OS version (e.g. uname -a): macOS 14.5
  • Maven version: 3.9.6
  • IDE version: IntelliJ IDEA 2024.1 (Community Edition)

alievmirza, May 20 '24 06:05