Leader message sequence out-of-sync when opening queue to cluster with 92.8 and 93.15 mix
Is there an existing issue for this?
- [x] I have searched the existing issues
Current Behavior
I observe this alarm:
04APR2025_21:00:31.928 (6181072896) ERROR mqbblp_clusterstatemanager.cpp:1575 ALARM [CLUSTER_STATE] Cluster (c2x2): got queueAssignmentAdvisory: [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 9 ] queues = [ [ uri = "bmq://bmq.test.mmap.fanout/q1" key = [ CD71ED7DB2 ] partitionId = 0 appIds = [ ] ] ] ] from current leader: [east2, 2], with smaller leader message sequence: [ electorTerm = 1 sequenceNumber = 9 ]. Current value: [ electorTerm = 1 sequenceNumber = 10 ]. Ignoring this advisory.
The producer also hangs.
Expected Behavior
There should be no alarm, and the producer should not hang.
Steps To Reproduce
- Prepare a 4-node cluster in non-FSM mode. Three nodes are on 92.8, and one node on 93.15.
- Start the three 92.8 nodes. One of them becomes leader.
- Start the 93.15 node.
- Start a producer, connect it to the 93.15 node, and open a fanout queue.
- You should see the alarm.
BlazingMQ Version
0.92.8, 0.93.15
Anything else?
- @dorjesinpo is the original reporter of this issue.
- The producer must connect to the 93.15 node to trigger the alarm. If the producer connects through a proxy, a 92.8 replica, or the 92.8 primary, there is no alarm.
- This behavior does not occur in FSM mode.
Investigation Report
Let QAA = QueueAssignmentAdvisory, CSL = cluster state ledger, and LSN = leader sequence number (the "leader message sequence" in the alarm text).
The issue arises because, during mqbc::ClusterUtil::assignQueue, we apply the QAA to the CSL and then broadcast the QAA in the legacy way. Say the QAA has LSN (1,11). If the producer connects directly to the 93.15 node, that node delays processing of the legacy QAA. The QAA sent through the CSL gets processed first, and the leader then applies a commit for that QAA in the CSL with LSN (1,12). When the 93.15 node finally processes the legacy QAA, it sees that (1,11) is less than (1,12), which triggers the alarm above.
Supporting logs from the 93.15 node when I tested manually:
04APR2025_21:50:41.593 (6172176384) INFO mqbc_incoreclusterstateledger.cpp:992 IncoreClusterStateLedger (cluster: c2x2): Applying cluster message with type = UPDATE and seqNum = [ electorTerm = 1 sequenceNumber = 11 ] from '[east2, 2]': [ choice = [ queueAssignmentAdvisory = [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 11 ] queues = [ [ uri = "bmq://bmq.test.mmap.fanout/q1" key = [ CB286FC084 ] partitionId = 0 appIds = [ ] ] ] ] ] ]
04APR2025_21:50:41.593 (6172176384) INFO mqbc_electorinfo.h:339 Setting elector's leader sequence number to [ electorTerm = 1 sequenceNumber = 11 ]
04APR2025_21:50:41.593 (6172176384) INFO mqbc_incoreclusterstateledger.cpp:576 Sent ack '[ choice = [ leaderAdvisoryAck = [ sequenceNumberAcked = [ electorTerm = 1 sequenceNumber = 11 ] ] ] ]' back to leader node [east2, 2]
04APR2025_21:50:41.594 (6172176384) INFO mqbc_incoreclusterstateledger.cpp:831 Applying cluster state record event from node '[east2, 2]'
04APR2025_21:50:41.594 (6172176384) INFO mqbc_incoreclusterstateledger.cpp:992 IncoreClusterStateLedger (cluster: c2x2): Applying cluster message with type = COMMIT and seqNum = [ electorTerm = 1 sequenceNumber = 12 ] from '[east2, 2]': [ choice = [ leaderAdvisoryCommit = [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 12 ] sequenceNumberCommitted = [ electorTerm = 1 sequenceNumber = 11 ] ] ] ]
04APR2025_21:50:41.594 (6172176384) INFO mqbc_electorinfo.h:339 Setting elector's leader sequence number to [ electorTerm = 1 sequenceNumber = 12 ]
04APR2025_21:50:41.595 (6172176384) ERROR mqbblp_clusterstatemanager.cpp:1575 ALARM [CLUSTER_STATE] Cluster (c2x2): got queueAssignmentAdvisory: [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 11 ] queues = [ [ uri = "bmq://bmq.test.mmap.fanout/q1" key = [ CB286FC084 ] partitionId = 0 appIds = [ ] ] ] ] from current leader: [east2, 2], with smaller leader message sequence: [ electorTerm = 1 sequenceNumber = 11 ]. Current value: [ electorTerm = 1 sequenceNumber = 12 ]. Ignoring this advisory.
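The ordering in these logs can be reproduced with a small toy model. This is only a sketch for illustration, not BlazingMQ code: `FollowerModel`, `apply_csl_record`, and `process_legacy_qaa` are hypothetical names, and the strict "smaller than" comparison is an assumption based on the wording of the alarm.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

LSN = Tuple[int, int]  # (electorTerm, sequenceNumber)


@dataclass
class FollowerModel:
    """Hypothetical toy model of a node tracking the leader sequence number."""

    current_lsn: LSN = (0, 0)
    events: List[str] = field(default_factory=list)

    def apply_csl_record(self, kind: str, lsn: LSN) -> None:
        # Each CSL record (UPDATE or COMMIT) moves the tracked LSN forward.
        self.current_lsn = lsn
        self.events.append(f"CSL {kind} {lsn} applied")

    def process_legacy_qaa(self, lsn: LSN) -> bool:
        # Tuples compare by elector term first, then by sequence number.
        # Assumption: an advisory strictly smaller than the tracked LSN is
        # rejected, matching the wording of the alarm.
        if lsn < self.current_lsn:
            self.events.append(
                f"ALARM: legacy QAA {lsn} < current {self.current_lsn}, ignored"
            )
            return False
        self.current_lsn = lsn
        self.events.append(f"legacy QAA {lsn} accepted")
        return True


# Replay the order seen on the 93.15 node: CSL UPDATE, CSL COMMIT,
# and only then the delayed legacy QAA carrying the older LSN.
node = FollowerModel()
node.apply_csl_record("UPDATE", (1, 11))
node.apply_csl_record("COMMIT", (1, 12))
accepted = node.process_legacy_qaa((1, 11))

for event in node.events:
    print(event)
```

In this model the legacy QAA is rejected only because the COMMIT record has already advanced the tracked LSN past it. If the legacy QAA were processed before any CSL record, it would be accepted.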
@emelialei88's https://github.com/bloomberg/blazingmq/pull/584 automatically fixes the issue. Mystery solved.
@emelialei88's https://github.com/bloomberg/blazingmq/pull/584 has been merged