Leader message sequence out-of-sync when opening queue to cluster with 92.8 and 93.15 mix

Open kaikulimu opened this issue 10 months ago • 1 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

I observe this alarm:

04APR2025_21:00:31.928 (6181072896) ERROR mqbblp_clusterstatemanager.cpp:1575 ALARM [CLUSTER_STATE] Cluster (c2x2): got queueAssignmentAdvisory: [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 9 ] queues = [ [ uri = "bmq://bmq.test.mmap.fanout/q1" key = [ CD71ED7DB2 ] partitionId = 0 appIds = [ ] ] ] ] from current leader: [east2, 2], with smaller leader message sequence: [ electorTerm = 1 sequenceNumber = 9 ]. Current value: [ electorTerm = 1 sequenceNumber = 10 ]. Ignoring this advisory.

The producer then hangs.

Expected Behavior

No alarm should be raised, and the producer should not hang.

Steps To Reproduce

  1. Prepare a 4-node cluster in non-FSM mode. Three nodes are on 92.8, and one node on 93.15.
  2. Start the three 92.8 nodes. One of them becomes leader.
  3. Start the 93.15 node.
  4. Start a producer, connect it to the 93.15 node, and open a fanout queue.
  5. You should see the alarm.

BlazingMQ Version

0.92.8, 0.93.15

Anything else?

  • @dorjesinpo is the original reporter of this issue.
  • The producer must connect to the 93.15 node to trigger the alarm. If the producer connects through a proxy, a 92.8 replica, or the 92.8 primary, there is no alarm.
  • This behavior does not occur in FSM mode.

Apr 04 '25 21:04 kaikulimu

Investigation Report

Let QAA == QueueAssignmentAdvisory; LSN == leader sequence number == leader message sequence; CSL == cluster state ledger

The root cause is that during mqbc::ClusterUtil::assignQueue, the leader applies the QAA to the CSL and then also broadcasts the same QAA in the legacy way. Say the QAA has LSN (1,11). If the producer connects directly to the 93.15 node, that node's processing of the legacy QAA is delayed. The QAA sent through the CSL gets processed first, and the leader happily follows it with a commit record in the CSL at LSN (1,12). When the 93.15 node then tries to process the legacy QAA, it sees that (1,11) is less than (1,12) and raises the alarm above.

Supporting evidence logs from the 93.15 when I tested manually:

04APR2025_21:50:41.593 (6172176384) INFO mqbc_incoreclusterstateledger.cpp:992 IncoreClusterStateLedger (cluster: c2x2): Applying cluster message with type = UPDATE and seqNum = [ electorTerm = 1 sequenceNumber = 11 ] from '[east2, 2]': [ choice = [ queueAssignmentAdvisory = [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 11 ] queues = [ [ uri = "bmq://bmq.test.mmap.fanout/q1" key = [ CB286FC084 ] partitionId = 0 appIds = [ ] ] ] ] ] ]
04APR2025_21:50:41.593 (6172176384) INFO mqbc_electorinfo.h:339 Setting elector's leader sequence number to [ electorTerm = 1 sequenceNumber = 11 ]
04APR2025_21:50:41.593 (6172176384) INFO mqbc_incoreclusterstateledger.cpp:576 Sent ack '[ choice = [ leaderAdvisoryAck = [ sequenceNumberAcked = [ electorTerm = 1 sequenceNumber = 11 ] ] ] ]' back to leader node [east2, 2]
04APR2025_21:50:41.594 (6172176384) INFO mqbc_incoreclusterstateledger.cpp:831 Applying cluster state record event from node '[east2, 2]'
04APR2025_21:50:41.594 (6172176384) INFO mqbc_incoreclusterstateledger.cpp:992 IncoreClusterStateLedger (cluster: c2x2): Applying cluster message with type = COMMIT and seqNum = [ electorTerm = 1 sequenceNumber = 12 ] from '[east2, 2]': [ choice = [ leaderAdvisoryCommit = [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 12 ] sequenceNumberCommitted = [ electorTerm = 1 sequenceNumber = 11 ] ] ] ]
04APR2025_21:50:41.594 (6172176384) INFO mqbc_electorinfo.h:339 Setting elector's leader sequence number to [ electorTerm = 1 sequenceNumber = 12 ]
04APR2025_21:50:41.595 (6172176384) ERROR mqbblp_clusterstatemanager.cpp:1575 ALARM [CLUSTER_STATE] Cluster (c2x2): got queueAssignmentAdvisory: [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 11 ] queues = [ [ uri = "bmq://bmq.test.mmap.fanout/q1" key = [ CB286FC084 ] partitionId = 0 appIds = [ ] ] ] ] from current leader: [east2, 2], with smaller leader message sequence: [ electorTerm = 1 sequenceNumber = 11 ]. Current value: [ electorTerm = 1 sequenceNumber = 12 ]. Ignoring this advisory.
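
To make the ordering in these logs concrete, here is a minimal sketch of the race. It is not BlazingMQ code: `Lsn`, `NodeState`, `applyCslRecord` and `acceptLegacyAdvisory` are hypothetical names, and the strict "smaller than" check is an assumption based only on the wording of the alarm.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for a leader message sequence (LSN):
// the pair (electorTerm, sequenceNumber) printed in the logs.
struct Lsn {
    std::uint64_t electorTerm;
    std::uint64_t sequenceNumber;
};

// Order LSNs lexicographically: by elector term first, then sequence number.
inline bool operator<(const Lsn& a, const Lsn& b)
{
    return std::tie(a.electorTerm, a.sequenceNumber) <
           std::tie(b.electorTerm, b.sequenceNumber);
}

// Hypothetical model of the LSN bookkeeping on the receiving node.
class NodeState {
    Lsn d_current{0, 0};

  public:
    // CSL path: every applied record (UPDATE or COMMIT) advances the LSN.
    void applyCslRecord(const Lsn& lsn)
    {
        if (d_current < lsn) {
            d_current = lsn;
        }
    }

    // Legacy path: an advisory whose LSN is smaller than the current value
    // is ignored (assumed check, mirroring the alarm text).
    bool acceptLegacyAdvisory(const Lsn& lsn)
    {
        if (lsn < d_current) {
            return false;  // stale: this is where the alarm would fire
        }
        d_current = lsn;
        return true;
    }

    const Lsn& current() const { return d_current; }
};
```

Replaying the log order in this model, applying the CSL UPDATE at (1,11) and the CSL COMMIT at (1,12) before calling `acceptLegacyAdvisory` with (1,11) yields a rejection. If the legacy advisory were handled before the commit, it would be accepted.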

@emelialei88's https://github.com/bloomberg/blazingmq/pull/584 automatically fixes the issue. Mystery solved.

Apr 04 '25 22:04 kaikulimu

@emelialei88's https://github.com/bloomberg/blazingmq/pull/584 has been merged.

Jul 18 '25 21:07 kaikulimu