Quorum queue replica that was shut down can rejoin after a restart and queue deletion, re-declaration
Describe the bug
Hi,
The issue below involves deleting and recreating a queue while a node is down, which means that most users will not be affected by this.
We've identified an issue with quorum queues where an out-of-date replica can come back as the leader and resend past log entries, causing the now-follower to reapply local effects, so the new consumer receives messages that were already processed.
This leads to duplicate message delivery even though the messages were acknowledged properly and the queue processed the acks. Essentially the log is replayed in its entirety, so messages processed days ago can reappear.
The effect of it is similar to https://github.com/rabbitmq/ra/issues/387.
In some scenarios this issue causes the queue to actually become broken, but that is expected given the bad internal state.
We know that the proper solution is not to delete the queue, but ra should probably also have some built-in protection that prevents out-of-date members from rejoining the cluster, or at least from becoming leaders.
I think a potential solution would be to include a cluster ID in the pre_vote and request_vote_rpc messages. As far as I can tell, there is currently no shared cluster ID for ra clusters; there is a uid, but that belongs to the server, not to the cluster.
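To illustrate the idea, here is a minimal Python sketch, not ra's actual message format or API, of a Raft-style vote handler that carries a hypothetical shared cluster ID in the vote request and refuses to grant votes to candidates from a different cluster incarnation (the log up-to-dateness check is omitted for brevity):

```python
# Illustrative sketch only: models how a shared cluster ID carried in
# pre_vote / request_vote_rpc messages could let a node reject vote requests
# from a stale replica belonging to a previous incarnation of the queue.
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestVote:
    cluster_id: str       # hypothetical shared ID, minted when the queue is (re)declared
    term: int
    candidate_id: str


@dataclass
class Server:
    server_uid: str       # per-server uid (this already exists in ra)
    cluster_id: str       # proposed per-cluster ID (does not exist in ra today)
    current_term: int = 0
    voted_for: Optional[str] = None

    def handle_request_vote(self, rpc: RequestVote) -> bool:
        # Proposed guard: refuse candidates from a different cluster incarnation,
        # e.g. a replica left over from a deleted and re-declared queue.
        if rpc.cluster_id != self.cluster_id:
            return False
        if rpc.term < self.current_term:
            return False
        if rpc.term > self.current_term:
            # New term: forget any previous vote (standard Raft behaviour, simplified).
            self.current_term = rpc.term
            self.voted_for = None
        if self.voted_for not in (None, rpc.candidate_id):
            return False
        self.voted_for = rpc.candidate_id
        return True


if __name__ == "__main__":
    # The queue was deleted and re-declared, so the new members share a new cluster ID.
    new_cluster = str(uuid.uuid4())
    follower = Server(server_uid=str(uuid.uuid4()), cluster_id=new_cluster)

    # A stale replica from the old incarnation asks for votes: rejected.
    stale = RequestVote(cluster_id=str(uuid.uuid4()), term=5, candidate_id="rmq1_old")
    print(follower.handle_request_vote(stale))   # False: different cluster ID

    # A legitimate candidate from the new incarnation is still granted the vote.
    fresh = RequestVote(cluster_id=new_cluster, term=1, candidate_id="rmq2")
    print(follower.handle_request_vote(fresh))   # True
```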
Reproduction steps
- Use a three-node cluster (a client-side sketch of these steps follows the list).
- Connect a client to "rmq1"
- Create a quorum queue named "test" on "rmq1"
- Create a consumer on "rmq1" for queue "test"
- Publish a single message with a unique identifier (e.g. the current time)
- Acknowledge the message on the consumer
- Shut down "rmq1"; the client is disconnected
- The client reconnects to one of the remaining nodes ("rmq2")
- The client deletes and recreates the queue
- The client creates a consumer for queue "test"
- Restart the down node "rmq1"
- The queue starts up on "rmq1", notices that it is more up to date than the newly created replica on "rmq2", becomes the leader, and the other nodes revert back to followers
- "rmq1" notices that the (newly created) followers are missing some log indexes and resends append_entries for those log items - log message "setting last index to 3, next_index 4 for…"
- The follower receives the entries from the log and applies them on top of its bad initial state, so it sends the already-processed message out to the current local consumer.
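For reference, here is a minimal client-side sketch of these steps using the pika Python client. The hostnames ("rmq1", "rmq2"), default credentials, and the use of basic_get plus interactive prompts instead of a long-running consumer are assumptions for brevity; stopping and restarting the broker nodes has to be done out of band.

```python
import time
import pika

QUEUE = "test"


def connect(host):
    # Assumes default port and guest credentials for brevity.
    return pika.BlockingConnection(pika.ConnectionParameters(host=host))


# Declare the quorum queue on rmq1, publish one uniquely tagged message,
# consume it and acknowledge it.
conn = connect("rmq1")
ch = conn.channel()
ch.queue_declare(queue=QUEUE, durable=True, arguments={"x-queue-type": "quorum"})
ch.confirm_delivery()
marker = str(time.time())
ch.basic_publish(exchange="", routing_key=QUEUE, body=marker.encode())

method = None
while method is None:
    method, _props, body = ch.basic_get(queue=QUEUE, auto_ack=False)
assert body.decode() == marker
ch.basic_ack(delivery_tag=method.delivery_tag)
conn.close()

# Shut down rmq1 out of band here; the client is disconnected.
input("Stop rmq1, then press Enter to continue...")

# Reconnect to rmq2, delete and re-declare the queue, attach a consumer.
conn = connect("rmq2")
ch = conn.channel()
ch.queue_delete(queue=QUEUE)
ch.queue_declare(queue=QUEUE, durable=True, arguments={"x-queue-type": "quorum"})

# Restart rmq1 out of band here. With the bug, the stale replica on rmq1 becomes
# the leader, replays its old log, and the already-acknowledged message carrying
# the old marker is delivered again to this consumer.
input("Start rmq1 again, then press Enter to continue...")


def on_message(channel, method, properties, body):
    print("received:", body.decode())  # with the bug, the old marker reappears here
    channel.basic_ack(delivery_tag=method.delivery_tag)


ch.basic_consume(queue=QUEUE, on_message_callback=on_message)
ch.start_consuming()
```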
Expected behavior
One or all of the following :-)
- A deleted replica cannot rejoin the ra cluster
- A deleted replica deletes itself
- A deleted replica cannot become leader
Additional context
I can share some traces or debug output, but I'm not sure they make sense without context.
I've attached the "restart sequence"; nothing special.