
Quorum queue declaration behavior when initial replica count < actual cluster member count at declaration time

rubber-ant opened this issue • 11 comments

Issue discussed with @mkuratczyk on slack: https://rabbitmq.slack.com/archives/CTMSV81HA/p1638543089101700

Example issue: a cluster with 3 nodes where quorum queues end up with only 1 member instead of 3 at declaration time.

[Management UI screenshots omitted]

Scenario:

This happens when a queue is declared by a consumer/producer pod that starts running before the initial RabbitMQ cluster forms. Once rabbitmq-0 starts, all the quorum queues stick with only one member (rabbitmq-0). Even after rabbitmq-1 and rabbitmq-2 come up and join the cluster, they are ignored by the existing quorum queues unless I manually delete the queues or run rabbitmq-queues add_member for each queue.
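For reference, a sketch of that manual workaround, assuming a queue named my-qq in the default vhost and the node names used above (all names are placeholders):

    # inspect quorum queue membership (quorum queues report leader/members)
    rabbitmqctl list_queues name type leader members

    # add the missing replicas, one queue at a time
    rabbitmq-queues add_member --vhost / my-qq rabbit@rabbitmq-1
    rabbitmq-queues add_member --vhost / my-qq rabbit@rabbitmq-2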

rubber-ant avatar Dec 03 '21 18:12 rubber-ant

I ran some quick tests and I can see "Will start up to 3 replicas for quorum queue queue X" in the (debug) logs, so the intention is there. Setting x-quorum-initial-group-size=3 explicitly doesn't help, and neither does rabbitmq-queues rebalance quorum. The problem seems to be that once the remaining nodes join the cluster, queues just ignore that: even though there are now 3 nodes in the RabbitMQ cluster, queues declared with a replication factor of 3 stay on the nodes that existed when they were declared and don't "spread" to the new nodes.
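For context, that argument is set at declaration time; a minimal sketch via rabbitmqadmin (the queue name is a placeholder), which, as noted, does not help if the other nodes haven't joined yet:

    # declare a quorum queue and explicitly ask for 3 replicas
    rabbitmqadmin declare queue name=my-qq durable=true \
      arguments='{"x-queue-type":"quorum","x-quorum-initial-group-size":3}'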

mkuratczyk avatar Dec 03 '21 18:12 mkuratczyk

If all your rabbit nodes aren't clustered at the time of queue declaration, they won't be included in the quorum queue node selection. There is no way they could be. It is best to wait until the infrastructure dependency is fully started/formed before running apps, or to use rabbitmq-queues grow after each rabbit node joins.
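A sketch of the grow approach, run after each node has joined the cluster (node names are placeholders):

    # add a replica on rabbit@rabbitmq-1 for all matching quorum queues
    rabbitmq-queues grow rabbit@rabbitmq-1 all

    # patterns can limit which vhosts/queues are grown
    rabbitmq-queues grow rabbit@rabbitmq-2 all --vhost-pattern ".*" --queue-pattern "^my-"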

kjnilsson avatar Dec 03 '21 19:12 kjnilsson

We recently introduced a configuration property that allows configuring the intended cluster size (cluster_formation.target_cluster_size_hint). Therefore, we could reject queue declarations until the number of cluster nodes is equal to (or, I guess, higher than) the configured value. We are discussing the details within the team and will update this issue once we agree on them.
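For reference, a minimal rabbitmq.conf sketch, assuming a cluster that is expected to have 3 nodes:

    # rabbitmq.conf
    # the number of nodes the cluster is expected to reach
    cluster_formation.target_cluster_size_hint = 3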

mkuratczyk avatar Dec 07 '21 14:12 mkuratczyk

If a user declares a quorum queue with a replication factor of N and there aren't that many nodes in the cluster at that time, I would expect either:

  • The declaration to immediately fail
  • The queue to "grow" once nodes become available

rabbitmq-queues grow is a workaround but it's not called out in the docs like it should be:

https://www.rabbitmq.com/rabbitmq-queues.8.html#Cluster https://www.rabbitmq.com/quorum-queues.html#replica-management

I suspect it's typical of applications to reconnect and declare queues as fast as possible so this scenario may not be as unlikely as it seems.

lukebakken avatar Dec 07 '21 17:12 lukebakken

Quorum queues intentionally do not try to be smart about how many replicas they should have. The number of replicas is explicit by design.

Rejecting queue declaration before the cluster is formed sounds right to me but it will leave some users pissed: "I don't care if my infrastructure is ready, I don't want any exceptions in my lovely app!"

So if the default is 3 replicas, we would have to return and log a very specific message, and only introduce this change in 3.10. Clients that try to connect before the cluster is formed have been an issue for years: developers just don't care.

A more radical solution would be to refuse client connections until a certain number of cluster members is online. This will cause a pitchfork revolution for the same reasons.

michaelklishin avatar Dec 10 '21 11:12 michaelklishin

Developers don't always know whether things are fully ready or not. For example, in a GitOps environment, once the app is deployed, it will try to connect. It probably can't (and I'd argue it shouldn't even try to) determine whether all RabbitMQ nodes are ready - if it successfully connected, all is good from the app's perspective. While dependencies can be expressed in GitOps tools, operators (not developers) would need to be aware that deploying the app after the cluster is deployed is not sufficient - they would need to explicitly check that all nodes are up before allowing the app to be deployed. I think we can solve this problem on the RabbitMQ side.
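As a point of comparison, operators who do want to gate app rollout today can block until the expected member count is reached; a sketch using rabbitmqctl (the count matches the intended cluster size):

    # block until 3 cluster members are online, then let the app deploy
    rabbitmqctl await_online_nodes 3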

I'm working on a change that relies on cluster_formation.target_cluster_size_hint and, therefore, should not affect anyone who has not configured this property explicitly (in other words, it'll be an opt-in solution).

I was thinking about reject vs retry and I agree - I don't think we've ever had a case where a queue declaration could fail temporarily; if it failed, retrying was pointless. Therefore, I'm currently working on an internal retry - after all, those "missing" nodes should join soon.

The change I'm working on can be summarized as: if the number of nodes in the cluster is lower than cluster_formation.target_cluster_size_hint, then the node should not import definitions, and once it's up, any rabbit_queue_type:declare/2 attempts should sleep and retry endlessly. From the app's perspective, there are two possible outcomes:

  • if load_definitions is configured, the connection will fail with an authentication failure because the definitions have not been imported yet (and therefore, there are no users)
  • or the declaration will take a long time (and will potentially time out)

I believe both cases should be handled quite well by existing apps as this is exactly what happens before the first node starts (timeout) or while it loads definitions (authentication failure).
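For concreteness, a sketch of the two settings this proposal ties together (the definitions path is a placeholder):

    # rabbitmq.conf
    load_definitions = /etc/rabbitmq/definitions.json
    cluster_formation.target_cluster_size_hint = 3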

It's also important to remember that this problem, and the proposed solution, only affect the initial cluster formation. Once the cluster has formed, if all nodes go down and then one starts up, applications will be able to connect and declare queues (which may lead to an imbalanced queue distribution in some scenarios, but that's a separate problem).

mkuratczyk avatar Dec 10 '21 12:12 mkuratczyk

Correction: I just realized that listeners are started only after the import completes, so the authentication failure probably won't happen as described. However, it's still a connection failure, so there is a chance the app will try again, or will stop/crash and be restarted, and hopefully by the time it tries again it will connect successfully.

mkuratczyk avatar Dec 10 '21 12:12 mkuratczyk

@michaelklishin how do you feel about a change like this? It requires more testing but the initial tests are promising - so far it fixes this issue and potentially some more (definitions imported multiple times which can take a lot of time in some environments).

One thing that could also be considered is to partially import definitions on all but the last node. Currently I skip the import altogether on those nodes, but that means:

  1. Users will get an authentication failure if they try too early (it's OK, but it's a new behavior and can be confusing)
  2. All work is done at the very end, by the last node

Definitions are imported by type, starting with users, vhosts, permissions, and topic_permissions, followed by queues and then other things. We could import everything up to, but excluding, queues. That would prevent authentication failures and should improve total cluster startup time further, at the cost of some code complexity and a bit of redundant work (but still less than now, since currently we just (re)import everything on every node).
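For illustration, a trimmed definitions file showing that ordering; everything above the queues section could be imported early (all names and values are placeholders):

    {
      "users": [{"name": "app-user", "password_hash": "...", "tags": ""}],
      "vhosts": [{"name": "/"}],
      "permissions": [{"user": "app-user", "vhost": "/",
                       "configure": ".*", "write": ".*", "read": ".*"}],
      "topic_permissions": [],
      "queues": [{"name": "my-qq", "vhost": "/", "durable": true,
                  "arguments": {"x-queue-type": "quorum"}}]
    }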

mkuratczyk avatar Dec 10 '21 16:12 mkuratczyk

@mkuratczyk I'm afraid such a behavior change, even though it is reasonable and won't affect 99.9% of the lifecycle of many clusters, might still be breaking for applications that cannot reconnect or are not ready to handle an operation timeout, for example. Which is a large % of them, anecdotally speaking.

michaelklishin avatar Jan 03 '22 15:01 michaelklishin

I now see this pair of behaviors as favoring consistency or availability when the cluster only has a subset of nodes:

  • The current behavior tries to fulfil client operations as best it can, thus favoring availability
  • The behavior @mkuratczyk proposes arguably focuses on reducing resource usage and on declaring queues with exactly the configured number of replicas, even if client operations would sometimes fail, thus favoring consistency

I don't know about definition import, but it makes some sense to focus on consistency for QQ declarations as that's what QQs are all about. Still, this would be a breaking change for some poorly developed apps.

michaelklishin avatar Jan 04 '22 09:01 michaelklishin

This PR remediates this problem in a different way. Quorum queues can still be declared with a lower number of members than configured, but the members will be added when additional nodes join.

https://github.com/rabbitmq/rabbitmq-server/pull/8218
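With that change in place, membership can be observed catching up as nodes join; a sketch of the verification commands (the queue name is a placeholder):

    # Raft status of a single quorum queue
    rabbitmq-queues quorum_status my-qq

    # or list members for all queues
    rabbitmqctl list_queues name type members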

mkuratczyk avatar Jul 11 '23 09:07 mkuratczyk