
Scale down

Open ChunyiLyu opened this issue 4 years ago • 36 comments

Is your feature request related to a problem? Please describe.

Scaling down a RabbitMQ cluster does not work at the moment. We should look into why it fails and what we need to do to support scaling down.


Additional context

  • rabbitmqctl forget_cluster_node might be needed before we delete pods
  • we might need to skip preStop hooks
  • what do we do with non-HA queues that only exist on nodes that will be deleted?

ChunyiLyu avatar Jul 27 '20 10:07 ChunyiLyu

From today's sync-up:

rabbitmq-server might not support scaling down - in which case, should we support it?

I didn't know what 'scaling down doesn't work' meant in this issue, so I dug into it. There are three or four failure modes when scaling down:

  • If we lose the majority of nodes in a cluster during scaling down (e.g. 5->1), we go into pause minority and the whole cluster goes down
  • If any of the nodes are quorum critical or mirror sync critical (as judged by the rabbitmq-queues health checks; example commands follow this list), and this does not self-resolve (e.g. under heavy load or with insufficient consumers), the pod will fail to terminate as the preStop hook never completes
  • Quorum queues require the terminating node to be removed from the RAFT algorithm through rabbitmq-queues shrink
  • Durable, classic queues (i.e. not mirrored/quorum) whose primary node is the node going down will be lost irrecoverably
    • This one is tenuous in my opinion - if you want replication, classic queues aren't going to provide that
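
For anyone reproducing these checks, a hedged sketch of the relevant commands (pod and node names are placeholders; run the checks against the node in question):

# is this node critical for quorum queue availability?
$ kubectl exec foo-server-2 -- rabbitmq-queues check_if_node_is_quorum_critical

# is this node critical for classic mirrored queue synchronisation?
$ kubectl exec foo-server-2 -- rabbitmq-queues check_if_node_is_mirror_sync_critical

# remove quorum queue members (replicas) hosted on the node that is about to go away
$ kubectl exec foo-server-0 -- rabbitmq-queues shrink 'rabbit@<node name from cluster_status>'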

coro avatar Aug 04 '20 12:08 coro

For reference, the bitnami helm chart does allow for scaling down, but from my testing it does this by recreating the cluster, rather than removing single nodes.

coro avatar Aug 04 '20 13:08 coro

@harshac and I played about with the helm chart today. It seems it uses a persistent disk, so even when you scale the cluster and it recreates all of the nodes, it still persists durable queues and persistent messages between iterations of the cluster.

coro avatar Aug 04 '20 13:08 coro

Removing individual nodes would be much less disruptive with the introduction of maintenance mode in 3.8.x. RabbitMQ nodes are never implicitly removed from the cluster: rabbitmqctl forget_cluster_node would indeed be required to permanently remove a node.
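
For reference, maintenance mode is driven by rabbitmq-upgrade; a minimal sketch (the pod name is a placeholder):

# put the node into maintenance mode: it stops accepting client connections and transfers queue leaders
$ kubectl exec foo-server-2 -- rabbitmq-upgrade drain

# take the node out of maintenance mode again
$ kubectl exec foo-server-2 -- rabbitmq-upgrade revive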

michaelklishin avatar Aug 04 '20 16:08 michaelklishin

RabbitMQ server does support permanent node removal from an existing cluster. Raft-based features such as quorum queues and the MQTT client ID tracker require such a node to also be explicitly removed from the Raft member list; see

rabbitmq-queues help delete_member
rabbitmqctl help decommission_mqtt_node
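
As a rough illustration of what those commands look like in use (the queue and node names here are hypothetical):

# remove the departing node's replica of a specific quorum queue
$ rabbitmq-queues delete_member my-quorum-queue rabbit@node-to-remove

# remove the departing node from the MQTT client ID tracker's Raft membership
$ rabbitmqctl decommission_mqtt_node rabbit@node-to-remove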

The problematic part with downsizing is that the reduced number of nodes may or may not handle the same load on the system gracefully. A five-node cluster where all nodes run with the default open file handle limit of 1024 can sustain about 5000 connections but a three-node cluster would not be able to do that.
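
To check what limit your nodes actually run with before downsizing, something like this should work (the pod name is a placeholder):

# effective open file limit inside the pod
$ kubectl exec foo-server-0 -- sh -c 'ulimit -n'

# RabbitMQ's own view of available file descriptors
$ kubectl exec foo-server-0 -- rabbitmq-diagnostics status | grep -i -A 3 'file descriptors'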

michaelklishin avatar Aug 05 '20 10:08 michaelklishin

It doesn't look like there is any native StatefulSet hook for specifically scaling down. The closest I've found is this issue, which covers our case exactly. It points out that there is no way to gracefully decommission a node in Kubernetes StatefulSets.

We will likely have to implement custom logic in the controller for this. Perhaps we can detect in the controller when the spec.replicas field in the CR is decreased, and use this to set flags on all of the pods in the StatefulSet, which lead to different preStop hook logic that does the maintenance mode work mentioned above? There would be no way for us to tell which pods would be brought down in this case, but any that are would have this behaviour. I suspect that would lead to all of the pods running this logic anyway when they are eventually brought down.

coro avatar Aug 13 '20 15:08 coro

Indeed, it'd be great to see that as a native feature as described in that issue. For the time being, we could implement what you said which is more or less what we already have when deleting the cluster: https://github.com/rabbitmq/cluster-operator/blob/main/controllers/rabbitmqcluster_controller.go#L471. We can keep the replicas count untouched until the pod/node is successfully deleted/forgotten.

I'm not sure about this part: "There would be no way for us to tell which pods would be brought down in this case." I'm pretty sure StatefulSet guarantees that if you scale down, the pods with the highest ordinal index will be deleted, and that they will be deleted one by one. So the process could look something like this:

  1. Enable RabbitMQ maintenance mode for the node with the highest index
  2. Wait for the node to be drained
  3. Label (or in some other way mark) the node with the highest index for deletion
  4. Reduce the number of replicas in the StatefulSet
  5. The node with the highest index should be terminated quickly (it's empty anyway)
  6. We can run the forget/delete node command (unless done as part of the node's termination); a command-level sketch of these steps follows this list
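
A command-level sketch of that process, assuming a RabbitmqCluster named foo scaling from 4 to 3 nodes (the names and the deletion label are purely illustrative; nothing interprets that label today):

# 1-2: drain the node with the highest index and wait for it to empty
$ kubectl exec foo-server-3 -- rabbitmq-upgrade drain

# 3: mark the pod for deletion (hypothetical label)
$ kubectl label pod foo-server-3 example.com/marked-for-scale-down=true

# 4-5: reduce the StatefulSet replicas; the highest-index pod terminates first
$ kubectl scale statefulset foo-server --replicas=3

# 6: permanently remove the node from the RabbitMQ cluster
$ kubectl exec foo-server-0 -- rabbitmqctl forget_cluster_node 'rabbit@<node name from cluster_status>'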

mkuratczyk avatar Aug 13 '20 17:08 mkuratczyk

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

github-actions[bot] avatar Jan 02 '21 00:01 github-actions[bot]

Do you have a solution to scale down the pods at the moment? I edited the configuration on the RabbitMQ instance to change replicas from 3 to 1, but nothing happened.

adthonb avatar Jun 15 '21 06:06 adthonb

(Quoting @adthonb's question above.)

Scale down is not supported by the cluster operator and it's not a planned feature for us at the moment. Reducing the number of replicas is ignored by the cluster operator; if you check the operator logs and the events published for your RabbitmqCluster, there should be a line saying "Cluster Scale down not supported".
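
To see that for yourself, something along these lines should work (assuming the default Operator installation in the rabbitmq-system namespace and a RabbitmqCluster named foo):

# look for the warning in the Operator logs
$ kubectl -n rabbitmq-system logs deployment/rabbitmq-cluster-operator | grep -i 'scale down'

# or check the events published for the RabbitmqCluster resource
$ kubectl describe rabbitmqcluster foo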

ChunyiLyu avatar Jun 22 '21 08:06 ChunyiLyu

Just ran across this. I have a development, single-node cluster set up that has 20ish devs, each with their own username and vhost. While doing some testing, I accidentally scaled it up to a 3-node cluster, and now I have no way to bring it back down to one except to destroy and re-create it. Not a big deal, except that I'll have to reconfigure all the users, vhosts, and their permissions all over again.

Given how easy it is to change "replicas: 1 -> 3", it would have been nice to have a similar experience going from 3 -> 1 (at least in a non-production setting).

NArnott avatar Aug 27 '21 16:08 NArnott

Unfortunately, making scale down easy is exactly what we don't want to do until it's well supported (because it would make it easy to lose data). I understand that in your case it's a non-production system, but I don't see how we could make it easy only in that case. Having said that, if you are only concerned with the definitions (vhosts, users, permissions, etc.), you can easily take care of them by exporting the definitions and importing them into a new cluster. Something like this:

# deploy a 3-node cluster (for completeness, you already have one)
$ kubectl rabbitmq create foo --replicas 3

# export the definitions
$ kubectl exec foo-server-0 -- rabbitmqctl export_definitions - > definitions.json

# delete the 3-node cluster
$ kubectl rabbitmq delete foo

# deploy a single-node replacement
$ kubectl rabbitmq create foo

# copy the definitions to the new cluster's pod
$ kubectl cp definitions.json foo-server-0:/tmp/definitions.json

# import the definitions
$ kubectl exec foo-server-0 -- rabbitmqctl import_definitions /tmp/definitions.json

It's a bit mundane but will do the trick. You can replace the last few steps with an import on node startup (see the example).

Keep in mind two caveats:

  • you will lose the messages in the queues, so make sure you only follow these steps where that's not a problem
  • the default user created when a cluster is deployed is somewhat special. If you import the definitions after the new cluster is up (as in the steps above), you will still have the default user of the new cluster (so the new cluster will have one more user than the old one). If you import the definitions on boot (as in the linked example), the new cluster will have exactly the same users as the old one, but unfortunately your Kubernetes Secret (foo-default-user in this example) will contain credentials that won't work (they are generated for the new cluster, but RabbitMQ doesn't create the default user when a definitions import is configured). We hope to find a way to keep the secret in sync in such situations, but for now you can manually update it with the old cluster's credentials so that your secret matches the actual state; see the sketch after this list.
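
If you go the import-on-boot route, one way to bring the Secret back in line is to overwrite its credentials with the old cluster's values. A hedged sketch - the username/password key names are an assumption about the Secret's layout, so check yours first:

# overwrite the generated credentials with the old cluster's user and password
$ kubectl patch secret foo-default-user -p '{"stringData":{"username":"OLD_USER","password":"OLD_PASSWORD"}}'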

mkuratczyk avatar Aug 30 '21 07:08 mkuratczyk

Hi @mkuratczyk, just wanted to ask: is there any update on this feature? Is scale down fully supported now, or is there a plan/ETA? Thanks :D

yuzsun avatar Nov 09 '21 06:11 yuzsun

It's not. Please provide the details of your situation and what you'd expect to happen and we can try to provide a workaround or will take such a use-case into account when working on that.

mkuratczyk avatar Nov 09 '21 07:11 mkuratczyk

We were using a RabbitMQ StatefulSet in Kubernetes and found it was unable to scale down. Whenever we tried to scale the StatefulSet down to 0, 3 replicas remained. The issue only occurs with RabbitMQ; other StatefulSets, like an nginx StatefulSet, scale down to 0 successfully.

yuzsun avatar Nov 10 '21 02:11 yuzsun

  1. The Operator reconciles the StatefulSet so if you configured replicas: 3 in the RabbitmqCluster, when you change the StatefulSet, the Operator will "fix it" because it no longer matches the definition. This is how reconciliation is supposed to work in Kubernetes (the same mechanism is what restarts pods when they disappear - their number no longer matches the defined number of replicas).
  2. You can't decrease the replicas in the RabbitmqCluster; this is to prevent you from losing data. Scaling down this way can lead to data loss in a production environment (there can be data that is only present on some of the nodes, and those nodes could get deleted).

Why do you want to scale down to zero? RabbitMQ is a stateful, clustered, application. It can't start in a fraction of a second when an application tries to connect. Is this some test environment that you want to turn off for the weekend or something like that? This question is what I meant by your use case.

If it is a test env, there are a few things you can try:

  1. Perhaps you don't need to maintain the state in the first place? We now support storage: 0, which doesn't even create the PVCs. If you just want to spin up a RabbitMQ cluster quickly (e.g. to run CI tests) and delete it afterwards - that can be a good option.
  2. If you want to maintain the state but stop/delete the pods for some time, you can try kubectl rabbitmq pause-reconciliation NAME and then perform operations on the StatefulSet that the Operator will not overwrite (reconcile); a sketch of this approach follows this list. But please be careful - by manually changing the StatefulSet you can easily lose data or get into a state that the Operator will not be able to reconcile.
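
A sketch of option 2 for a weekend shutdown, assuming a RabbitmqCluster named foo (resume-reconciliation is available in recent versions of the kubectl-rabbitmq plugin; double-check yours has it):

# stop the Operator from reconciling the cluster, then scale the StatefulSet to zero
$ kubectl rabbitmq pause-reconciliation foo
$ kubectl scale statefulset foo-server --replicas=0

# ...after the weekend, scale back up and let the Operator manage the cluster again
$ kubectl scale statefulset foo-server --replicas=3
$ kubectl rabbitmq resume-reconciliation foo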

But again - please let us know what you are trying to achieve. RabbitMQ is more like a database. I'm not sure why you would run nginx in a StatefulSet, but they are fundamentally different applications, so you can't just expect to manage them the same way.

mkuratczyk avatar Nov 10 '21 08:11 mkuratczyk

For a situation where queue messages are not important, can we do this? For example, we have a 3-node cluster and we want to reduce it to 2: change the CRD replicas to 2, then edit the StatefulSet replicas to 2?

AminSojoudi avatar Nov 10 '21 10:11 AminSojoudi

Please read through this thread and try if you want. I'd expect you to still need to run forget_cluster_node at least. Scale down is unsupported, so we don't have all the answers and don't test this. You can contribute by experimenting and sharing with others.

You didn't share your use case either. Why do you want to go from 3 nodes to 2? 2-node RabbitMQ clusters are hardly supported in the first place. Quorum queues and streams require 3 nodes (or just 1 for test envs). If you don't care about your messages - there are other options (running a single node in the first place, or running a cluster with no persistent storage).

mkuratczyk avatar Nov 10 '21 10:11 mkuratczyk

(Quoting @mkuratczyk's reply above about reconciliation, data loss, and the test-environment options.)

I am working with @yuzsun. Our use case: this is a test environment and we want to stop the Kubernetes cluster running RabbitMQ during the weekend. I will look into the storage: 0 option.

ksooner avatar Nov 11 '21 00:11 ksooner

(Quoting @mkuratczyk's reply above about forget_cluster_node and 2-node clusters.)

I tested that and it worked well without even calling forget_cluster_node. The situation I provided was not real; it was just for testing purposes. Our real scenario is this: we have a 3-node k8s cluster and a RabbitMQ cluster with 3 replicas and podAntiAffinity. Someone in our team accidentally changed the replicas to 4 during high load. Unfortunately, there is no way back: the new RabbitMQ node is stuck in a Pending state and we cannot do anything, even if we remove podAntiAffinity. We cannot even delete the whole RabbitMQ cluster and create a new one, because that way our data would be lost, since the RabbitMQ k8s operator deletes the PVCs.

AminSojoudi avatar Nov 12 '21 09:11 AminSojoudi

@AminSojoudi There's a workaround to scale down, assuming you are ok with potential data loss. If the 4th instance never moved from Pending state, it should be fine.

  1. Set the replicas to the desired number in the RabbitmqCluster spec, e.g. 3
  2. Run kubectl scale sts --replicas=<desired-number> <rabbit-server-statefulset>
  3. Pod exec into one of the remaining nodes
  4. Run rabbitmqctl cluster_status to obtain the names of clustered Disk Nodes
  5. Run rabbitmqctl forget_cluster_node 'rabbit@...', where the name of the node should be the one you deleted.
  6. You may have a leftover PVC for the 4th Pod; you may want to clean it up. (These steps are sketched as commands after this list.)
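
A command-level version of these steps, with hypothetical names (RabbitmqCluster foo, surplus pod foo-server-3):

# 1: set the desired replica count in the RabbitmqCluster spec
$ kubectl patch rabbitmqcluster foo --type merge -p '{"spec":{"replicas":3}}'

# 2: scale the StatefulSet down to the same number
$ kubectl scale sts foo-server --replicas=3

# 3-4: inspect cluster membership from one of the remaining nodes
$ kubectl exec foo-server-0 -- rabbitmqctl cluster_status

# 5: forget the removed node, using the exact name reported by cluster_status
$ kubectl exec foo-server-0 -- rabbitmqctl forget_cluster_node 'rabbit@<removed node name>'

# 6: clean up the leftover PVC (the name assumes the Operator's default volume claim template)
$ kubectl delete pvc persistence-foo-server-3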

For your use case @ksooner, you can skip steps 3-5, since you simply want to scale to 0 over the weekend. The PVC will remain if you scale down to 0, so previously created durable exchanges/queues will still be there when you scale back to 3. Also, you may get Pods stuck in Terminating state for a long time. This is likely due to our pre-stop hook, which ensures data safety and availability; to get around that, set .spec.terminationGracePeriodSeconds in RabbitmqCluster to something low, like 10.

I'd like to re-state that scale down is not supported by the Operator, and this workaround risks data loss, and effectively bypasses most, if not all, data safety measures we've incorporated in the Operator.

Zerpet avatar Nov 12 '21 17:11 Zerpet

This was really helpful thank you for sharing

malmyros avatar Jan 07 '22 12:01 malmyros

@mkuratczyk I'm using K8s on bare-metal and I'm not using shared storage operators.

I have two cases where scaling down looks like it could help:

  1. When I need to replace a node's server with a new server.
  2. When I need to move the cluster between datacenters.

As I can't just move PVs between K8s nodes, I have to scale the cluster up and then down for each node, one by one.

For case 1, I can first scale the cluster up by one new node, then scale it down by one old node. For case 2, I can scale up and down in steps of one node, for each node in the cluster.

Are these valid cases for you? Or can I do something different?

oneumyvakin avatar Nov 19 '22 12:11 oneumyvakin

  1. How is this different from a typical Kubernetes operation like changing CPUs for a pod? In that case, the disk is detached, a new pod (a new "server") is created, and the disk is reattached. It feels like you can do pretty much the same (probably by copying data, not physically moving the disk).
  2. For migrating between data centres, blue-green is generally the best option.

Just to be clear, I know there are cases where scale-down may be helpful. It's just that it's a hard problem to solve in a generic way (that will work for different cases, with different queue types and so on).

mkuratczyk avatar Nov 21 '22 08:11 mkuratczyk

How is this going?

OzzyKampha avatar Mar 08 '23 10:03 OzzyKampha

A good case for why scale-down is useful is cost saving. For example, scaling down all pods in a development namespace to 0 during inactive hours, thus reducing the required node count. If the operator does not support it, then we have a dangling pod left in the namespace.

RicardsRikmanis avatar Mar 24 '23 08:03 RicardsRikmanis

We realize there are use cases, they just don't make the problem any simpler. :) Development environments are the easiest case unless you have some special requirements. What's the benefit of scaling down to zero compared to just deleting and redeploying the cluster the following day? You can keep your definitions in a JSON file or also as Kubernetes resources, so that the cluster has all the necessary users, queues and so on when it starts.

mkuratczyk avatar Mar 24 '23 09:03 mkuratczyk

The other case for cluster scale down: we have a quorum queue cluster with 5 RabbitMQ nodes (one RabbitMQ node per physical k8s node). If 3 k8s nodes go down, the cluster will be unavailable because it has lost the majority. Scaling the replicas down to 3 would make the cluster accessible again.

Looping in experts @mkuratczyk @Zerpet for awareness.

xixiangzouyibian avatar Jun 06 '23 03:06 xixiangzouyibian

(Quoting @Zerpet's scale-down workaround above.)

Should we first label the RabbitmqCluster with pauseReconcile=true to make the operator stop watching the cluster?

xixiangzouyibian avatar Jun 06 '23 04:06 xixiangzouyibian

@xixiangzouyibian RabbitMQ cluster membership and quorum queue membership are separate concerns. Scaling down the RabbitMQ cluster cleanly would require scaling down the quorum queues as well. However, quorum queue membership changes require a quorum of nodes to be available. For situations like the one you described, an API was recently added to force a single quorum queue (a Ra/Raft cluster) member to assume it is now the only member: https://github.com/rabbitmq/ra/pull/306. This is a very dangerous operation and will likely lead to data loss.

You should not lose 3 out of 5 members in the first place.

mkuratczyk avatar Jun 06 '23 06:06 mkuratczyk