
Enable all feature flags on upgrade

MirahImage opened this issue on Jan 25, 2023 · 15 comments

The default behavior of the cluster operator should be to enable all feature flags after an upgrade as an additional PostDeploy step.

Currently, all feature flags are enabled when a cluster is created, but they are never enabled again afterwards. If an upgrade introduces a new feature flag, that flag is not enabled automatically. This could cause future upgrades to fail without manual intervention to enable the feature flags.

This behavior could be disabled by disabling the PostDeploy steps, much like the queue rebalance.
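For reference, the manual equivalent of the proposed PostDeploy step is a single command run against any node of the cluster. A minimal sketch, assuming a RabbitmqCluster named `my-cluster` in the `default` namespace (both names are placeholders; the cluster operator names pods `<cluster>-server-N`):

```bash
# Enable every feature flag supported by the running node.
# This is what the proposed PostDeploy step would run after an upgrade.
kubectl exec -n default my-cluster-server-0 -c rabbitmq -- \
  rabbitmqctl enable_feature_flag all
```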

MirahImage · Jan 25 '23 14:01

I think it's safer for a human operator to decide when to enable what kind of feature flag. Enabling a feature flag could - depending on the migration function of the feature flag - pause certain operations in RabbitMQ:

As an operator, the most important part of this procedure to remember is that if the migration takes time, some components and thus some operations in RabbitMQ might be blocked during the migration.

However, having an opt-in (or opt-out) option sounds reasonable.
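For completeness, the manual, per-flag workflow described above looks roughly like this (the flag name is only an example; review each flag's migration cost before enabling it):

```bash
# Inspect which feature flags exist and which are still disabled.
rabbitmqctl list_feature_flags

# Enable one specific flag once a human has reviewed its migration cost,
# instead of enabling everything at once.
rabbitmqctl enable_feature_flag stream_queue
```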

ansd · Jan 25 '23 14:01

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

github-actions[bot] · Apr 06 '23 00:04

Sometimes it's not as easy as "I think it's safer for a human operator to decide when to enable what kind of feature flag".

Current deployment scenarios might have one team taking care of the operating system layer (including updates of installed packages) while another team is responsible for the application layer and service configuration. Running a "yum update" or "apt upgrade" should not break the application. Furthermore, I cannot see a way to fix the updated package/configuration once the update has caused the service to stop starting. One cannot simply update rabbitmq and enable new feature flags afterwards, as the service might just not start after the update.

I understand feature flags should be enabled on purpose by someone who understands what's going on, but on the other hand the service should start even after the service binaries got updated. This is causing quite some hassle for us at the moment, as we either need to drop the whole complex configuration including users and passwords, or (the way we do it at the moment) roll back rabbitmq to an earlier version, enable all feature flags and then update again. This of course causes quite some downtime, which is just not right ... come on guys ... you can do better than that!
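A hedged sketch of the roll-back workaround described above, on a Debian/Ubuntu host; the package version is a placeholder and depends on what your repository provides:

```bash
# 1. Roll back to the previously installed RabbitMQ version (placeholder version).
apt-get install --allow-downgrades rabbitmq-server=3.11.28-1

# 2. With the old version running again, enable all feature flags it supports.
rabbitmqctl enable_feature_flag all

# 3. Upgrade again; the new version now finds all required flags enabled.
apt-get install rabbitmq-server
```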

ftdcn · Jun 20 '23 09:06

Auto enabling feature flags might also be better implemented in rabbitmq-server itself: https://github.com/rabbitmq/rabbitmq-server/issues/5212

ansd · Jun 20 '23 10:06

I have been using rabbitmq successfully for many years, but the behavior of the feature flags annoys me a lot. I updated my server without setting the feature flags first, and the updated server no longer starts. rabbitmqctl enable_feature_flag all seems to work only when the server is running. And actually I just wanted a problem-free update (like in all the years before). Just my 2 cents from a very satisfied RabbitMQ user.

mundus08 · Jun 24 '23 07:06

@mundus08 what would be your suggestion?

  1. if we don't have a mechanism like feature flags, we can't change certain things (sometimes we can't even fix bugs), because all nodes in the cluster need to behave the same way, so either no evolution or no rolling upgrades
  2. if we automatically enable all feature flags after an upgrade, downgrades will be impossible (they are not supported, but people do use them, especially when there are post upgrade issues)
  3. to never enforce feature flags, we would need to maintain backwards compatibility forever, which is super hard

Asking users to run one command once they are confident the upgrade succeeded would seem like a reasonable compromise...

We can consider options such as enabling all flags automatically on the next upgrade after the one in which an FF was introduced, for example. Say you upgrade from x.y.z to x.y+1.0 and there's a new feature flag. If you then upgrade to x.y+1.1, we would automatically enable all FFs introduced in x.y+1.0, basically assuming that since you upgraded again, the previous upgrade must have been successful. This still wouldn't solve all upgrade paths, but it would make things easier for those who upgrade regularly. The drawback is that some FFs could have an expensive migration process, which would be triggered automatically and could surprise users in a different way...

mkuratczyk · Jun 24 '23 08:06

@mkuratczyk Please excuse the late reply. I'm probably not a typical user, as I'm only running a single-node installation, so I can't evaluate the different options. My expectation would be that after an unattended update (I use Ansible to update all my Debian servers) the RabbitMQ server would be in a stable state. If necessary, the update should simply not be carried out if the server would then no longer start due to a missing feature flag. Since the upgrade was automated, I cannot tell whether there was a warning that Ansible ignored.

mundus08 · Aug 16 '23 10:08

This can't be fully guaranteed for all cases in a stateful service. Imagine you have an old version running, with some data on disk. You decide to start managing your machine differently and keep it up to date regularly (say, with Ansible). Suddenly you upgrade from, say, 3.8 to 3.12. Some on-disk representation changed and 3.12 can no longer read data stored by 3.8. On a dev machine, the easiest solution is to delete the data, which is something you can totally add to your scripts if that's acceptable for you, but not something we can just add to happen by default (obviously many users do care about their data).

We are looking at options to make upgrades simpler and to have this kind of issue occur less often, but you can't expect distributed stateful services to always upgrade successfully unattended. There's a reason so many people want to use cloud/managed data services - they effectively outsource such concerns to somebody else. :)

mkuratczyk · Aug 16 '23 11:08

One clarification: I forgot this is in the context of the Operator. In this case, there are some additional/different considerations. Ideally, the Operator would indeed prevent such upgrades. The problem is that, at least with the current design, surprisingly, the Operator doesn't know what version it's upgrading to. The only source of such information is the image tag, which is reliable in most, but not all, cases. For example, some users relocate publicly available images to their local registries and change the tags in the process (the tag may not contain the version at all). Another example: when we perform tests as part of RabbitMQ development, we use images with branch names or commit SHAs instead of versions. People may also use floating tags (e.g. 3-management).

Having said that, I agree that it'd be nice to add such functionality to the Operator. It could behave the same way as it does currently, when it can't find the exact version in the tag and be smarter when it does (it'd need to assume the image tag doesn't lie).
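To illustrate the tag problem: the only version hint the Operator has is `spec.image` of the RabbitmqCluster resource, and that tag may or may not encode a version. A quick way to see what the Operator sees (cluster name and namespace are placeholders):

```bash
# Print the image the Operator would have to parse a version out of.
# Pinned tags like "rabbitmq:3.12.4-management" carry a version;
# floating tags like "rabbitmq:3-management" or relocated/internal tags may not.
kubectl get rabbitmqcluster my-cluster -n default \
  -o jsonpath='{.spec.image}'
```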

mkuratczyk · Aug 16 '23 11:08

@mkuratczyk

I see the need to force human intervention in the upgrade process, as well as the fact that the agents (apt / ansible / operator / ...) don't have the needed visibility to enable the required flags or halt the upgrade. I also agree that the change, whatever it might be, needs to happen in rabbitmq-server (https://github.com/rabbitmq/rabbitmq-server/issues/5212).

However, I still would like to pile on and say that a regular upgrade process should not irrecoverably break running services.

I got hit with this just now on my development machine because of a simple apt upgrade. The upgrade, entirely non-specific to rabbitmq, resulted in a non-functioning and not trivially recoverable single node cluster. This wasn't even an apt full-upgrade or dist-upgrade or whatever it's called lately - which typically is where possibly breaking changes should come from.

While I can nuke the stored data and start over here, or install the previous version using apt and enable the flags, this circumstance fills me with fear for how the in-kubernetes deployment will fare if the image were to be updated for any reason. The downtime associated with that is likely going to be significantly longer, and more importantly, exponentially more expensive.

To make matters worse, the first time I heard of rabbitmq feature flags is when the upgrade broke and I looked at the logs.

As an example, postgresql also has potentially incompatible data formats between versions. Despite this, a postgres cluster does not break during upgrade. Yes, it does require some manual work to upgrade the cluster afterwards and it probably has the wrong / multiple versions running until you do this, but in over a decade of running postgres, I have never had breaking failures when naively upgrading postgres along with the system it is in.

At a minimum, it should be made possible to enable feature_flags on an offline cluster. Those admins who have kept up with the flags will get a seamless transition, and those who leave it to apt or the operator or similar will have a clean way to recover. If this results in data loss, a suitable warning can be provided at the time, with a suggestion to roll back the version first in case the data is important.

chintal · Aug 22 '23 07:08

Just happened to me during an upgrade on a machine using its packaging system - I updated a (single node) node from 3.8 to 3.11 and it is unable to start. Also unable to downgrade, since this would break all other packages due to dependencies. There has to be a way to enable feature flags without the node running, to get back into a running state? (If starting over - downgrading is not possible - is the only solution, I think it's time to look for alternative MQs that allow being repaired in case of errors ...)

tspspi · Oct 08 '23 22:10

  1. This repo is about the Kubernetes Operator, which I don't think you are using
  2. Upgrading directly from 3.8 to 3.11 is not supported in the first place.
  3. If it is a dev machine, just delete the data folder (/var/lib/rabbitmq/* or whatever it is for you) and that's it - you will start a fresh instance of the new version
  4. If you can't do the above, you can try downgrading and manually editing the feature_flags file (in the RabbitMQ data directory) to see if you can start 3.8 this way (a rough sketch of options 3 and 4 follows below)
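A rough sketch of options 3 and 4 above for a single dev node; the paths and file names are assumptions and vary with your installation (the data directory and the exact feature_flags file name depend on the node name):

```bash
# Option 3: dev machine only - wipe the data directory and start fresh.
systemctl stop rabbitmq-server
rm -rf /var/lib/rabbitmq/mnesia/*   # destroys all messages and definitions!
systemctl start rabbitmq-server

# Option 4: downgrade the package, then locate and inspect the feature_flags
# file in the data directory (the file name is node-specific,
# e.g. rabbit@host-feature_flags) before starting the old version again.
ls /var/lib/rabbitmq/mnesia/*feature_flags
```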

If you want to spend the time looking for a different messaging system - go ahead, but you can also start contributing to the project you already rely on. That's what open source is about.

mkuratczyk · Oct 09 '23 07:10

I have an idea that can be a middle ground between all the opinions expressed in this issue. The Cluster Operator already has a CONTROL_RABBITMQ_IMAGE env variable that does the following (quoting the docs):

EXPERIMENTAL! When this is set to true, the operator will always automatically set the default image tags. This can be used to automate the upgrade of RabbitMQ clusters, when the Operator is upgraded. Note there are no safety checks performed, nor any compatibility checks between RabbitMQ versions.

We could extend the behaviour of this variable to also always enable feature flags. This behaviour would be considered experimental, just like the existing behaviour of this env variable. My argument for this suggestion is that automatically enabling all feature flags after every upgrade is a sort of "hands free" or "auto-pilot" management of rabbitmq, which is in the same spirit as automatically changing the RabbitMQ image.
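For context, this is how the variable is set today on the Operator Deployment (the deployment name and namespace below are the defaults from the release manifest and may differ in your installation); the feature-flag behaviour proposed above would piggyback on the same switch:

```bash
# Opt the Operator into "auto-pilot" image management (experimental).
# Under the proposal above, this would also enable all feature flags post-upgrade.
kubectl -n rabbitmq-system set env deployment/rabbitmq-cluster-operator \
  CONTROL_RABBITMQ_IMAGE=true
```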

Zerpet · Oct 16 '23 10:10