cluster-operator
Operator should check all instances versions before upgrading itself
As a user I want to make sure I will be able to upgrade all my clusters through the operator. To prevent upgrade issues, the operator should not upgrade itself unless all instances are on the latest minor or major version the operator supports before the upgrade. If one or more instances are not upgraded, the operator should refuse to upgrade itself and issue a warning. The user should be able to override this with some flag in the YAML (to ignore upgrade path problems, or similar).
... such that we comply with the version compatibility guidance quoted below:
However, note that only upgrading from one minor to the next minor or major is supported. To upgrade from e.g. 3.6.16 to 3.8.7, it is necessary to upgrade to 3.7.28 first.
https://www.rabbitmq.com/feature-flags.html#version-compatibility
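To make the quoted rule concrete, here is a minimal sketch of a "no skipping minors" check. This is not Operator code; the function names and the exact interpretation of "next minor or major" are assumptions for illustration:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a version like "3.8.7" into major and minor.
// The patch component is ignored for the compatibility check.
func parseVersion(v string) (major, minor int, err error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("cannot parse version %q", v)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// isSupportedUpgrade reflects the rule quoted above: a cluster may stay on
// the same minor, move to the next minor within the same major, or move to
// the next major.
func isSupportedUpgrade(from, to string) (bool, error) {
	fMaj, fMin, err := parseVersion(from)
	if err != nil {
		return false, err
	}
	tMaj, tMin, err := parseVersion(to)
	if err != nil {
		return false, err
	}
	switch {
	case tMaj == fMaj:
		return tMin == fMin || tMin == fMin+1, nil // same or next minor
	case tMaj == fMaj+1:
		return true, nil // next major
	default:
		return false, nil
	}
}

func main() {
	for _, c := range [][2]string{{"3.6.16", "3.7.28"}, {"3.6.16", "3.8.7"}} {
		ok, _ := isSupportedUpgrade(c[0], c[1])
		fmt.Printf("%s -> %s supported: %v\n", c[0], c[1], ok)
	}
}
```

Under this sketch, 3.6.16 -> 3.7.28 passes and 3.6.16 -> 3.8.7 is rejected, matching the example in the docs.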
A key point with this is that the validation needs to happen before the operator is upgraded, warning when attempting to apply YAML to upgrade the operator version. This feels like a good case for a validating webhook for the operator itself, if that's possible?
I wonder under what circumstances this would actually be used. The quote from the RabbitMQ docs is not necessarily an argument for a check like this - we might as well allow the operator to upgrade, but then prevent a (hypothetical) 3.8->3.10 cluster upgrade. For me that's more flexible and works better when the person responsible for the Operator's deployment and the owner of the RabbitMQ cluster are different folks on different teams. Now, how the Operator would know that a 3.8->3.10 upgrade is being attempted is less clear - since images can be relocated and renamed, relying on image tags can be tricky (although we could make that an explicit requirement for relocated images).
Another option is something we discussed a few months back, which is to add metadata to the OCI images. When the image property changes in the RabbitmqCluster definition, Operator could query the metadata of the new image from the OCI Registry to find version information (version is already available in that metadata; you can use docker inspect rabbitmq to see what's there). A nice bonus here would be that the image could contain information about the minimum operator version required so that if there is a breaking change in the image/RabbitMQ, we could make sure that even an old Operator would behave correctly (it wouldn't know how to handle the breaking change itself but it could refuse to upgrade to that version). Hopefully we would never need that feature if we just prevented skipping minors but it did happen in the past (there was a 3.6.x patch release that had a breaking change).
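As a sketch of the metadata idea: assuming the image carried version labels, the Operator could read them from the inspected image config. Note the assumptions here: `org.opencontainers.image.version` is a standard OCI annotation key but is not guaranteed to be present on every image, and the `com.rabbitmq.min-operator-version` key is entirely hypothetical (the "minimum operator version" label does not exist today):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// imageConfig models the subset of an inspected image config we care about,
// i.e. the label map under Config.Labels.
type imageConfig struct {
	Config struct {
		Labels map[string]string `json:"Labels"`
	} `json:"Config"`
}

// versionFromMetadata extracts the RabbitMQ version and the (hypothetical)
// minimum required operator version from raw image-config JSON.
func versionFromMetadata(raw []byte) (rabbitmqVersion, minOperator string, err error) {
	var cfg imageConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", "", err
	}
	return cfg.Config.Labels["org.opencontainers.image.version"],
		cfg.Config.Labels["com.rabbitmq.min-operator-version"], nil
}

func main() {
	// In reality this JSON would come from the OCI registry API or
	// `docker inspect`; here it is inlined for illustration.
	raw := []byte(`{"Config":{"Labels":{
		"org.opencontainers.image.version":"3.9.13",
		"com.rabbitmq.min-operator-version":"1.10.0"}}}`)
	v, minOp, err := versionFromMetadata(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("rabbitmq:", v, "min operator:", minOp)
}
```

The appeal of this approach is that it survives image relocation and renaming, as long as relocated images keep their labels intact.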
This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.
@MirahImage @yaronp68 @mkuratczyk Is there any update on this discussion? Browsing the reconciler code in the operator, we do not see RabbitMQ version compatibility being considered; the StatefulSet is simply updated with the latest spec.
The operator's documentation highlights "Scaling up and automated rolling upgrades of RabbitMQ clusters". Would it be better to consider version compatibility checks during rolling upgrade attempts?
Also, we do not see features like maintenance mode or rebalancing queues during cluster scaling being used. Are these not in the scope of the operator yet?
Hi.
Rebalancing and maintenance mode are used: https://github.com/rabbitmq/cluster-operator/blob/main/api/v1beta1/rabbitmqcluster_types.go#L76-L80 https://github.com/rabbitmq/cluster-operator/blob/main/internal/resource/statefulset.go#L620
As for version checks, we keep coming back to this discussion but it's not implemented yet. Have you run into a case where an upgrade failed because of the lack of such checks?
The problem is ultimately with the flexibility of containers. Since we allow the image value to be set arbitrarily, we can't really tell what version will be deployed after the image field is changed (it is currently possible to set the image to something like rmq:prod with no version info at all). As in my previous comment - we could make a specific tag structure a requirement or query the registry first to check image metadata (again, we would need to require relocated images to still contain specific keys/values in the metadata) but for the time being we don't do that.
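To illustrate why tag-based detection is fragile, here is a small hypothetical helper (not Operator code) that tries to extract a version from an image reference and has to give up on tags like `rmq:prod`:

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRE matches a version embedded in an image tag, e.g.
// "rabbitmq:3.9.13-management" or "rabbitmq:3.9". It deliberately does not
// try to cover every registry/tag layout - that is the point.
var versionRE = regexp.MustCompile(`:v?(\d+\.\d+(?:\.\d+)?)(?:-|$)`)

// tagVersion returns the version found in the image reference, or ok=false
// for tags such as "prod" or "latest" that carry no version information.
func tagVersion(image string) (version string, ok bool) {
	m := versionRE.FindStringSubmatch(image)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	for _, img := range []string{"rabbitmq:3.9.13-management", "rmq:prod"} {
		if v, ok := tagVersion(img); ok {
			fmt.Printf("%s -> %s\n", img, v)
		} else {
			fmt.Printf("%s -> version unknown\n", img)
		}
	}
}
```

Any such parser only works if a specific tag structure is made a requirement for relocated images; otherwise the `version unknown` branch is unavoidable.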
@mkuratczyk No, we have not run into the issue so far. But we were just examining the upgrade procedures, considering the many documented incompatible versions in RabbitMQ. Are there any mitigation plans in place in case such an incompatible version upgrade is performed and quorum is disrupted?
Can you give some examples of documented incompatible versions in RabbitMQ?
This issue is here because we realize the day will likely come when we need to implement that. However, the Operator was developed with RabbitMQ 3.8 in mind, and since then 3.9 has been released. A 3.8->3.9 upgrade should be pretty straightforward, so there was no need to implement this check yet.
I think this should be handled differently - a common way would be to set a "cluster version" in your rabbitmqcluster.rabbitmq.com CRD. The user should not be able to set a custom Docker image URL/tag. The operator should then watch the CRD and, if you increase your cluster version from 3.8 -> 3.9, it should trigger a rolling upgrade of the existing containers (one by one).
Thanks for your input. This is one of the options we are considering. However, it's not without drawbacks. Many large users relocate images to their internal registries. As I mentioned in the comments above, this leads to potentially unpredictable image names. At the very least, the Operator would need its own configuration to define the registry. Likely, it would also need some configuration options for tags, or a requirement not to change the tags when relocating the image. We are deferring work on this issue, hoping that a pattern will emerge in the Kubernetes community so that we can follow it. For the time being, we offer maximum flexibility to enable different use cases.