Upgrade across multiple major versions

Open hmi12 opened this issue 1 year ago • 6 comments

Describe the current behavior: Currently, we're running ASO v2.0.0-beta.2. Is it possible to upgrade directly to the latest version?

hmi12 avatar Jun 06 '24 09:06 hmi12

You cannot go directly from beta.2 to the latest GA version, as there were a number of resource management changes between those two versions.

You need to pay close attention to the following changes:

  • breaking changes in -beta.4
    • Manual migration of certain resources may be required. Check if you're using those resources/properties (documented in link above).
  • breaking changes in v2.0.0
    • Alpha versions of CRDs removed, need to run asoctl to clean up alpha resources if you've ever installed them. If you never installed alpha version of CRDs you're fine
    • Must upgrade from v2.0.0-beta.5 due to Helm chart changes. Doesn't apply if you didn't use Helm to install/upgrade.
  • breaking changes in v2.1.0
    • Need to use --crd-patterns to install new CRDs. Your existing CRDs will be upgraded.
  • breaking changes in v2.4.0
    • Beta CRDs are deprecated. You've definitely used these so you'll need to run the asoctl command once you're on v2.3.0.

I would recommend you read the other breaking change notices as well just to make sure you're not using the resources impacted.

The recommended upgrade pattern would be to go to every individual ASO version. This isn't strictly required but it's safest and is what we recommend. That way if something goes wrong it's obvious what the old/new versions are and the changes that might be causing the problem. You're more likely to get quality support from us following this pattern.

A (relatively) cautious but still slightly more risky upgrade would be: v2.0.0-beta.2 -> v2.0.0-beta.4 -> v2.0.0-beta.5 -> v2.0.0 -> v2.1.0 -> v2.3.0 -> v2.4.0 -> v2.7.0

This hits all of the versions that contain major changes but skips over some of the minor version releases that don't have major changes.

A risky upgrade that might still work would be: v2.0.0-beta.2 -> v2.0.0-beta.4 -> v2.0.0-beta.5 -> v2.0.0 -> v2.7.0

This hits the minimum versions that you MUST hit to get from where you are to latest.
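As a rough illustration of the constraint above, the sketch below checks whether a planned upgrade path visits the versions this thread calls out as mandatory when starting from v2.0.0-beta.2. The function name `missing_stops` and the `MANDATORY_STOPS` list are hypothetical helpers written for this example; they are not part of ASO or asoctl, and the list reflects only what is stated in this thread.

```python
# Versions this thread says MUST be visited, in order, when starting
# from v2.0.0-beta.2. Taken from the discussion above, not from ASO tooling.
MANDATORY_STOPS = ["v2.0.0-beta.4", "v2.0.0-beta.5", "v2.0.0"]


def missing_stops(path, mandatory=MANDATORY_STOPS):
    """Return the mandatory versions that `path` skips or visits out of order."""
    missing = []
    position = 0  # index in `path` after the last mandatory stop we found
    for stop in mandatory:
        try:
            position = path.index(stop, position) + 1
        except ValueError:
            missing.append(stop)
    return missing


cautious = ["v2.0.0-beta.2", "v2.0.0-beta.4", "v2.0.0-beta.5", "v2.0.0",
            "v2.1.0", "v2.3.0", "v2.4.0", "v2.7.0"]
risky = ["v2.0.0-beta.2", "v2.0.0-beta.4", "v2.0.0-beta.5", "v2.0.0", "v2.7.0"]
direct = ["v2.0.0-beta.2", "v2.7.0"]

for name, path in [("cautious", cautious), ("risky", risky), ("direct", direct)]:
    print(name, "missing:", missing_stops(path))
```

A path that passes this check can still fail for other reasons; the check only encodes the minimum waypoints and says nothing about the asoctl cleanup steps required along the way.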

Note that in all cases, when you do the upgrade from v2.0.0+ to v2.4.0+ you must follow the v2.4.0 instructions on beta CRD deprecation and swap your CRDs to the GA versions. I would recommend you just do it one ASO version at a time (the recommended pattern). You don't need to actually spend lots of time at each ASO version, you can upgrade to a version, ensure the ASO pod launches successfully with no errors, maybe re-apply one of your resources with a simple edit (change tags or similar) to make sure things are working, and then upgrade again to the next version.

matthchr avatar Jun 06 '24 16:06 matthchr

@matthchr Really appreciate your detailed recommendation. We need to conduct some verification in the testing environment. Or, is the following solution feasible?

  1. Add the "skip-reconcile" annotation to all Azure resources;
  2. Uninstall the old version of ASO from AKS;
  3. Install the latest version of ASO directly;
  4. Finally import the Azure resources using asoctl.

hmi12 avatar Jun 07 '24 11:06 hmi12

That should at least in theory also work. Note that the annotation is reconcile-policy.
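For step 1 of the plan, a minimal sketch of building the `kubectl annotate` command for one resource might look like the following. It assumes the full annotation key is `serviceoperator.azure.com/reconcile-policy` with the value `skip`; please verify that key against the ASO documentation for your version. The helper name `skip_reconcile_command` and the example resource names are hypothetical.

```python
# Assumed full annotation key; confirm against the ASO docs for your version.
RECONCILE_POLICY = "serviceoperator.azure.com/reconcile-policy"


def skip_reconcile_command(kind, name, namespace):
    """Build (but do not run) a kubectl command that marks one resource as skip."""
    return [
        "kubectl", "annotate", kind, name,
        "--namespace", namespace,
        f"{RECONCILE_POLICY}=skip",
        "--overwrite",  # replace any existing reconcile-policy value
    ]


# Hypothetical resource; substitute your own kind, name, and namespace.
cmd = skip_reconcile_command("resourcegroup", "my-rg", "default")
print(" ".join(cmd))
```

The command is returned as a list so it can be reviewed first and then passed to `subprocess.run(cmd, check=True)` once you are happy with it.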

You'll need to make sure that you uninstall the CRDs too (which Helm won't do by default but you can do manually once you've deleted all of the instances of the CRDs).

Since asoctl gives you YAML that you still might need to massage a bit (for providing secrets, etc), and you also already have some (beta) YAML whose shape is likely very similar to the GA YAML shape, it's not clear to me if it'll be easier to start completely from scratch with asoctl imported resources or if it'd be easier to just modify your YAMLs locally to move from beta to GA version of CRDs (which if you follow that breaking change documentation should just be the version itself and maybe a few other small things) and then reapply them.

As to which is easier, the full upgrade outlined above or this approach, it probably depends on how many ASO resources you have. If you have hundreds or thousands of resources you'd need to re-import, it'll probably be easier to just do the upgrade, even accounting for the fact that some of those resources may need to be updated due to the breaking changes mentioned above. Most resources will just need their version changed by swapping the v1beta prefix for v1api (for example, v1beta20200601 becomes v1api20200601). On the other hand, if you don't have that many resources, marking them as reconcile-policy: skip, deleting them (in k8s but not Azure) and then re-importing with asoctl might be easier.
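If you go the route of editing your existing YAML, the version swap could be sketched roughly as below. This assumes the beta and GA versions differ only in the `v1beta` / `v1api` prefix in front of the date (as in `resources.azure.com/v1beta20200601` from the error reported later in this thread). The function name `to_ga_api_version` is hypothetical, and resources affected by the documented breaking changes will need more than this.

```python
import re

# Matches lines such as "apiVersion: resources.azure.com/v1beta20200601",
# optionally as the first key of a YAML list item ("- apiVersion: ...").
_BETA_API_VERSION = re.compile(
    r"^([ \t]*(?:-[ \t]+)?apiVersion:[ \t]*[\w.-]+\.azure\.com/)v1beta(\d{8}\w*)",
    re.MULTILINE,
)


def to_ga_api_version(manifest_text):
    """Rewrite v1beta<date> to v1api<date> on apiVersion lines of *.azure.com groups."""
    return _BETA_API_VERSION.sub(r"\g<1>v1api\g<2>", manifest_text)


before = "apiVersion: resources.azure.com/v1beta20200601\nkind: ResourceGroup\n"
print(to_ga_api_version(before))
```

The rewrite is limited to `apiVersion` lines in `*.azure.com` groups so that unrelated manifests in the same file are left untouched. Diff the result against the original before re-applying anything.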

matthchr avatar Jun 07 '24 16:06 matthchr

It's also worth noting that while the above involves a lot of special cases and gotchas, that's primarily because of the large amount of time between beta.2 and v2.7.0, the fact that the beta CRDs were deprecated, and the fact that in beta.5 we added so many CRDs that the chart became too large for Helm to manage them, so we had to start managing them ourselves.

  • Removal of CRD versions that were once the storage version in Kubernetes requires special handling: running the asoctl tool and updating all of your applied CR versions.
  • Moving from Helm managing the CRDs to the ASO pod managing the CRDs itself required some manual updates of the CRDs to remove Helm annotations for the upgrade path from beta4 to beta5.

Once you're into the GA version (2.0.0+), there are technically small breaking changes here and there but none that are going to impact every resource like the beta->GA migration does. I wouldn't expect a hypothetical v2.5.0 -> 2.14.0 to be this complicated.

matthchr avatar Jun 07 '24 16:06 matthchr

How did you get on? Did you successfully upgrade - and which route did you take?

theunrepentantgeek avatar Jun 25 '24 00:06 theunrepentantgeek

The upgrade is still pending on our task list. We might test both methods in the test environment, but we haven't started yet. We will update here if there are any new findings.

hmi12 avatar Jul 02 '24 14:07 hmi12

@matthchr @theunrepentantgeek We tried uninstalling ASO v2.0.0-beta.2, including removing the old CRDs, and then installing the latest version, but an error occurred while installing the new CRDs. It seems that the deprecated version is still present in etcd and cannot be manually removed. We might need to use asoctl clean crds to migrate the deprecated CRDs, but the prerequisite is: Ensure the current ASO v2 version in your cluster is beta.5.... Therefore, it seems we have to follow the recommended solution and upgrade through each individual ASO version sequentially.

Error message while installing the latest CRDs: request to convert CR from an invalid group/version: resources.azure.com/v1beta20200601

hmi12 avatar Jul 10 '24 08:07 hmi12

That upgrade documentation was definitely written with the "upgrade 1 version at a time" in mind. The reason for the "must be beta.5" is because asoctl clean crds will only remove the beta CRDs if there are other versions "ahead" of them (the GA versions). So it won't work in your case because the CRDs are still old and don't have the new versions yet. BUT: if you've already deleted all of your old Custom Resources and it's just the CRDs that are left, you could just delete the ASO CRDs too and then reinstall them.

Normally deleting CRDs is scary/bad, but if you know there are no instances of the CRs in the cluster it should work. Going 1 version at a time should also work.
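Before deleting CRDs, it helps to confirm that no instances remain. The sketch below parses the JSON that `kubectl get crd -o json` and `kubectl get <crd-name> --all-namespaces -o json` produce. It assumes ASO CRDs can be recognised by an API group ending in `.azure.com`, which may also match CRDs from other operators, so review the list by hand. Both function names are hypothetical.

```python
import json


def aso_crd_names(crd_list_json, group_suffix=".azure.com"):
    """Return names of CRDs whose API group ends with `group_suffix`.

    `crd_list_json` is the text output of `kubectl get crd -o json`.
    """
    doc = json.loads(crd_list_json)
    return sorted(
        item["metadata"]["name"]
        for item in doc.get("items", [])
        if item["spec"]["group"].endswith(group_suffix)
    )


def has_instances(resource_list_json):
    """True if `kubectl get <crd-name> --all-namespaces -o json` listed any objects."""
    return len(json.loads(resource_list_json).get("items", [])) > 0


# Trimmed-down sample of what kubectl would return, for illustration only.
sample_crds = json.dumps({"items": [
    {"metadata": {"name": "resourcegroups.resources.azure.com"},
     "spec": {"group": "resources.azure.com"}},
    {"metadata": {"name": "certificates.cert-manager.io"},
     "spec": {"group": "cert-manager.io"}},
]})
print(aso_crd_names(sample_crds))
print(has_instances(json.dumps({"items": []})))
```

Only once `has_instances` is false for every ASO CRD would deleting the CRDs match the "no instances of the CRs in the cluster" condition described above.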

matthchr avatar Jul 10 '24 23:07 matthchr

No further response, closing. Feel free to reopen if you have further questions.

theunrepentantgeek avatar Aug 19 '24 23:08 theunrepentantgeek