osm icon indicating copy to clipboard operation
osm copied to clipboard

Error when performing upgrades on v1.0.0

Open steeling opened this issue 3 years ago • 2 comments

We have a bug in our upgrade path when upgrading from v1.0.0 forwards:

  1. The v1.0.0 osm-bootstrap attempts to write a mesh config on startup @ version v1alpha1. It does this prior to starting it's servers, which include the crd conversion webhook.
  2. In an upgrade, the new one comes up, which now writes the new version v1alpha2. Similarly, it does not start it's servers until this is successfully written.
  3. This triggers a call from K8s to the conversion webhook. However if the old pod is down, the call fails, because the new pod has not yet started its server. ie: there are no servers to respond to the conversion webhook.
  4. The old pod can attempt to come back up, but it also fails, due to trying to path the mesh config CRD, but the patch omits v1alpha2, which fails k8s validation.

Therefore no bootstrap pods can come up, which the remainder of the control plane depends upon.

steeling avatar Aug 04 '22 21:08 steeling

Added default label size/needed. Please consider re-labeling this issue appropriately.

github-actions[bot] avatar Aug 06 '22 00:08 github-actions[bot]

It looks like if enableReconciler is set to true it can lead to issues as well

steeling avatar Aug 10 '22 21:08 steeling

expanding on my last comment: the reconciler attempts to keep things at the current version. If 2 OSM versions are up for a brief period of time, one which has written the new CRD's/webhook/etc, the old version will try to rewrite them.

@jaellio this also breaks cert rotation for those webhooks

steeling avatar Aug 17 '22 21:08 steeling

Doesn't the reconciler also re-create the mutatingwebhookconfiguration if it's deleted?

keithmattix avatar Aug 19 '22 17:08 keithmattix

The issue seems to be due to the way that the CRD patching is coupled with the cert rotation handler code -> we pass in the patchCrds method in crdconversion.go as onCertChange - thus, the first time the certificate gets issued to the conversionwebhook, patchCrds is called before the server is started.

One potential solution is to not immediately call patchCrds the first time the certificate is issued - the actual rotation handler logic will be the same, since that is configured to handle future cert rotations. But the first time setCert is called before the server starts, we don't call patchCrds - instead we call it directly in crdconversion.go once the server starts.

This way, if the old osm-bootstrap pod is down, the conversionwebhook could respond to the server on the new bootstrap pod. If both osm-bootstrap pods are still up, k8s can use either the old or new osm-bootstrap service to update the apiversion on the meshconfig.

nshankar13 avatar Aug 25 '22 16:08 nshankar13

One potential follow up task to this though could be to "bulletproof" this solution by ensuring that k8s only uses the new osm-bootstrap conversionwebhook server to update the meshconfig. While the apiconversion could in theory still work using the conversionwebhook server on the old bootstrap pod, down the road there may be situations where we want to only use the new webhook servers. One idea that @steeling proposed is adding a label to the new service and using a service selector.

nshankar13 avatar Aug 25 '22 16:08 nshankar13

actually patchCRDs is not the issue (we can verify that the new CRD gets created), but that immediately upon startup the bootstrap attempts to create the new meshconfig version (via ensureMeshConfig). It's this creation, which happens prior to the patchCrds call that is causing the issue.

steeling avatar Aug 25 '22 18:08 steeling

@steeling it looks like both ensureMeshConfig and patchCrds are issues. For ensureMeshConfig, we call MeshConfigs.update() if the meshconfig already exists. patchCrds eventually calls CustomResourceDefinitions.Update() - I think both of these update() calls need to happen after the conversionwebhookserver gets started.

nshankar13 avatar Aug 25 '22 18:08 nshankar13

In crdconversion.go, line 90:

if _, err = crdClient.CustomResourceDefinitions().Update(context.Background(), crd, metav1.UpdateOptions{}); err != nil { log.Error().Err(err).Msgf("Error updating conversion webhook configuration for crd : %s", crdName) return err }

in osm-bootstrap.go, line 321:

if _, err := b.configClient.ConfigV1alpha2().MeshConfigs(b.namespace).Update(context.TODO(), config, metav1.UpdateOptions{}); err != nil { return err }

As both of these calls update the meshconfig crd / cr, these could trigger calls from the k8s-api to the conversionwebhookserver. Thus, both of these function calls need to happen after the conversionwebhookserver gets started and not before.

Even though crdconversion.go only changes the CRD conversion spec, not the CR itself, there's still a chance that updating this spec could trigger a call from k8s to the conversion webhook if the meshconfig object already exists. And for an upgrade, the meshconfig object has already been created and deployed by the previous OSM version.

nshankar13 avatar Aug 26 '22 17:08 nshankar13

Spoke offline, but for posterity, k8s calls this out here https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#writing-reading-and-updating-versioned-customresourcedefinition-objects

So explicitly, the conversion does not happen when writing a CRD.

More definitively, even if it did, the conversion would have already taken place from the ensureMeshConfig call, which writes in the new version

steeling avatar Aug 26 '22 17:08 steeling

Discussed this with @steeling over a call, but another option is to get rid of the crdconversion webhook (and server) entirely. If we look at the conversionHandler methods for the crds in crdconversion.go, each crdconversion handler (except the meshconfig) does nothing if the apiVersions are the same between the previous and current CRD in the upgrade. If the new CRD has additional fields, those fields will get added automatically. Even in the case of the meshconfig, all we do is delete certain unsupportedFields from v1alpha2 to v1alpha1. However, these additional fields will anyway automatically be dropped, because they are not defined in the v1alpha1 CRD spec anymore.

This can be done by setting the conversion strategy to None in the crd spec: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#customresourceconversion-v1-apiextensions-k8s-io

The init-osm-bootstrap pod will apply the new CRD and thus update the conversion strategy of the original meshconfig CRD to not depend on the conversionwebhook server anymore.

nshankar13 avatar Aug 26 '22 20:08 nshankar13

To clarify a bit the points above:

our current webhooks do nothing that the implicit conversion wouldn't do anyways, when strategy is set to None. Further:

  1. Currently, the conversion webhook that is used will unintuitively be from the old pod, since the new pod needs the conversion to happen prior to the webhook starting.
  2. In a future case, where there are multiple objects being converted, or what is currently the case for a mutating and validating webhook: during an upgrade the webhook that is hit can be from either an old or a new version.

There's a lot more nuances here, especially with the CRD cases. I'm going to have a doc out soon with concrete proposals on how we can improve our upgrade story, and how to better handle CRDS.

For now, I'm in agreement with @nshankar13 that we should remove the conversion webhook, and explicitly set strategy: None on all of our CRD's.

steeling avatar Aug 26 '22 20:08 steeling

Discussed this with @steeling over a call, but another option is to get rid of the crdconversion webhook (and server) entirely. If we look at the conversionHandler methods for the crds in crdconversion.go, each crdconversion handler (except the meshconfig) does nothing if the apiVersions are the same between the previous and current CRD in the upgrade. If the new CRD has additional fields, those fields will get added automatically. Even in the case of the meshconfig, all we do is delete certain unsupportedFields from v1alpha2 to v1alpha1. However, these additional fields will anyway automatically be dropped, because they are not defined in the v1alpha1 CRD spec anymore.

This can be done by setting the conversion strategy to None in the crd spec: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#customresourceconversion-v1-apiextensions-k8s-io

The init-osm-bootstrap pod will apply the new CRD and thus update the conversion strategy of the original meshconfig CRD to not depend on the conversionwebhook server anymore.

The conversion webhook is necessary when there are schema changes between the versions of the resource and we want to transparently handle these without user intervention. For e.g., consider a field foo that has been dropped in the new version in favor of a new field baz. We want a framework in place where if the value of foo can be translated to the value baz, that's possible. Otherwise, updating API versions risks breaking existing deployments.

If there's an alternative proposal that is capable of introducing new API changes in a way that doesn't break existing deployments without the need for a conversion webhook, that's an option worth discussing especially if it simplifies the upgrade path.

shashankram avatar Aug 26 '22 20:08 shashankram

Currently the conversion webhook does nothing however, except cause bugs and headaches. I recommend that, if and when, we ever need to drop a field, we can revisit how to go about that, and whether a conversion webhook is even the right choice, given some of the caveats I mentioned above

steeling avatar Aug 26 '22 21:08 steeling

Currently the conversion webhook does nothing however, except cause bugs and headaches. I recommend that, if and when, we ever need to drop a field, we can revisit how to go about that, and whether a conversion webhook is even the right choice, given some of the caveats I mentioned above

I think we should evaluate that as a part of the proposal. I am not in favor of removing it if it's something we will require for certain as we evolve our APIs with newer versions. It's important to understand whether these are bugs that can be easily fixed, or if the system is overly complex and unnecessary for our use cases.

shashankram avatar Aug 26 '22 21:08 shashankram

Sure we can do a proposal, and punt on the decision till that's out.

As a quick point to get some data, I'm curious if we tried to quantify the effort if that would shed some light on the decision? Let's say we had a PR out to rip that code out, how much work would you anticipate it would be to revert the PR in the future? My estimate is on the order of days, which to me indicates we'd be safe going forward with current plans.

^^ regardless of estimate happy to wait, but I think it's a useful exercise to try to quantify the chance of making the wrong decision, ie: (probability of revert) X (1/# of months till needed) < # of days to implement revert

steeling avatar Aug 26 '22 21:08 steeling

For e.g., consider a field foo that has been dropped in the new version in favor of a new field baz. We want a framework in place where if the value of foo can be translated to the value baz, that's possible. Otherwise, updating API versions risks breaking existing deployments.

@shashankram @steeling I'm wondering if it would be possible to incorporate this logic as part of the helm pre-install hook / init-bootstrap container instead of relying on the conversionwebhook. This wouldn't necessarily simplify the complexity of the code / implementation itself (for instance, the convertMeshConfig method in config_meshconfig_conversion.go), but at least as part of a pre-install hook we can circumvent issues relating to race conditions and timing with the k8s API.

The main issue though is that implementing this conversion logic ourselves in a pre-install job or as part of the init-osm-bootstrap pod over the course of multiple releases, for multiple CRDs could get messy, and would probably be simpler with the conversion webhook.

nshankar13 avatar Aug 31 '22 15:08 nshankar13

The current implementation of conversion webhooks is essentially equivalent to a None conversion strategy. We've got the webhooks in source control, so it doesn't seem like it would be too difficult to revert. If removing code and changing our conversion strategy will fix a bug while effectively being a no-op, that seems like the pretty clear winner here to me, especially as we're trying to get a patch release out

keithmattix avatar Aug 31 '22 21:08 keithmattix

Fixed in #5065

keithmattix avatar Sep 07 '22 18:09 keithmattix