operator-lifecycle-manager icon indicating copy to clipboard operation
operator-lifecycle-manager copied to clipboard

Attempting to reinstall an operator that uses conversion webhooks fails

Open amisevsk opened this issue 3 years ago • 8 comments

Bug Report

What did you do?

When attempting to reinstall an operator that uses conversion webhooks by

  1. Deleting the operator subscription and any CSVs associated with it
  2. Recreating the operator subscription

The resulting InstallPlan enters a failed state with message similar to

error validating existing CRs against new CRD's schema for "devworkspaces.workspace.devfile.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"workspace.devfile.io", Version:"v1alpha1", Resource:"devworkspaces"}: conversion webhook for workspace.devfile.io/v1alpha2, Kind=DevWorkspace failed: Post "https://devworkspace-controller-manager-service.test-namespace.svc:443/convert?timeout=30s": service "devworkspace-controller-manager-service" not found

When the original CSVs are deleted, the operator's main deployment and service are removed, but CRDs are left in-cluster. However, since the service/CA bundle/deployment that serve the conversion webhook are removed, conversion webhooks are broken at that point. Eventually this impacts garbage collection on the cluster as well.

This can be reproduced by installing the DevWorkspace Operator from the Red Hat catalog. (I can provide yamls/upstream images that reproduce as well, if that's helpful). It may be necessary to create a DevWorkspace in the cluster before deletion, e.g. by oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/main/samples/plain.yaml

What did you expect to see? Operator is able to be reinstalled without removing CRDs and all instances.

What did you see instead? Under which circumstances? It's necessary to completely remove the operator including CRDs. For our operator (DevWorkspace), this also makes uninstall especially complicated as finalizers are used (so CRDs cannot be deleted if the controller is removed, and the controller cannot be restored by reinstalling)

Environment

  • operator-lifecycle-manager version: 4.10.24

  • Kubernetes version information: Kubernetes Version: v1.23.5+012e945 (OpenShift 4.10.24)

  • Kubernetes cluster kind: OpenShift

Possible Solution The only details I noticed that may be relevant are

  1. The order of objects in the InstallPlan varies per install, and sometimes CRDs are applied earlier
  2. The installPlan fails at the point of trying to apply CRDs olm-upgrade

Additional context Add any other context about the problem here.

amisevsk avatar Aug 03 '22 21:08 amisevsk

Thanks for submitting this ticket @amisevsk.

When attempting to reinstall an operator that uses conversion webhooks by

  • Deleting the operator subscription and any CSVs associated with it
  • Recreating the operator subscription

The resulting InstallPlan enters a failed state with message similar toerror validating existing CRs against new CRD's schema for "devworkspaces.workspace.devfile.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"workspace.devfile.io", Version:"v1alpha1", Resource:"devworkspaces"}: conversion webhook for workspace.devfile.io/v1alpha2, Kind=DevWorkspace failed: Post "https://devworkspace-controller-manager-service.test-namespace.svc:443/convert?timeout=30s": service "devworkspace-controller-manager-service" not found

Yep, that's a valid bug and something we need to address. OLM does not allow a CSV to "adopt/update" an existing CRD if it would invalidate existing CRs on cluster. In this case, the installation is failing because OLM can't check existing CRs because there is no operator servicing the conversion webhook defined in the CRD.

There are a few paths forward

  • Option 1: Disable Conversion Webhooks when deleting CSVs: When uninstalling a CSV, if the CRD is not "owned" by a new CSV (as it would during an upgrade), change the conversion strategy to None.
  • Option 2: Don't perform CRD/CR validation if openapiv3schema has not changed: When installing or upgrading a CSV, do not perform CRD/CR schema validation checks if the openapiv3 schema has not changed.

I prefer option 1, as the conversion webhook won't impact GC between uninstalling (deleting the subscription + csv) and reinstalling the operator, but will need to confirm that it fully resolves the issue. I'll be sure to comment on this issue with what I find.

What did you expect to see? Operator is able to be reinstalled without removing CRDs and all instances.

You may be able to workaround this by manually setting the CRD's conversion strategy to none. I'm going to spin up a cluster to check.

awgreene avatar Aug 05 '22 14:08 awgreene

You may be able to workaround this by manually setting the CRD's conversion strategy to none. I'm going to spin up a cluster to check.

The proposed workaround works, if you set all CRD Conversion Strategies to None prior to reinstalling the operator. This further convinces me to support for option 1.

awgreene avatar Aug 05 '22 15:08 awgreene

@awgreene Thank you for testing it out -- this is something we will need to document.

I assume this is the case, but just to double-check: in the case that a user deletes the subscription and CSV without setting conversion to None, the workaround can still be used (i.e. we can set conversion to None before reinstalling), correct?

amisevsk avatar Aug 05 '22 16:08 amisevsk

I just did a quick test in a small cluster and setting

spec:
  conversion:
    strategy: None # note .spec.conversion.webhook is removed here

on all involved CRDs worked as expected and I was able to reinstall the operator.

Thanks again!

amisevsk avatar Aug 05 '22 17:08 amisevsk

I'll work on getting a PR to update a CRD's conversionStrategy to None if no CSV owns the CRD.

awgreene avatar Aug 05 '22 17:08 awgreene

I'll work on getting a PR to update a CRD's conversionStrategy to None if no CSV owns the CRD.

I'm not sure I entirely understand, but note that we currently support some flows where the operator is not installed via OLM, so in some cases CRDs may exist on the cluster that were not created by OLM, and these CRDs need to keep their conversion settings. In these cases, we use service.beta.openshift.io/inject-cabundle: 'true' and mount the certificate to the controller ourselves.

Edit: unless this is a reference to the install process from OLM, in which case nevermind -- makes sense.

amisevsk avatar Aug 05 '22 18:08 amisevsk

Sorry, it would be part of OLM's cleanup of a CSV:

  • Check if the CSV defined any conversion webhooks
  • If a conversion webhook was defined in the CSV, check if the CSV is being replaced or uninstalled
  • If uninstalled, set the conversion strategy to None.

awgreene avatar Aug 05 '22 18:08 awgreene

thanks for the update, in the DevWorkspace Operator case the SOP looks the following way (main trick is setting the conversion strategy to None in CRDs) - https://gist.github.com/ibuziuk/f6260950670ffefd3b5bbfa5aec637bb

ibuziuk avatar Aug 10 '22 16:08 ibuziuk

@awgreene was the fix for this one backported for OpenShift 4.10/4.11?

amisevsk avatar Sep 27 '22 21:09 amisevsk