operator-lifecycle-manager
operator-lifecycle-manager copied to clipboard
Attempting to reinstall an operator that uses conversion webhooks fails
Bug Report
What did you do?
When attempting to reinstall an operator that uses conversion webhooks by
- Deleting the operator subscription and any CSVs associated with it
- Recreating the operator subscription
The resulting InstallPlan enters a failed state with message similar to
error validating existing CRs against new CRD's schema for "devworkspaces.workspace.devfile.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"workspace.devfile.io", Version:"v1alpha1", Resource:"devworkspaces"}: conversion webhook for workspace.devfile.io/v1alpha2, Kind=DevWorkspace failed: Post "https://devworkspace-controller-manager-service.test-namespace.svc:443/convert?timeout=30s": service "devworkspace-controller-manager-service" not found
When the original CSVs are deleted, the operator's main deployment and service are removed, but CRDs are left in-cluster. However, since the service/CA bundle/deployment that serve the conversion webhook are removed, conversion webhooks are broken at that point. Eventually this impacts garbage collection on the cluster as well.
This can be reproduced by installing the DevWorkspace Operator from the Red Hat catalog. (I can provide yamls/upstream images that reproduce as well, if that's helpful). It may be necessary to create a DevWorkspace in the cluster before deletion, e.g. by oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/main/samples/plain.yaml
What did you expect to see? Operator is able to be reinstalled without removing CRDs and all instances.
What did you see instead? Under which circumstances? It's necessary to completely remove the operator including CRDs. For our operator (DevWorkspace), this also makes uninstall especially complicated as finalizers are used (so CRDs cannot be deleted if the controller is removed, and the controller cannot be restored by reinstalling)
Environment
-
operator-lifecycle-manager version: 4.10.24
-
Kubernetes version information: Kubernetes Version: v1.23.5+012e945 (OpenShift 4.10.24)
-
Kubernetes cluster kind: OpenShift
Possible Solution The only details I noticed that may be relevant are
- The order of objects in the InstallPlan varies per install, and sometimes CRDs are applied earlier
- The installPlan fails at the point of trying to apply CRDs

Additional context Add any other context about the problem here.
Thanks for submitting this ticket @amisevsk.
When attempting to reinstall an operator that uses conversion webhooks by
- Deleting the operator subscription and any CSVs associated with it
- Recreating the operator subscription
The resulting InstallPlan enters a failed state with message similar toerror validating existing CRs against new CRD's schema for "devworkspaces.workspace.devfile.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"workspace.devfile.io", Version:"v1alpha1", Resource:"devworkspaces"}: conversion webhook for workspace.devfile.io/v1alpha2, Kind=DevWorkspace failed: Post "https://devworkspace-controller-manager-service.test-namespace.svc:443/convert?timeout=30s": service "devworkspace-controller-manager-service" not found
Yep, that's a valid bug and something we need to address. OLM does not allow a CSV to "adopt/update" an existing CRD if it would invalidate existing CRs on cluster. In this case, the installation is failing because OLM can't check existing CRs because there is no operator servicing the conversion webhook defined in the CRD.
There are a few paths forward
- Option 1: Disable Conversion Webhooks when deleting CSVs: When uninstalling a CSV, if the CRD is not "owned" by a new CSV (as it would during an upgrade), change the conversion strategy to None.
- Option 2: Don't perform CRD/CR validation if openapiv3schema has not changed: When installing or upgrading a CSV, do not perform CRD/CR schema validation checks if the openapiv3 schema has not changed.
I prefer option 1, as the conversion webhook won't impact GC between uninstalling (deleting the subscription + csv) and reinstalling the operator, but will need to confirm that it fully resolves the issue. I'll be sure to comment on this issue with what I find.
What did you expect to see? Operator is able to be reinstalled without removing CRDs and all instances.
You may be able to workaround this by manually setting the CRD's conversion strategy to none. I'm going to spin up a cluster to check.
You may be able to workaround this by manually setting the CRD's conversion strategy to none. I'm going to spin up a cluster to check.
The proposed workaround works, if you set all CRD Conversion Strategies to None prior to reinstalling the operator. This further convinces me to support for option 1.
@awgreene Thank you for testing it out -- this is something we will need to document.
I assume this is the case, but just to double-check: in the case that a user deletes the subscription and CSV without setting conversion to None, the workaround can still be used (i.e. we can set conversion to None before reinstalling), correct?
I just did a quick test in a small cluster and setting
spec:
conversion:
strategy: None # note .spec.conversion.webhook is removed here
on all involved CRDs worked as expected and I was able to reinstall the operator.
Thanks again!
I'll work on getting a PR to update a CRD's conversionStrategy to None if no CSV owns the CRD.
I'll work on getting a PR to update a CRD's conversionStrategy to None if no CSV owns the CRD.
I'm not sure I entirely understand, but note that we currently support some flows where the operator is not installed via OLM, so in some cases CRDs may exist on the cluster that were not created by OLM, and these CRDs need to keep their conversion settings. In these cases, we use service.beta.openshift.io/inject-cabundle: 'true' and mount the certificate to the controller ourselves.
Edit: unless this is a reference to the install process from OLM, in which case nevermind -- makes sense.
Sorry, it would be part of OLM's cleanup of a CSV:
- Check if the CSV defined any conversion webhooks
- If a conversion webhook was defined in the CSV, check if the CSV is being replaced or uninstalled
- If uninstalled, set the conversion strategy to None.
thanks for the update, in the DevWorkspace Operator case the SOP looks the following way (main trick is setting the conversion strategy to None in CRDs) - https://gist.github.com/ibuziuk/f6260950670ffefd3b5bbfa5aec637bb
@awgreene was the fix for this one backported for OpenShift 4.10/4.11?