operator-lifecycle-manager

Problems arising from deleting and rapidly creating a subscription

Open dtfranz opened this issue 1 year ago • 3 comments

Bug Report

Description

I've noticed that occasionally, after a subscription is updated, then deleted and immediately recreated, the newly created subscription is updated with the status of the old, deleted subscription. This halts installation within the namespace, because the status links the new subscription to the InstallPlan that was garbage-collected when the original subscription was deleted.

Workaround

The issue can be resolved by simply deleting the subscription, then re-creating it after giving the controllers enough time to register the deletion event. Creating it with a different name should also ensure that the issue doesn't happen at all.
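
As a rough illustration of the delete-then-wait approach, here is a minimal Go sketch. It polls until the old subscription (identified by its UID) is no longer visible and then pauses for a grace period. `lookupFunc`, `waitForDeletion`, and the `settle` pause are hypothetical names for this example and not part of OLM or client-go. Seeing the object gone from the API does not prove the controllers have registered the deletion, so the pause is a guess rather than a guarantee.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// lookupFunc is a hypothetical stand-in for reading the subscription from the
// cluster. It reports the UID of the object currently stored under the
// namespace/name and whether any object exists there.
type lookupFunc func() (uid string, found bool)

var errTimeout = errors.New("timed out waiting for the old subscription to disappear")

// waitForDeletion polls lookup until the object with oldUID is gone, making at
// most `attempts` lookups and sleeping `interval` after each unsuccessful one.
func waitForDeletion(lookup lookupFunc, oldUID string, interval time.Duration, attempts int) error {
	for i := 0; i < attempts; i++ {
		uid, found := lookup()
		if !found || uid != oldUID {
			return nil
		}
		time.Sleep(interval)
	}
	return errTimeout
}

func main() {
	// Simulated cluster state: the old object stays visible for two lookups.
	calls := 0
	lookup := func() (string, bool) {
		calls++
		return "cab0d6e8", calls < 3
	}

	if err := waitForDeletion(lookup, "cab0d6e8", 10*time.Millisecond, 10); err != nil {
		fmt.Println("not safe to re-create yet:", err)
		return
	}

	// Assumed grace period so the controllers can observe the delete event.
	settle := 50 * time.Millisecond
	time.Sleep(settle)
	fmt.Println("old subscription gone after", calls, "lookups; re-create it now")
}
```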

Possible Cause

I believe this occurs because items in the cache are keyed by namespace/name, and it may therefore be possible for a controller to update the new subscription with an old status using a stale entry from the cache.
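
To make the suspected mechanism concrete, here is a small self-contained Go sketch of a cache keyed only by namespace/name. The `Subscription` struct, `cacheKey`, and `applyCachedStatus` are simplified, hypothetical stand-ins and not the actual OLM types or cache code. The sketch only shows how a lookup that ignores the UID can hand a deleted object's status to its replacement.

```go
package main

import "fmt"

// Subscription is a simplified, hypothetical stand-in for the real OLM type.
type Subscription struct {
	Namespace      string
	Name           string
	UID            string
	InstallPlanRef string // name of the referenced InstallPlan, if any
}

// cacheKey builds a namespace/name key; the UID plays no part in it.
func cacheKey(s Subscription) string {
	return s.Namespace + "/" + s.Name
}

// applyCachedStatus copies the cached InstallPlan reference onto s when an
// entry with the same namespace/name exists, without checking the UID.
func applyCachedStatus(cache map[string]Subscription, s Subscription) Subscription {
	if cached, ok := cache[cacheKey(s)]; ok {
		s.InstallPlanRef = cached.InstallPlanRef
	}
	return s
}

func main() {
	cache := map[string]Subscription{}

	// The original subscription is cached together with its status.
	old := Subscription{
		Namespace:      "openshift-operators",
		Name:           "project-quay",
		UID:            "uid-old",
		InstallPlanRef: "install-nftdc",
	}
	cache[cacheKey(old)] = old

	// The subscription is deleted and re-created before the cache entry is
	// evicted. The new object has the same namespace/name but a new UID.
	recreated := Subscription{
		Namespace: "openshift-operators",
		Name:      "project-quay",
		UID:       "uid-new",
	}

	recreated = applyCachedStatus(cache, recreated)
	fmt.Printf("UID=%s installPlanRef=%s\n", recreated.UID, recreated.InstallPlanRef)
}
```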

Example

The following is an example of a subscription in this state. Note that no CSVs or InstallPlans were present in the namespace at the time. This was reproduced in an OpenShift 4.13 cluster with a catalog-operator image built from this repo as of commit hash 2be5e58:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: "2023-08-10T18:23:19Z"
  generation: 1
  labels:
    operators.coreos.com/project-quay.openshift-operators: ""
  name: project-quay
  namespace: openshift-operators
  resourceVersion: "334099"
  uid: cab0d6e8-a551-4e39-ad2e-2f3c0a2caf27
spec:
  channel: stable-3.6
  installPlanApproval: Automatic
  name: project-quay
  source: community-operators
  sourceNamespace: openshift-marketplace
status:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: certified-operators
      namespace: openshift-marketplace
      resourceVersion: "307403"
      uid: b1f09bd9-df57-4b1a-8520-a70f2038886b
    healthy: true
    lastUpdated: "2023-08-10T18:22:53Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: community-operators
      namespace: openshift-marketplace
      resourceVersion: "319530"
      uid: b3de47e6-43e6-4436-af68-ebaed5a8a7cd
    healthy: true
    lastUpdated: "2023-08-10T18:22:53Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: redhat-marketplace
      namespace: openshift-marketplace
      resourceVersion: "318894"
      uid: e018a243-535b-4a4e-bec8-d3350a17eded
    healthy: true
    lastUpdated: "2023-08-10T18:22:53Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: redhat-operators
      namespace: openshift-marketplace
      resourceVersion: "324137"
      uid: 007734c1-e24e-4b4b-8edd-6e29e84dc5d3
    healthy: true
    lastUpdated: "2023-08-10T18:22:53Z"
  conditions:
  - lastTransitionTime: "2023-08-10T18:22:53Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - status: "False"
    type: BundleUnpacking
  - message: 'constraints not satisfiable: subscription project-quay requires @existing/openshift-operators//quay-operator.v3.8.10,
      subscription project-quay exists, clusterserviceversion quay-operator.v3.7.11
      exists and is not referenced by a subscription, @existing/openshift-operators//quay-operator.v3.8.10
      and @existing/openshift-operators//quay-operator.v3.7.11 originate from package
      project-quay'
    reason: ConstraintsNotSatisfiable
    status: "True"
    type: ResolutionFailed
  - lastTransitionTime: "2023-08-10T18:23:20Z"
    reason: ReferencedInstallPlanNotFound
    status: "True"
    type: InstallPlanMissing
  currentCSV: quay-operator.v3.8.10
  installPlanGeneration: 3
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-nftdc
    namespace: openshift-operators
    resourceVersion: "333761"
    uid: 5d74d8cb-8170-438c-91ed-3d7d3e44f4ed
  installedCSV: quay-operator.v3.8.10
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-nftdc
    uuid: 5d74d8cb-8170-438c-91ed-3d7d3e44f4ed
  lastUpdated: "2023-08-10T18:23:20Z"
  state: UpgradePending

Impact

While the impact may be high when this occurs, it should be fairly unlikely to happen, given how quickly the subscription must be deleted and re-created to trigger it.

Resolution

In the worst case, this may require re-architecting OLM's internal cache implementation to use UIDs instead of relying on the namespace and name of objects alone. We may also be able to do a UID comparison before doing a status update, but I haven't looked into this very much.
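
The UID comparison idea could look roughly like the following Go sketch. It is only an illustration under assumptions: `Subscription`, `ErrUIDMismatch`, and `updateStatusIfSameUID` are hypothetical names, and the real status update path in OLM has not been checked against this.

```go
package main

import (
	"errors"
	"fmt"
)

// Subscription is a simplified, hypothetical stand-in for the real OLM type.
type Subscription struct {
	Namespace      string
	Name           string
	UID            string
	InstallPlanRef string
}

// ErrUIDMismatch signals that the cached entry describes a different object
// that merely shares the namespace/name of the live one.
var ErrUIDMismatch = errors.New("cached subscription UID does not match live subscription")

// updateStatusIfSameUID copies the status from cached onto live only when both
// refer to the same object, and refuses the update otherwise.
func updateStatusIfSameUID(live *Subscription, cached Subscription) error {
	if live.UID != cached.UID {
		return ErrUIDMismatch
	}
	live.InstallPlanRef = cached.InstallPlanRef
	return nil
}

func main() {
	cached := Subscription{
		Namespace:      "openshift-operators",
		Name:           "project-quay",
		UID:            "uid-old",
		InstallPlanRef: "install-nftdc",
	}
	live := Subscription{
		Namespace: "openshift-operators",
		Name:      "project-quay",
		UID:       "uid-new",
	}

	if err := updateStatusIfSameUID(&live, cached); errors.Is(err, ErrUIDMismatch) {
		// A caller would presumably drop the stale entry and requeue here.
		fmt.Println("skipping status update:", err)
	}
}
```

With a guard like this, the stale entry is rejected instead of being written to the re-created subscription. How the caller should recover (evicting the entry, requeueing, or something else) is left open.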

dtfranz • Aug 14 '23 21:08

Just to note, with ArgoCD this happens quite often. Roughly once a week I see broken operator subscriptions with this issue, which means all other subscriptions are blocked. My current workaround is to delete all pods and jobs in the olm namespace, then delete the CSV and InstallPlan for a subscription. After this, all subscriptions somehow find their way back to a working state. I saw this with 0.26 and with 0.25 on GKE 1.27.

Elyytscha • Nov 28 '23 21:11

I'm also running into this issue fairly often with ArgoCD. I turned off autosync/selfheal, but it cropped up again on one of my clusters.

aceat64 • Jan 09 '24 16:01

Encountering the same issue; it always breaks our automated deployment.

ciiiii • Mar 12 '24 08:03