operator-lifecycle-manager

Bundle unpacking doesn't retry after DeadlineExceeded even after resubscribing.

Open rohantmp opened this issue 3 years ago • 6 comments

Bug Report

Seen something like this on ~~two~~ many separate occasions now.

What did you do?

  • Installed a catalog-source and subscription at the same time.
  • Image pull hit the dockerhub rate limits, so the catalog-source bundle didn't come up.
  • The InstallPlan went into condition DeadlineExceeded with the message: "Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline"
  • Came back later when the catalog-source was ready and recreated the subscription.
  • The new InstallPlan instantly went into condition DeadlineExceeded
  • Workaround: After much tinkering, found that deleting the failed job in openshift-marketplace and then creating the subscription allowed the install to proceed.

What did you expect to see? When recreating a subscription, retry the bundle lookup even if the previous failed job exists.

What did you see instead? Under which circumstances? When recreating a subscription, artifacts of the previous failed bundle lookup prevent a retry.

Environment

  • operator-lifecycle-manager version: release-4.9

  • Kubernetes version information:

    oc version
    Client Version: 4.9.0-0.nightly-2021-07-16-011609
    Server Version: 4.9.0-0.nightly-2021-07-16-011609

  • Kubernetes cluster kind: OpenShift 4.9

Possible Solution

See if the bundle lookup job was the result of a previous installplan and recreate it instead of failing.
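
The check proposed above could be sketched roughly as follows. This is a minimal illustration in Go, not OLM's actual code: the `unpackJob` type, its fields, and the `shouldRecreate` function are hypothetical names, and the sketch assumes the job records which InstallPlan created it.

```go
package main

import "fmt"

// unpackJob is a hypothetical, simplified view of a bundle unpack Job.
// OLM's real types differ; these fields are assumptions for illustration.
type unpackJob struct {
	Name             string
	Failed           bool   // true if the Job ended in a failed state
	FailureReason    string // for example "DeadlineExceeded"
	OwnerInstallPlan string // name of the InstallPlan that created the Job
}

// shouldRecreate reports whether a leftover job should be deleted and
// recreated: it failed, and it was created by an earlier InstallPlan
// rather than the one currently being processed.
func shouldRecreate(job unpackJob, currentInstallPlan string) bool {
	return job.Failed && job.OwnerInstallPlan != currentInstallPlan
}

func main() {
	stale := unpackJob{
		Name:             "unpack-example",
		Failed:           true,
		FailureReason:    "DeadlineExceeded",
		OwnerInstallPlan: "install-old",
	}
	// A new InstallPlan finds the failed job left by the old one.
	fmt.Println(shouldRecreate(stale, "install-new"))
	// The InstallPlan that owns the failed job does not retry on its own.
	fmt.Println(shouldRecreate(stale, "install-old"))
}
```

With a check like this, a recreated subscription would replace the stale job instead of inheriting its DeadlineExceeded status.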


rohantmp avatar Jul 19 '21 08:07 rohantmp

So the expected flow would be that when the subscription is recreated, the unpack job is recreated too, whereas now the job stays in a failed state? That would require invalidating the "old" unpack job and configmap, even though they have the same name because the bundle is the same. Potentially we could use a hash of these names instead, or store it as an annotation somewhere.
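
One way to read the hash idea is sketched below, purely as an illustration. The `unpackJobName` function and the notion of an "attempt key" (for example, the UID of the InstallPlan) are assumptions made here, not OLM's current naming scheme. The point is only that mixing a per-attempt value into the hash gives a retried unpack a different job name from the failed one.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// unpackJobName derives a job name from the bundle image reference and a
// per-attempt key. Both the function and the scheme are hypothetical.
func unpackJobName(bundleImage, attemptKey string) string {
	// A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
	sum := sha256.Sum256([]byte(bundleImage + "\x00" + attemptKey))
	// Use only the first 8 bytes (16 hex characters) to keep the name short.
	return fmt.Sprintf("unpack-%x", sum[:8])
}

func main() {
	bundle := "example.com/org/operator-bundle:v1.0.0" // placeholder image reference
	first := unpackJobName(bundle, "installplan-uid-1")
	retry := unpackJobName(bundle, "installplan-uid-2")
	fmt.Println(first)
	fmt.Println(retry)
	fmt.Println(first != retry) // same bundle, different attempt, different name
}
```

The trade-off is that old jobs and configmaps no longer get overwritten by name, so something would still have to garbage-collect them, which ties into the ownership question.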

exdx avatar Jul 22 '21 14:07 exdx

Seems like we could have an ownership issue between the configmap and the job unpacker. We could revise the ownerrefs to make this more intuitive.

exdx avatar Jul 22 '21 14:07 exdx

There is another aspect to this: after a failed install due to a bundle unpack/pull failure, re-attempting to install the operator by removing the Subscription and ClusterServiceVersion will run into the stale, failed instance of the previous bundle job. OLM will subsequently not attempt to install the operator because it can't overcome the failed bundle. The bundle should have been removed as soon as the InstallPlan or ClusterServiceVersion was removed.

This also seems to impact successful installs that are triggered by patching a Subscription to point to another CatalogSource, as part of testing operator upgrades. It could be a separate issue, but unless you also remove the succeeded unpack jobs after removing the Subscription / ClusterServiceVersion, re-attempting to test an upgrade like this also fails.

dmesser avatar Sep 09 '21 12:09 dmesser

I encountered this issue and resolved it by deleting the Jobs in the operator's namespace, and any InstallPlan too. Then I restarted the OLM operator.

EronWright avatar Oct 28 '21 22:10 EronWright

> I encountered this issue and resolved it by deleting the Jobs in the operator's namespace, and any InstallPlan too. Then I restarted the OLM operator.

@EronWright did you reinstall the operator? Are your steps similar to this KB: https://access.redhat.com/solutions/6459071

TiloGit avatar Mar 22 '22 21:03 TiloGit

Simply deleting the job works. Thanks @EronWright

dicolasi avatar Feb 19 '23 10:02 dicolasi