operator-lifecycle-manager
Bundle unpacking doesn't retry after DeadlineExceeded even after resubscribing.
Bug Report
Seen something like this on ~~two~~ many separate occasions now.
What did you do?
- Installed a catalog-source and subscription at the same time.
- Image pull hit the dockerhub rate limits, so the catalog-source bundle didn't come up.
- The `InstallPlan` went into condition `DeadlineExceeded` with: `Failed: Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline`
- Came back later when the catalog-source was ready and recreated the subscription.
- The new `InstallPlan` instantly went into condition `DeadlineExceeded`
- Workaround: After much tinkering, found that deleting the failed job in `openshift-marketplace` and then recreating the subscription allowed the install to proceed (sketch below).
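A minimal sketch of that workaround, assuming the stale unpack job lives in `openshift-marketplace`; `<bundle-hash>`, `<my-sub>`, and `<my-namespace>` are placeholders for the hash-derived job/configmap name and your own resources, and `subscription.yaml` stands in for your original manifest:

```sh
# Find the failed bundle unpack job left over from the previous attempt.
oc -n openshift-marketplace get jobs

# Delete the failed job and its configmap (they share the same hash-derived name).
oc -n openshift-marketplace delete job <bundle-hash>
oc -n openshift-marketplace delete configmap <bundle-hash>

# Recreate the subscription so OLM generates a fresh InstallPlan and unpack job.
oc -n <my-namespace> delete subscription <my-sub>
oc -n <my-namespace> apply -f subscription.yaml
```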
What did you expect to see? When recreating a subscription, retry the bundle lookup even if the previous failed job exists.
What did you see instead? Under which circumstances? When recreating a subscription, artifacts of the previous failed bundle lookup prevent a retry.
Environment
- operator-lifecycle-manager version:
release-4.9
- Kubernetes version information:
oc version
Client Version: 4.9.0-0.nightly-2021-07-16-011609
Server Version: 4.9.0-0.nightly-2021-07-16-011609
- Kubernetes cluster kind:
OpenShift 4.9
Possible Solution
See if the bundle lookup job was the result of a previous `InstallPlan` and recreate it instead of failing.
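For triage, one hedged way to confirm the failure comes from the stale job (rather than a fresh pull problem) is to read the job's `Failed` condition; `<bundle-hash>` is again a placeholder:

```sh
# Show why the existing unpack job is considered failed; on affected
# clusters this prints DeadlineExceeded even though the image is now pullable.
oc -n openshift-marketplace get job <bundle-hash> \
  -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}{"\n"}'
```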
So the expected flow would be: when the subscription is recreated, recreate the unpack job, versus now, where the job stays in a failed state? It would require invalidating the "old" unpack job and configmap, even though they have the same name because the bundle is the same. Potentially we could use a hash of these names instead, or record it as an annotation somewhere.
Seems like we could have an ownership issue between the configmap and the job unpacker; we could revise the ownerRefs to make this more intuitive.
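To see the coupling being discussed, a quick inspection sketch (names are placeholders; the unpack job and its configmap are created with the same hash-derived name because both come from the same bundle):

```sh
# Compare the owner references on the job/configmap pair; revising these
# ownerRefs is what the comment above proposes.
oc -n openshift-marketplace get job <bundle-hash> \
  -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
oc -n openshift-marketplace get configmap <bundle-hash> \
  -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
```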
There is another aspect to this: after a failed install due to a bundle unpack/pull failure, re-attempting to install the operator by removing the `Subscription` and `ClusterServiceVersion` will run into the stale, failed instance of the previous bundle job. OLM will subsequently not attempt to install the operator because it can't overcome the failed bundle. The bundle job should have been removed as soon as the `InstallPlan` or `ClusterServiceVersion` was removed.
This also seems to impact successful installs that are triggered by patching a `Subscription` to point to another `CatalogSource`, as part of testing operator upgrades. It could be a separate issue, but unless, after removing the `Subscription`/`ClusterServiceVersion`, you also remove the succeeded unpack jobs, re-attempting to test an upgrade like this also fails.
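A cleanup sketch for that upgrade-retest scenario, assuming the operator lives in `<my-namespace>`; deleting the succeeded unpack jobs is the extra step being suggested here, not documented OLM behavior:

```sh
# Remove the operator's subscription, CSV, and install plans.
oc -n <my-namespace> delete subscription <my-sub>
oc -n <my-namespace> delete clusterserviceversion <my-operator.v1.0.0>
oc -n <my-namespace> delete installplans --all

# Also remove the old unpack jobs and configmaps, even the succeeded ones,
# so the next attempt does not hit stale unpack artifacts.
# (Broad: this clears every job/configmap in the namespace; delete specific
# names instead if other workloads share it.)
oc -n openshift-marketplace delete jobs --all
oc -n openshift-marketplace delete configmaps --all
```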
I encountered this issue and resolved it by deleting the `Job`s in the operator's namespace, and any `InstallPlan` too. Then I restarted the OLM operator.
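A sketch of that recovery path, assuming an OpenShift 4.x layout where OLM runs in `openshift-operator-lifecycle-manager` (namespaces and names are placeholders):

```sh
# Delete the stale jobs and install plans in the operator's namespace.
oc -n <my-namespace> delete jobs --all
oc -n <my-namespace> delete installplans --all

# Restart the OLM operators so they re-reconcile from a clean slate.
oc -n openshift-operator-lifecycle-manager rollout restart deploy/olm-operator
oc -n openshift-operator-lifecycle-manager rollout restart deploy/catalog-operator
```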
@EronWright did you re-install the operator? Are your steps similar to this KB: https://access.redhat.com/solutions/6459071?
Simply deleting the job works. Thanks @EronWright