operator-lifecycle-manager
Bundle unpacking doesn't retry after DeadlineExceeded even after resubscribing.
Bug Report
Seen something like this on ~~two~~ many separate occasions now.
What did you do?
- Installed a catalog-source and subscription at the same time.
- Image pull hit the dockerhub rate limits, so the catalog-source bundle didn't come up.
- The `InstallPlan` went into condition `DeadlineExceeded` with: `Failed: Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline`
- Came back later when the catalog-source was ready and recreated the subscription.
- The new `InstallPlan` instantly went into condition `DeadlineExceeded`
- Workaround: After much tinkering, found that deleting the failed job in `openshift-marketplace` and then recreating the subscription allowed the install to proceed (sketch below).
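A minimal sketch of that workaround, assuming the stale unpack job lives in `openshift-marketplace`; `<bundle-hash>`, `<my-sub>`, and `<my-namespace>` are placeholders for the hash-derived job/configmap name and your own resources, and `subscription.yaml` stands in for your original manifest:

```sh
# Find the failed bundle unpack job left over from the previous attempt.
oc -n openshift-marketplace get jobs

# Delete the failed job and its configmap (they share the same hash-derived name).
oc -n openshift-marketplace delete job <bundle-hash>
oc -n openshift-marketplace delete configmap <bundle-hash>

# Recreate the subscription so OLM generates a fresh InstallPlan and unpack job.
oc -n <my-namespace> delete subscription <my-sub>
oc -n <my-namespace> apply -f subscription.yaml
```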
What did you expect to see? When recreating a subscription, retry the bundle lookup even if the previous failed job exists.
What did you see instead? Under which circumstances? When recreating a subscription, artifacts of the previous failed bundle lookup prevent a retry.
Environment
- operator-lifecycle-manager version:
release-4.9
- Kubernetes version information:
oc version
Client Version: 4.9.0-0.nightly-2021-07-16-011609
Server Version: 4.9.0-0.nightly-2021-07-16-011609
- Kubernetes cluster kind:
OpenShift 4.9
Possible Solution
See if the bundle lookup job was the result of a previous `InstallPlan` and recreate it instead of failing.
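For triage, one hedged way to confirm the failure comes from the stale job (rather than a fresh pull problem) is to read the job's `Failed` condition; `<bundle-hash>` is again a placeholder:

```sh
# Show why the existing unpack job is considered failed; on affected
# clusters this prints DeadlineExceeded even though the image is now pullable.
oc -n openshift-marketplace get job <bundle-hash> \
  -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}{"\n"}'
```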
So the expected flow would be: when the subscription is recreated, recreate the unpack job, versus now, where the job stays in a failed state? It would require invalidating the "old" unpack job and configmap, even though they have the same name because the bundle is the same. Potentially we could use a hash of these names instead, or record it as an annotation somewhere.
Seems like we could have an ownership issue between the configmap and the job unpacker; we could revise the ownerRefs to make this more intuitive.
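To see the coupling being discussed, a quick inspection sketch (names are placeholders; the unpack job and its configmap are created with the same hash-derived name because both come from the same bundle):

```sh
# Compare the owner references on the job/configmap pair; revising these
# ownerRefs is what the comment above proposes.
oc -n openshift-marketplace get job <bundle-hash> \
  -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
oc -n openshift-marketplace get configmap <bundle-hash> \
  -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
```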
There is another aspect to this: after a failed install due to a bundle unpack/pull failure, re-attempting to install the operator by removing the `Subscription` and `ClusterServiceVersion` will run into the stale, failed instance of the previous bundle job. OLM will subsequently not attempt to install the operator because it can't overcome the failed bundle. The bundle job should have been removed as soon as the `InstallPlan` or `ClusterServiceVersion` was removed.
This also seems to impact successful installs that are triggered by patching a `Subscription` to point to another `CatalogSource`, as part of testing operator upgrades. It could be a separate issue, but unless, after removing the `Subscription`/`ClusterServiceVersion`, you also remove the succeeded unpack jobs, re-attempting to test an upgrade like this also fails.
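A cleanup sketch for that upgrade-retest scenario, assuming the operator lives in `<my-namespace>`; deleting the succeeded unpack jobs is the extra step being suggested here, not documented OLM behavior:

```sh
# Remove the operator's subscription, CSV, and install plans.
oc -n <my-namespace> delete subscription <my-sub>
oc -n <my-namespace> delete clusterserviceversion <my-operator.v1.0.0>
oc -n <my-namespace> delete installplans --all

# Also remove the old unpack jobs and configmaps, even the succeeded ones,
# so the next attempt does not hit stale unpack artifacts.
# (Broad: this clears every job/configmap in the namespace; delete specific
# names instead if other workloads share it.)
oc -n openshift-marketplace delete jobs --all
oc -n openshift-marketplace delete configmaps --all
```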
I encountered this issue and resolved it by deleting the `Job`s in the operator's namespace, and any `InstallPlan` too. Then I restarted the OLM operator.
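A sketch of that recovery path, assuming an OpenShift 4.x layout where OLM runs in `openshift-operator-lifecycle-manager` (namespaces and names are placeholders):

```sh
# Delete the stale jobs and install plans in the operator's namespace.
oc -n <my-namespace> delete jobs --all
oc -n <my-namespace> delete installplans --all

# Restart the OLM operators so they re-reconcile from a clean slate.
oc -n openshift-operator-lifecycle-manager rollout restart deploy/olm-operator
oc -n openshift-operator-lifecycle-manager rollout restart deploy/catalog-operator
```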
@EronWright did you re-install the operator? Are your steps similar to this KB: https://access.redhat.com/solutions/6459071?
Simply deleting the job works. Thanks @EronWright