operator-lifecycle-manager
OLM crashes etcd when update fails
Bug Report
What did you do?
- update Community Jaeger Operator from version 1.32.0 to 1.33.0 (using the community-operator-index:v4.9) on an OpenShift 4.9 cluster
- the installation failed with the error message
  install strategy failed: Deployment.apps "jaeger-operator" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/name":"jaeger-operator", "name":"jaeger-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
- the installation was continuously attempted and kept failing (until the deployment was manually deleted)
What did you expect to see?
I expected the installation attempts to back off exponentially and the cluster to remain stable.
What did you see instead? Under which circumstances?
The flood of installation attempts led to etcd timeouts and failures during leader election, which in turn caused multiple restarts of other operators and further failures. The default OpenShift API Priority and Fairness rules did not prevent this from happening.
Environment
- OKD 4.9.0-0.okd-2022-02-12-140851 (latest 4.9-stable)
- operator-lifecycle-manager version: 4.9.0-0.okd-2022-02-12-140851
- Kubernetes Version: v1.22.1-1839+b93fd35dd03051-dirty
When a CSV fails, there is a way to mark the error as unrecoverable rather than recoverable. There is a small list of unrecoverable failures, but most are recoverable. To solve this, the unrecoverable list should be updated to include cases where an upgrade attempts to change an immutable field. If OLM doesn't encounter an unrecoverable error when installing the CSV, it will always continue to try to install it.
Updating an operator that includes a change to an immutable field would require one to remove the existing version of the operator before attempting to install the newer version. Since OLM does patch updates, it cannot successfully install the newer version.
@exdx I'm not sure I agree. The immutable field error is a classic and even part of the troubleshooting documentation. Retrying here is perfectly fine for me, because it is an issue that has to be resolved manually during installation and can be done easily in most cases (just remove the offending object and let it be recreated by the operator installation, instead of removing the entire operator).
My problem is the way OLM actually retries, and that it does not back off after multiple failures.