operator-lifecycle-manager

OLM crashes etcd when update fails

Open bo0ts opened this issue 3 years ago • 2 comments

Bug Report

What did you do?

  • Updated the Community Jaeger Operator from version 1.32.0 to 1.33.0 (using the community-operator-index:v4.9) on an OpenShift 4.9 cluster.
  • The installation failed with the error message: install strategy failed: Deployment.apps "jaeger-operator" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/name":"jaeger-operator", "name":"jaeger-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
  • The installation was retried continuously and kept failing until the deployment was manually deleted.

What did you expect to see?

I expected the installation to back off exponentially between attempts and the cluster to remain stable.

What did you see instead? Under which circumstances?

The flood of installation attempts led to etcd timeouts and failures during leader election, which in turn caused multiple restarts of other operators and further failures. The default OpenShift API Priority and Fairness rules did not prevent this from happening.

Environment

  • OKD 4.9.0-0.okd-2022-02-12-140851 (latest 4.9-stable)
  • operator-lifecycle-manager version: 4.9.0-0.okd-2022-02-12-140851
  • Kubernetes Version: v1.22.1-1839+b93fd35dd03051-dirty

bo0ts avatar Apr 19 '22 11:04 bo0ts

When a CSV fails, the error can be marked as either unrecoverable or recoverable. There is a small list of unrecoverable failures, but most are recoverable. To solve this, the unrecoverable list should be updated to include cases where an upgrade attempts to change an immutable field. If OLM doesn't encounter an unrecoverable error when installing the CSV, it will keep trying to install it indefinitely.

Updating an operator that includes a change to an immutable field would require one to remove the existing version of the operator before attempting to install the newer version. Since OLM does patch updates, it cannot successfully install the newer version.

exdx avatar Apr 28 '22 19:04 exdx

@exdx I'm not sure I agree. The immutable field error is a classic and is even part of the troubleshooting documentation. Retrying here is perfectly fine for me, because it is an issue that has to be resolved manually during installation, and that is easy in most cases (just remove the offending object and let the operator installation recreate it, instead of removing the entire operator).

My problem is the way OLM actually retries: it does not back off after multiple failures.

bo0ts avatar May 02 '22 07:05 bo0ts