operator-lifecycle-manager
On GKE Autopilot, install jobs keep getting re-created / installation never finishes
Bug Report
What did you do?
Installed OLM via operator-sdk. Installation looked fine. Then I tried to install operators from OperatorHub.io (I tried MongoDB and Strimzi), e.g.:
```shell
kubectl create -f https://operatorhub.io/install/mongodb-operator.yaml
```
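For context, the manifest above creates a Subscription roughly along these lines (the resource name and channel below are illustrative assumptions; the authoritative values are in the manifest itself):

```yaml
# Sketch of the Subscription contained in the operatorhub.io install
# manifest. Name and channel are assumptions for illustration only.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: my-mongodb-kubernetes-operator   # assumption
  namespace: operators
spec:
  channel: stable                        # assumption; the manifest defines the real channel
  name: mongodb-kubernetes-operator      # assumption
  source: operatorhubio-catalog
  sourceNamespace: olm
```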
What did you expect to see?
The operator should be installed.
What did you see instead? Under which circumstances?
OLM keeps scheduling the same install jobs without making progress. The jobs reach Completed status, but the installation never finishes and OLM creates yet another install job:
```shell
% kubectl get pods -n olm
NAME                                                              READY   STATUS      RESTARTS   AGE
085bfd3513b5b01f9aaf9ede153dad8cc0eeb84f1b04c9244a67e1157f5kccx   0/1     Completed   0          13m
catalog-operator-65dcddd547-tfwrb                                 1/1     Running     0          16h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1f47wxn   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1f59f4s   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1f66hsk   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1f72tr6   0/1     Completed   0          15h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1f9gdgh   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1fbwrvn   0/1     Completed   0          13h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1fc8c92   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1fgpkj4   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1fqxv8f   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1frzfbs   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1fsvpsv   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1ft8txp   0/1     Completed   0          13h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1ftnt9m   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1ftsq7d   0/1     Completed   0          13h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1fx5p6t   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1fzlcfb   0/1     Completed   0          12h
e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1fzq9fx   0/1     Completed   0          13h
olm-operator-545b8dc66f-mxlc8                                     1/1     Running     0          16h
operatorhubio-catalog-pg89k                                       1/1     Running     0          10h
packageserver-7445c8c9fd-dqk8j                                    1/1     Running     0          12h
packageserver-7445c8c9fd-zwdsm                                    1/1     Running     0          12h
```
The CSV for the operator is never created:
```shell
% kubectl get csv -n operators
No resources found in operators namespace.
```
OLM keeps trying to install the operator until the Subscription is deleted manually.
Environment
- operator-lifecycle-manager version:
```shell
% kubectl get csv packageserver -n olm
NAME            DISPLAY          VERSION   REPLACES   PHASE
packageserver   Package Server   0.24.0               Succeeded
```
- Kubernetes version information:
```shell
% kubectl version -o yaml
clientVersion:
  buildDate: "2023-01-24T19:42:00Z"
  compiler: gc
  gitCommit: c4dc593360fd87cf0fe27b27e36a4a19b62d90e9
  gitTreeState: dirty
  gitVersion: v1.24.10-dispatcher-dirty
  goVersion: go1.19.5
  major: "1"
  minor: 24+
  platform: darwin/arm64
kustomizeVersion: v4.5.4
serverVersion:
  buildDate: "2023-03-23T11:35:05Z"
  compiler: gc
  gitCommit: fa2714567d7451b5e32541c10119d1246730c197
  gitTreeState: clean
  gitVersion: v1.24.12-gke.500
  goVersion: go1.19.7 X:boringcrypto
  major: "1"
  minor: "24"
  platform: linux/amd64
```
- Kubernetes cluster kind:
GKE in Autopilot mode
Possible Solution
I believe this is related to GKE Autopilot changing the resource requests on the install job pods, causing OLM to think they are out of sync. GKE Autopilot enforces minimum values for vCPU and memory requests, and the install jobs apparently don't satisfy them. Relevant logs:
```shell
% kubectl logs catalog-operator-65dcddd547-tfwrb -n olm
...
time="2023-06-11T07:29:13Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2023-06-11T07:29:13Z" level=info msg=syncing id=iisT/ ip=install-4d4p7 namespace=operators phase=Installing
time="2023-06-11T07:29:13Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
W0611 07:29:13.680783       1 warnings.go:70] child pods are preserved by default when jobs are deleted; set propagationPolicy=Background to remove them or set propagationPolicy=Orphan to suppress this warning
time="2023-06-11T07:29:18Z" level=info msg=syncing id=6WWMV ip=install-4d4p7 namespace=operators phase=Installing
W0611 07:29:18.724847       1 warnings.go:70] Autopilot increased resource requests for Job olm/e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1f31805 to meet requirements. See http://g.co/gke/autopilot-resources
time="2023-06-11T07:29:18Z" level=warning msg="status not equal, updating..." id=6WWMV ip=install-4d4p7 namespace=operators phase=Installing
time="2023-06-11T07:29:18Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2023-06-11T07:29:18Z" level=info msg=syncing id=hTeyM ip=install-4d4p7 namespace=operators phase=Installing
time="2023-06-11T07:29:18Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
W0611 07:29:18.757482       1 warnings.go:70] child pods are preserved by default when jobs are deleted; set propagationPolicy=Background to remove them or set propagationPolicy=Orphan to suppress this warning
time="2023-06-11T07:29:23Z" level=info msg=syncing id=HW0wK ip=install-4d4p7 namespace=operators phase=Installing
W0611 07:29:23.802120       1 warnings.go:70] Autopilot increased resource requests for Job olm/e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1f31805 to meet requirements. See http://g.co/gke/autopilot-resources
time="2023-06-11T07:29:23Z" level=warning msg="status not equal, updating..." id=HW0wK ip=install-4d4p7 namespace=operators phase=Installing
...
```

(Lots more logs just like these.)
Note the lines like:

```shell
W0611 07:29:18.724847       1 warnings.go:70] Autopilot increased resource requests for Job olm/e28e03280525e60d9ec94b4dfc8f9e438f9f856e638c31517b5835cd1f31805 to meet requirements. See http://g.co/gke/autopilot-resources
```
If I'm correct about the cause, a quick fix would be to set resource requests/limits on the install job pods to values that GKE Autopilot won't have to change. See: http://g.co/gke/autopilot-resources
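As a sketch of that quick fix, the install job's container would need requests that already meet Autopilot's minimums, so Autopilot has nothing to mutate. The exact minimums depend on the compute class and are documented at the link above; the values below are assumptions, not verified minimums:

```yaml
# Hypothetical resources stanza for the install job container.
# 250m CPU / 512Mi memory are assumed to satisfy Autopilot's minimums;
# check g.co/gke/autopilot-resources for the current values.
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 250m
    memory: 512Mi
```

(Autopilot generally adjusts limits to match requests as well, so specifying them equal up front avoids another mutation.)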
A more general solution might be to ignore changes to Kubernetes resources that don't materially affect the installation process. I don't know enough about OLM internals to say whether that is technically feasible.
Additional context
None
Same issue found almost a year later:
Is there any update on this? I am facing the same issue as well.