
KEDA operator restarted at startup (error retrieving resource lock keda/operator.keda.sh)

Open crisp2u opened this issue 3 years ago • 6 comments

Discussed in https://github.com/kedacore/keda/discussions/2722

Originally posted by vkamlesh, March 7, 2022:

The KEDA operator fails to elect a leader after the keda-operator pod restarts. The restarts are not frequent, but they happen at intervals of a few days (roughly every 6 days).

KEDA version: 2.6.1
Git commit: efca71d6bc770408468a9e1a4b3984f7136c0967
Kubernetes version: v1.20.9
Kubernetes cluster: AKS


bash-3.2$ k get po -n keda
NAME                                      READY   STATUS    RESTARTS   AGE
keda-metrics-apiserver-649f4ddbbd-v4pjp   1/1     Running   0          12d
keda-operator-68ddbdcc8f-6h767            1/1     Running   3          12d
bash-3.2$ 


bash-3.2$ kubectl get --raw "/apis/coordination.k8s.io/v1/namespaces/keda/leases/operator.keda.sh"
{"kind":"Lease","apiVersion":"coordination.k8s.io/v1","metadata":{"name":"operator.keda.sh","namespace":"keda","uid":"edb18fd7-b95e-463f-81cf-6a1010073409","resourceVersion":"135421212","creationTimestamp":"2022-02-23T14:33:32Z","managedFields":[{"manager":"keda","operation":"Update","apiVersion":"coordination.k8s.io/v1","time":"2022-02-23T14:33:32Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:acquireTime":{},"f:holderIdentity":{},"f:leaseDurationSeconds":{},"f:leaseTransitions":{},"f:renewTime":{}}}}]},"spec":{"holderIdentity":"keda-operator-68ddbdcc8f-6h767_53e840b6-5466-484f-a6f6-16978b7ee12c","leaseDurationSeconds":15,"acquireTime":"2022-03-07T12:07:54.000000Z","renewTime":"2022-03-07T17:36:42.260717Z","leaseTransitions":142}}




bash-3.2$ k logs keda-operator-68ddbdcc8f-6h767 -n keda -f -p


1.6466548397264059e+09	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "observationsprocessor-func", "namespace": "platform-api"}
E0307 12:07:33.812275       1 leaderelection.go:330] error retrieving resource lock keda/operator.keda.sh: Get "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/keda/leases/operator.keda.sh": context deadline exceeded
I0307 12:07:33.812329       1 leaderelection.go:283] failed to renew lease keda/operator.keda.sh: timed out waiting for the condition
1.6466548538123553e+09	ERROR	setup	problem running manager	{"error": "leader election lost"}

crisp2u avatar Mar 28 '22 10:03 crisp2u

This is most likely a problem in sigs.k8s.io/controller-runtime as it is responsible for leader election. We should investigate.

zroubalik avatar Mar 29 '22 08:03 zroubalik

I've found this. What puzzles me is that I saw the same error message ("failed to renew lease") on other controllers in the cluster that presumably also use controller-runtime, but they managed to recover. Maybe the default options are too optimistic in KEDA?

crisp2u avatar Mar 30 '22 11:03 crisp2u

Hard to say. Could you please try to tweak those settings on your setup?

zroubalik avatar Apr 01 '22 12:04 zroubalik

@crisp2u @zroubalik Where exactly do we need to tweak values?

vkamlesh avatar Apr 19 '22 11:04 vkamlesh

I'm also seeing this issue, and it's leading to noisy pod-restart alerts in our AKS cluster. We are only running 1 replica of the KEDA operator, but we're currently seeing container restarts ~3-8 times a day due to "leader election lost":

leaderelection.go:367] Failed to update lock: Put ".../api/v1/namespaces/keda/configmaps/operator.keda.sh": context deadline exceeded
leaderelection.go:283] failed to renew lease keda/operator.keda.sh: timed out waiting for the condition
ERROR	setup	problem running manager	{"error": "leader election lost"}

@zroubalik - Presumably you were talking previously about tweaking the lease-related settings? Perhaps there should be a hook in the helm chart for configuring the leasing options: https://github.com/kedacore/keda/blob/dcb9c1e2d157ba3a763acfdfba60d819874d2c16/main.go#L87-L95

wsugarman avatar Jun 08 '22 21:06 wsugarman
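For context, the options in the linked main.go are ordinary sigs.k8s.io/controller-runtime manager options. A sketch of how the lease timings could hypothetically be loosened (LeaseDuration, RenewDeadline, and RetryPeriod are real manager.Options fields; the duration values here are illustrative, not anything KEDA ships). This is a configuration sketch, not runnable outside a cluster:

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Illustrative, relaxed timings (NOT KEDA defaults): tolerate longer
	// API-server stalls before the operator forfeits its lease and exits.
	leaseDuration := 60 * time.Second
	renewDeadline := 40 * time.Second
	retryPeriod := 5 * time.Second

	// ctrl.Options is an alias for manager.Options; these three pointers
	// are its leader-election tuning knobs.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "operator.keda.sh",
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // a real operator would then call mgr.Start(ctrl.SetupSignalHandler())
}
```

Exposing these three values through the Helm chart, as suggested above, would let users with slow or throttled API servers trade slower failover for fewer spurious restarts.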

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 07 '22 23:08 stale[bot]

This issue has been automatically closed due to inactivity.

stale[bot] avatar Aug 14 '22 23:08 stale[bot]