service-binding-operator icon indicating copy to clipboard operation
service-binding-operator copied to clipboard

Add LeaseDuration and RenewDeadline timeouts for operator

Open cdrage opened this issue 3 years ago • 12 comments

So... when deploying Service Binding Operator on a bare-metal environment I'm running into THIS issue a lot:

https://github.com/operator-framework/operator-sdk/issues/1813

Why? Because my bare-metal cluster is running 100's of containers and unfortunately it takes a while to get the leader. Usually it's within a few seconds, but if there is some HDD usage, it'll take a little while.

What a lot of operator maintainers have done, is implemented the LeaseDuration and RenewDeadline commands to the operator so that us bare-metal people can increase the timeouts.

In this PR I have:

  • Added this to the main.go file
  • Set the default 30 / 20 second timeouts.

So that we have the ability to change the settings via a configmap / yaml within k8s:

    leaderElection:
      leaderElect: false
      resourceName: <example-domain>.io
      leaseDuration: "30s"
      renewDeadline: "20s"

This would have no impact on current functionality, but help those who are experiencing high amount of restarts / CrashLoopback's of the service binding operator pod:

$ k get pods -A
operators       pgo-f96b88c9d-z2nfx                                               1/1     Running                 0                 18h
operators       service-binding-operator-7795b785b4-wh265                         0/1     CrashLoopBackOff        225 (2m30s ago)   23h

And the output here:

{"level":"error","ts":1643209250.576585,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"apiextensions.k8s.io","reconcilerKind":"CustomResourceDefinition","controller":"customresourcedefinition","name":"orders.acme.cert-manager.io","namespace":"","error":"no matches for kind \"Order\" in version \"acme.cert-manager.io/v1alpha2\"","stacktrace":"g$
thub.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:246\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWork$
tem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/workspace/vendor/k8s.io/apimachinery/$
kg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}
E0126 15:00:50.850700       1 leaderelection.go:321] error retrieving resource lock operators/8fa65150.coreos.com: Get "https://10.96.0.1:443/api/v1/namespaces/operators/configmaps/8fa65150.coreos.com": context deadline exceeded
I0126 15:00:50.850783       1 leaderelection.go:278] failed to renew lease operators/8fa65150.coreos.com: timed out waiting for the condition
{"level":"info","ts":1643209250.8508487,"logger":"controller","msg":"Stopping workers","reconcilerGroup":"servicebinding.io","reconcilerKind":"ServiceBinding","controller":"servicebinding"}
{"level":"error","ts":1643209250.8508205,"logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nmain.main\n\t/workspace/main.go:175\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}
{"level":"info","ts":1643209250.8509405,"logger":"controller-runtime.webhook","msg":"shutting down webhook server"}

Which would fix:

Fixes https://github.com/redhat-developer/odo/issues/5396

cdrage avatar Jan 27 '22 18:01 cdrage

/retest

pmacik avatar Jan 28 '22 12:01 pmacik

Codecov Report

Merging #1097 (9eff114) into master (c40394f) will not change coverage. The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1097   +/-   ##
=======================================
  Coverage   58.08%   58.08%           
=======================================
  Files          33       33           
  Lines        2839     2839           
=======================================
  Hits         1649     1649           
  Misses       1029     1029           
  Partials      161      161           

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c40394f...9eff114. Read the comment docs.

codecov[bot] avatar Jan 28 '22 14:01 codecov[bot]

/lgtm

baijum avatar Feb 03 '22 01:02 baijum

New changes are detected. LGTM label has been removed.

openshift-ci[bot] avatar Feb 07 '22 14:02 openshift-ci[bot]

Forgot to sign-off the commit, so pushed the change :)

cdrage avatar Feb 07 '22 14:02 cdrage

So that we have the ability to change the settings via a configmap / yaml within k8s:

@cdrage Can you explain how the changes in this PR help to change the settings as you mentioned?

Currently, there is a command-line flag (leader-elect) to enable leader election: https://github.com/redhat-developer/service-binding-operator/blob/f97bd6808708f7adb9b7f2f5fde8bfd41a91e038/main.go#L87

The current change in your PR allows changing leaseDuration and renewDeadline during build time as explained here: https://www.digitalocean.com/community/tutorials/using-ldflags-to-set-version-information-for-go-applications

baijum avatar Feb 15 '22 00:02 baijum

So that we have the ability to change the settings via a configmap / yaml within k8s:

@cdrage Can you explain how the changes in this PR help to change the settings as you mentioned?

Currently, there is a command-line flag (leader-elect) to enable leader election:

https://github.com/redhat-developer/service-binding-operator/blob/f97bd6808708f7adb9b7f2f5fde8bfd41a91e038/main.go#L87

The current change in your PR allows changing leaseDuration and renewDeadline during build time as explained here: https://www.digitalocean.com/community/tutorials/using-ldflags-to-set-version-information-for-go-applications

Unsure what else to add to be honest to the description / commit messages.

I am adding those two values so I can set the timeout and lease duration when electing a leader. I would still like a leader in my node cluster, but due to bare-metal, HDD and intermittent network connection, it consistenly times out.

With this change, I'm able to increase the duration via ConfigMap so the leader is safely elected.

cdrage avatar Feb 24 '22 13:02 cdrage

@baijum any updates?

cdrage avatar Mar 03 '22 13:03 cdrage

With this change, I'm able to increase the duration via ConfigMap so the leader is safely elected.

@cdrage

Are you referring to the component config mentioned here? https://book.kubebuilder.io/component-config-tutorial/tutorial.html

I am not sure how this ConfigMap is used for configuring the operator. Please share some docs/links which I can read to understand more on this.

baijum avatar Mar 03 '22 15:03 baijum

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from baijum by writing /assign @baijum in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Jul 27 '22 14:07 openshift-ci[bot]

Still running into this issue unfortunately :( so going to rebase and push my changes.

cdrage avatar Jul 27 '22 14:07 cdrage

@cdrage: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.12-acceptance 9eff1148b27e58e7b19ce36ca2682e21b0e3596d link true /test 4.12-acceptance

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Aug 16 '22 13:08 openshift-ci[bot]