service-binding-operator
Add LeaseDuration and RenewDeadline timeouts for operator
So... when deploying Service Binding Operator on a bare-metal environment I'm running into THIS issue a lot:
https://github.com/operator-framework/operator-sdk/issues/1813
Why? Because my bare-metal cluster is running hundreds of containers and unfortunately it takes a while to acquire the leader lease. Usually it happens within a few seconds, but when there's heavy HDD usage it can take quite a bit longer.
What a lot of operator maintainers have done is expose the LeaseDuration and RenewDeadline options on the operator so that those of us on bare metal can increase the timeouts.
In this PR I have:
- Added LeaseDuration and RenewDeadline handling to the main.go file
- Set the defaults to 30 / 20 second timeouts
So that we have the ability to change the settings via a ConfigMap / YAML within k8s (see the sketch after the example below):
```yaml
leaderElection:
  leaderElect: false
  resourceName: <example-domain>.io
  leaseDuration: "30s"
  renewDeadline: "20s"
```
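For illustration, the change boils down to something like the following in main.go, using controller-runtime's manager options. This is only a sketch: the flag names, defaults, and election lock name below are placeholders, not necessarily the exact identifiers the PR uses.

```go
// Sketch only: exposing lease timeouts as flags and passing them to the
// controller-runtime manager. Flag names, defaults, and the election ID are
// illustrative placeholders, not the PR's exact identifiers.
package main

import (
	"flag"
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	var leaseDuration, renewDeadline time.Duration
	flag.DurationVar(&leaseDuration, "lease-duration", 30*time.Second,
		"How long non-leader candidates wait before trying to acquire leadership.")
	flag.DurationVar(&renewDeadline, "renew-deadline", 20*time.Second,
		"How long the acting leader keeps retrying to refresh its lease before giving up.")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "example-lock.example.io", // placeholder lock name
		LeaseDuration:    &leaseDuration,            // must stay longer than RenewDeadline
		RenewDeadline:    &renewDeadline,
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered on mgr here.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

Note that client-go's leader election requires the lease duration to stay longer than the renew deadline, so any custom values need to keep LeaseDuration > RenewDeadline (as 30s / 20s does).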
This would have no impact on current functionality, but it would help those who are experiencing a high number of restarts / CrashLoopBackOffs of the service-binding-operator pod:
$ k get pods -A
operators pgo-f96b88c9d-z2nfx 1/1 Running 0 18h
operators service-binding-operator-7795b785b4-wh265 0/1 CrashLoopBackOff 225 (2m30s ago) 23h
And the output here:
{"level":"error","ts":1643209250.576585,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"apiextensions.k8s.io","reconcilerKind":"CustomResourceDefinition","controller":"customresourcedefinition","name":"orders.acme.cert-manager.io","namespace":"","error":"no matches for kind \"Order\" in version \"acme.cert-manager.io/v1alpha2\"","stacktrace":"g$
thub.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:246\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWork$
tem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/workspace/vendor/k8s.io/apimachinery/$
kg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}
E0126 15:00:50.850700 1 leaderelection.go:321] error retrieving resource lock operators/8fa65150.coreos.com: Get "https://10.96.0.1:443/api/v1/namespaces/operators/configmaps/8fa65150.coreos.com": context deadline exceeded
I0126 15:00:50.850783 1 leaderelection.go:278] failed to renew lease operators/8fa65150.coreos.com: timed out waiting for the condition
{"level":"info","ts":1643209250.8508487,"logger":"controller","msg":"Stopping workers","reconcilerGroup":"servicebinding.io","reconcilerKind":"ServiceBinding","controller":"servicebinding"}
{"level":"error","ts":1643209250.8508205,"logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nmain.main\n\t/workspace/main.go:175\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}
{"level":"info","ts":1643209250.8509405,"logger":"controller-runtime.webhook","msg":"shutting down webhook server"}
Which would fix:
Fixes https://github.com/redhat-developer/odo/issues/5396
/retest
Codecov Report
Merging #1097 (9eff114) into master (c40394f) will not change coverage. The diff coverage is n/a.
```
@@           Coverage Diff            @@
##           master    #1097   +/-   ##
========================================
  Coverage   58.08%   58.08%
========================================
  Files          33       33
  Lines        2839     2839
========================================
  Hits         1649     1649
  Misses       1029     1029
  Partials      161      161
```
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c40394f...9eff114. Read the comment docs.
/lgtm
New changes are detected. LGTM label has been removed.
Forgot to sign off the commit, so I pushed the change :)
So that we have the ability to change the settings via a configmap / yaml within k8s:
@cdrage Can you explain how the changes in this PR help to change the settings as you mentioned?
Currently, there is a command-line flag (leader-elect) to enable leader election:
https://github.com/redhat-developer/service-binding-operator/blob/f97bd6808708f7adb9b7f2f5fde8bfd41a91e038/main.go#L87
The current change in your PR allows changing leaseDuration and renewDeadline during build time, as explained here:
https://www.digitalocean.com/community/tutorials/using-ldflags-to-set-version-information-for-go-applications
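For anyone unfamiliar with the build-time approach referenced above: Go's -ldflags "-X" can only overwrite package-level string variables when the binary is linked, so values set that way are baked into the image and cannot be changed later from a ConfigMap. A minimal, hypothetical illustration (the variable names are made up for the example):

```go
// Minimal illustration of -ldflags "-X": the values are fixed at build time, e.g.
//   go build -ldflags "-X main.leaseDuration=60s -X main.renewDeadline=40s"
// Variable names here are hypothetical, not the operator's actual identifiers.
package main

import (
	"fmt"
	"time"
)

var (
	leaseDuration = "30s" // default, overridable only at link time
	renewDeadline = "20s" // default, overridable only at link time
)

func main() {
	ld, err := time.ParseDuration(leaseDuration)
	if err != nil {
		panic(err)
	}
	rd, err := time.ParseDuration(renewDeadline)
	if err != nil {
		panic(err)
	}
	fmt.Printf("leaseDuration=%s renewDeadline=%s\n", ld, rd)
}
```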
To be honest, I'm unsure what else to add to the description / commit messages.
I am adding those two values so I can set the timeout and lease duration when electing a leader. I would still like a leader in my cluster, but due to bare metal, HDDs, and an intermittent network connection, it consistently times out.
With this change, I'm able to increase the duration via ConfigMap so the leader is safely elected.
@baijum any updates?
With this change, I'm able to increase the duration via ConfigMap so the leader is safely elected.
@cdrage Are you referring to the component config mentioned here? https://book.kubebuilder.io/component-config-tutorial/tutorial.html
I am not sure how this ConfigMap is used for configuring the operator. Please share some docs/links which I can read to understand more on this.
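For context, the component-config pattern from the kubebuilder tutorial linked above loads a ControllerManagerConfiguration file, typically mounted into the pod from a ConfigMap, and merges its leaderElection settings (including leaseDuration and renewDeadline) into the manager options. A rough sketch assuming controller-runtime's ConfigFile loader; the flag name and file path are placeholders:

```go
// Sketch of loading manager options (including leaderElection.leaseDuration /
// renewDeadline) from a config file mounted from a ConfigMap, following the
// kubebuilder component-config tutorial. Flag name and path are placeholders.
package main

import (
	"flag"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	var configFile string
	flag.StringVar(&configFile, "config", "",
		"Path to a ControllerManagerConfiguration file (e.g. mounted from a ConfigMap).")
	flag.Parse()

	options := ctrl.Options{}
	if configFile != "" {
		var err error
		// AndFrom merges the file's leaderElection settings into the options.
		options, err = options.AndFrom(ctrl.ConfigFile().AtPath(configFile))
		if err != nil {
			os.Exit(1)
		}
	}

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), options)
	if err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```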
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from baijum by writing /assign @baijum in a comment. For more information see: The Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Still running into this issue unfortunately :( so going to rebase and push my changes.
@cdrage: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
ci/prow/4.12-acceptance | 9eff1148b27e58e7b19ce36ca2682e21b0e3596d | link | true | /test 4.12-acceptance |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.