Operator failure during etcd node restarts
During planned etcd cluster maintenance (restarting nodes one by one), we caught an operator failure with the following log:
E0219 07:13:05.190439 1 leaderelection.go:356] Failed to update lock: etcdserver: request timed out
I0219 07:13:06.180317 1 leaderelection.go:277] failed to renew lease monitoring/57410f0d.victoriametrics.com: timed out waiting for the condition
{"level":"error","ts":1613718786.180378,"logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tgithub.com/go-logr/[email protected]/zapr.go:128\ngithub.com/VictoriaMetrics/operator/internal/manager.RunManager\n\tgithub.com/VictoriaMetrics/operator/internal/manager/manager.go:189\nmain.main\n\tcommand-line-arguments/main.go:41\nruntime.main\n\truntime/proc.go:204"}
{"level":"error","ts":1613718786.1805472,"logger":"setup","msg":"cannot setup manager","error":"leader election lost","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tgithub.com/go-logr/[email protected]/zapr.go:128\nmain.main\n\tcommand-line-arguments/main.go:43\nruntime.main\n\truntime/proc.go:204"}
Expected behavior: retry/reconnect without operator failure.
Thanks for reporting, need to implement graceful restart.
Hi @f41gh7, we are facing a similar issue:
AWS EKS v1.22.12-eks-6d3986b, Docker image: victoriametrics/operator:v0.28.3, installed with victoria-metrics-k8s-stack v0.12.1
{"level":"info","ts":1663176822.951784,"msg":"Trace[1187934055]: \"Reflector ListAndWatch\" name:k8s.io/[email protected]+incompatible/tools/cache/reflector.go:167 (14-Sep-2022 17:32:49.633) (total time: 53318ms):\nTrace[1187934055]: ---\"Objects listed\" error:<nil> 53318ms (17:33:42.951)\nTrace[1187934055]: [53.318364419s] [53.318364419s] END\n"} {"level":"error","ts":1663176855.1742806,"msg":"error retrieving resource lock monitoring/57410f0d.victoriametrics.com: context deadline exceeded\n","stacktrace":"k8s.io/klog/v2.(*loggingT).printfDepth\n\tk8s.io/klog/[email protected]/klog.go:737\nk8s.io/klog/v2.(*loggingT).printf\n\tk8s.io/klog/[email protected]/klog.go:719\nk8s.io/klog/v2.Errorf\n\tk8s.io/klog/[email protected]/klog.go:1549\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).tryAcquireOrRenew\n\tk8s.io/[email protected]+incompatible/tools/leaderelection/leaderelection.go:330\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1.1\n\tk8s.io/[email protected]+incompatible/tools/leaderelection/leaderelection.go:272\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.poll\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:580\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:545\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:536\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1\n\tk8s.io/[email protected]+incompatible/tools/leaderelection/leaderelection.go:271\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:90\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).renew\n\tk8s.io/[email protected]+incompatible/tools/leaderelection/leaderelection.go:268\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\tk8s.io/[email protected]+incompatible/tools/leaderelection/leaderelection.go:212\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3\n\tsigs.k8s.io/[email protected]/pkg/manager/internal.go:643"} {"level":"info","ts":1663176892.0922132,"msg":"failed to renew lease monitoring/57410f0d.victoriametrics.com: timed out waiting for the condition\n"} {"level":"error","ts":1663176892.092245,"logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"github.com/VictoriaMetrics/operator/internal/manager.RunManager\n\tgithub.com/VictoriaMetrics/operator/internal/manager/manager.go:306\nmain.main\n\t./main.go:41\nruntime.main\n\truntime/proc.go:250"} {"level":"error","ts":1663176892.092287,"logger":"setup","msg":"cannot setup manager","error":"leader election lost","stacktrace":"main.main\n\t./main.go:43\nruntime.main\n\truntime/proc.go:250"}
So if one operator pod wins the leader election and then fails to renew the lease for some reason (mostly caused by a slow or failed response from the apiserver), it exits and the pod restarts. Meanwhile another pod can be elected leader, and the newly started pod rejoins the leader election. This seems to be the normal case for an operator: exit gracefully and do some cleanup work.
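For what it's worth, since the renewals here fail because of slow apiserver responses during etcd maintenance, widening the leader-election timings can let a renewal survive a brief slowdown instead of hitting the deadline. A minimal sketch, assuming controller-runtime's manager options (whether the operator exposes these as flags is a separate question):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	// controller-runtime defaults are 15s/10s/2s; widening them lets a
	// renewal survive a brief apiserver/etcd slowdown instead of giving up.
	leaseDuration := 60 * time.Second
	renewDeadline := 40 * time.Second
	retryPeriod := 5 * time.Second

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "57410f0d.victoriametrics.com", // lease seen in the logs above
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
}
```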
@Haleygo it could be done with an election loop, without exiting the process.
Yeah, it could rejoin the leader election immediately after a failure. But like I said, exiting seems to be the normal case for operators. Also see the comments in controller-runtime; it looks like that is the current best practice. Maybe it prevents some unknown bugs where two leaders end up active (just guessing).
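For reference, the standard kubebuilder-style scaffold looks roughly like this (a sketch, not the operator's actual code): `mgr.Start` returns a "leader election lost" error once the lease cannot be renewed, and `main` exits, which is where the "problem running manager" / "cannot setup manager" lines in the logs above come from.

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

var setupLog = ctrl.Log.WithName("setup")

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "57410f0d.victoriametrics.com",
	})
	if err != nil {
		setupLog.Error(err, "cannot setup manager")
		os.Exit(1)
	}

	// Start blocks; when lease renewal fails it returns "leader election
	// lost" and the process exits, so the pod restart (not an in-process
	// retry) is what re-enters the election.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		setupLog.Error(err, "problem running manager")
		os.Exit(1)
	}
}
```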
@Haleygo I understand your point, most controllers work the same way.
I implemented an election loop in a few of our in-house controllers, for example just to avoid extra alerts during etcd maintenance. The lease logic works entirely at the code level: there is no need to restart containers, and everything works as expected (see the sketch below).
Not a critical bug, just an improvement. It could be closed if you decide to keep the current behavior.
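The election-loop approach mentioned above looks roughly like this with plain client-go (a sketch; the function name, namespace/lease name, and timings are illustrative, and leader-only work must honor the callback's context so two leaders never act at once):

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithElectionLoop keeps the process alive across lost leadership:
// RunOrDie returns once the lease cannot be renewed, and the loop simply
// re-enters the election instead of exiting the container.
func runWithElectionLoop(ctx context.Context, client kubernetes.Interface, id string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "monitoring",
			Name:      "57410f0d.victoriametrics.com",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	for ctx.Err() == nil {
		leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
			Lock:            lock,
			ReleaseOnCancel: true,
			LeaseDuration:   60 * time.Second,
			RenewDeadline:   40 * time.Second,
			RetryPeriod:     5 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				// run must stop all leader-only work when its context is
				// cancelled, otherwise two leaders could act at once.
				OnStartedLeading: run,
				OnStoppedLeading: func() {
					// Lease lost (e.g. apiserver/etcd blip): do not exit the
					// process; the surrounding loop rejoins the election.
				},
			},
		})
	}
}
```

The trade-off versus the controller-runtime default is that correctness now depends on every leader-only goroutine actually stopping on context cancellation, which is exactly the risk the exit-on-loss pattern avoids.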