
When leader election fails, gloo crashes

kevin-shelaga opened this issue • 6 comments

Gloo Edge Version

1.12.x (latest stable)

Kubernetes Version

1.21.x

Compromise: if gloo is configured correctly, it should not crash when it cannot reach the apiserver. Namely, we should allow candidates who lose leadership to fall back to a follower gracefully. Previously, we fataled to ensure that we never have multiple leaders at once. The downside is that leadership can be lost due to throttling or a network failure with the ApiServer, which may occur intermittently in a Gloo Edge installation. While there are other ways to reduce the chance of these happening, we will change the leadership code to revert to a follower (i.e. stop writing statuses) instead of crashing.

Describe the bug

At this moment we aren't clear on the root cause, but when leader election fails, gloo will crash. etcd and the masters were healthy, but there were resource limits on gloo at the time.

I1019 20:02:47.456285       1 leaderelection.go:248] attempting to acquire leader lease grp-gloo-system/gloo-ee...
I1019 20:03:04.435270       1 leaderelection.go:258] successfully acquired lease grp-gloo-system/gloo-ee
I1019 20:03:12.519136       1 trace.go:205] Trace[120689475]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167 (19-Oct-2022 20:02:47.564) (total time: 24954ms):
Trace[120689475]: ---"Objects listed" 24906ms (20:03:12.470)
Trace[120689475]: [24.954889394s] [24.954889394s] END
{"level":"error","ts":"2022-10-19T20:10:24.668Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer.kubernetes_eds","caller":"kubernetes/eds.go:209","msg":"upstream grp-pc-claims-capabilities-loss-report.v1-review-update-postman-54655: port 8080 not found for service v1-review-update-postman-54655","version":"1.12.28","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).List\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/gloo/pkg/plugins/kubernetes/eds.go:209\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func1\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/gloo/pkg/plugins/kubernetes/eds.go:230\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func2\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/gloo/pkg/plugins/kubernetes/eds.go:257"}
{"level":"error","ts":"2022-10-19T20:22:13.930Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer.kubernetes_eds","caller":"kubernetes/eds.go:209","msg":"upstream grp-pc-auto-3p-reports.v1-review-wiremock-testing-76192: port 8080 not found for service v1-review-wiremock-testing-76192","version":"1.12.28","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).List\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/gloo/pkg/plugins/kubernetes/eds.go:209\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func1\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/gloo/pkg/plugins/kubernetes/eds.go:230\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func2\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/gloo/pkg/plugins/kubernetes/eds.go:257"}
E1019 20:22:41.912708       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
E1019 20:22:44.902885       1 leaderelection.go:330] error retrieving resource lock grp-gloo-system/gloo-ee: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/grp-gloo-system/leases/gloo-ee": context deadline exceeded
I1019 20:22:44.902969       1 leaderelection.go:283] failed to renew lease grp-gloo-system/gloo-ee: timed out waiting for the condition
{"level":"error","ts":"2022-10-19T20:22:44.902Z","logger":"gloo-ee","caller":"kube/factory.go:61","msg":"Stopped Leading","version":"1.12.28","stacktrace":"github.com/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/[email protected]/pkg/bootstrap/leaderelector/kube/factory.go:61\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:213"}
{"level":"fatal","ts":"2022-10-19T20:22:44.903Z","caller":"setup/setup.go:47","msg":"lost leadership, quitting app","stacktrace":"github.com/solo-io/solo-projects/projects/gloo/pkg/setup.Main.func3\n\t/workspace/solo-projects/projects/gloo/pkg/setup/setup.go:47\ngithub.com/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/[email protected]/pkg/bootstrap/leaderelector/kube/factory.go:62\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:213"}

Steps to reproduce the bug

N/A

Expected Behavior

leader election failure shouldn't cause a crash

Additional Context

No response

kevin-shelaga avatar Oct 20 '22 12:10 kevin-shelaga

After a sync with @yuval-k @kevin-shelaga @nrjpoddar @EItanya we've decided to do the following:

  • Expose configuration to opt out of leader election. It will be enabled by default, but this would allow users who run only a single replica of gloo to disable it.
  • Allow candidates who lose leadership to fall back to a follower gracefully. Previously, we fataled to ensure that we never have multiple leaders at once. The downside is that leadership can be lost due to throttling or a network failure with the ApiServer, which may occur intermittently in a Gloo Edge installation. While there are other ways to reduce the chance of these happening, we will change the leadership code to revert to a follower (i.e. stop writing statuses) instead of crashing; see the sketch after this list.
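
A minimal sketch of what that fallback could look like on top of client-go's leader election (the same k8s.io/client-go/tools/leaderelection package visible in the stack traces above). The package name, namespace/lease name, identity plumbing, and the isLeader flag are placeholders for illustration, not Gloo's actual wiring:

```go
// Package election sketches the proposed behavior: demote to follower and
// rejoin the election instead of fataling when the lease is lost.
package election

import (
	"context"
	"log"
	"sync/atomic"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// isLeader gates status writes: true = leader, false = follower.
var isLeader atomic.Bool

// RunElection keeps a candidate in the election loop for the life of ctx.
// Losing the lease demotes the pod to a follower and re-enters the race,
// rather than exiting the process as the current fatal handler does.
func RunElection(ctx context.Context, client kubernetes.Interface, identity string) {
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"gloo-system", "gloo", // placeholder namespace and lease name
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: identity},
	)
	if err != nil {
		log.Fatalf("building resource lock: %v", err)
	}

	for ctx.Err() == nil {
		// RunOrDie blocks until leadership is lost or ctx is cancelled, so
		// looping here rejoins the election as a follower instead of exiting.
		leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
			Lock:          lock,
			LeaseDuration: 15 * time.Second,
			RenewDeadline: 10 * time.Second,
			RetryPeriod:   2 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) {
					isLeader.Store(true) // safe to write statuses again
				},
				OnStoppedLeading: func() {
					// Previously this path fataled ("lost leadership, quitting app").
					// Demoting keeps the pod translating and serving config,
					// just without status writes.
					isLeader.Store(false)
				},
			},
		})
	}
}
```

Any syncer that writes statuses would then check isLeader before writing, so a demoted pod keeps translating and serving configuration but stays quiet on status updates until it wins the lease again.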

sam-heilbron avatar Oct 20 '22 14:10 sam-heilbron

@sam-heilbron it looks like the leaseholder is incorrect and doesn't get updated during these crashes

kevin-shelaga avatar Oct 20 '22 17:10 kevin-shelaga

  • Allow candidates who lose leadership to fall back to a follower gracefully. Previously, we fataled to ensure that we never have multiple leaders at once. The downside is that leadership can be lost due to throttling or a network failure with the ApiServer, which may occur intermittently in a Gloo Edge installation. While there are other ways to reduce the chance of these happening, we will change the leadership code to revert to a follower (i.e. stop writing statuses) instead of crashing.

Part 1 is complete and released in 1.13 and 1.12 EE. The second part has yet to be done.

sam-heilbron avatar Nov 05 '22 01:11 sam-heilbron

This error also appears when we install gloo, following the documentation, as ingress controller only. The error is:

E1115 10:33:09.618789       1 leaderelection.go:330] error retrieving resource lock gloo-system/gloo: leases.coordination.k8s.io "gloo" is forbidden: User "system:serviceaccount:gloo-system:gloo" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "gloo-system"

This is the case because the role granting access to leases is only created when the gateway is enabled.

davinkevin avatar Nov 15 '22 10:11 davinkevin

Just for transparency,

HA for the Gloo pod (translation, serving translated configuration to the gateway, and admission validation for new resources) has been working since 1.12.32.

SantoDE avatar Jan 25 '23 15:01 SantoDE

@sam-heilbron should this issue be closed now?

DoroNahari avatar Feb 13 '23 10:02 DoroNahari

This will be fixed in 1.17.0

davidjumani avatar Jun 19 '24 23:06 davidjumani