Introduce Leader Election for Gloo

Open sam-heilbron opened this issue 3 years ago • 2 comments

Description

Introduce leader election to the Gloo component

Context

Introduce leader election if using Kubernetes as the source of truth for resources.

Technical Debt

This is called out loudly in a code comment in https://github.com/solo-io/gloo/blob/master/projects/gloo/pkg/api/converters/kube/artifact_converter.go, where the debt is incurred.

We need to ignore the configmap (or whatever kube resource maintains the state of the leader) during translation. It is updated on an interval (every 2 seconds), so if it's processed by Gloo controllers, we will continually resync the entire state of the world.

Ideally, we would ignore configmaps with a particular label, but that isn't supported in solo-kit. The faster solution is to ignore it explicitly in code for now and handle the more robust solution in a follow-up.
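
As a rough illustration of the "ignore it explicitly in code" approach, here is a minimal sketch. It is not the code in this PR: the `ConfigMap` struct is a simplified stand-in for the Kubernetes type, and the lock name `gloo-leader-lock` and both function names are hypothetical.

```go
package main

// ConfigMap is a simplified stand-in for the Kubernetes ConfigMap type.
// Only the fields needed for this sketch are included.
type ConfigMap struct {
	Namespace string
	Name      string
}

// leaderElectionConfigMapName is a hypothetical name for the resource that
// holds the leader lock; the real name may differ.
const leaderElectionConfigMapName = "gloo-leader-lock"

// isLeaderElectionConfigMap reports whether cm is the leader lock resource.
func isLeaderElectionConfigMap(cm ConfigMap) bool {
	return cm.Name == leaderElectionConfigMapName
}

// filterTranslatable returns the ConfigMaps that should reach translation,
// dropping the leader lock so its periodic updates do not trigger a resync.
func filterTranslatable(cms []ConfigMap) []ConfigMap {
	kept := make([]ConfigMap, 0, len(cms))
	for _, cm := range cms {
		if isLeaderElectionConfigMap(cm) {
			continue
		}
		kept = append(kept, cm)
	}
	return kept
}
```

A label-based filter would replace the name comparison with a label lookup once resource filtering supports it.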

Follow Up Work

  1. Solo-kit enhancements to resource filtering
  2. Helm hardening around other HA features

Checklist:

  • [x] I included a concise, user-facing changelog (for details, see https://github.com/solo-io/go-utils/tree/master/changelogutils) which references the issue that is resolved.
  • [ ] If I updated APIs (our protos) or helm values, I ran make -B install-go-tools generated-code to ensure there will be no code diff
  • [x] I followed guidelines laid out in the Gloo Edge contribution guide
  • [x] I opened a draft PR or added the work in progress label if my PR is not ready for review
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [x] I have added tests that prove my fix is effective or that my feature works

sam-heilbron avatar Aug 10 '22 18:08 sam-heilbron

Visit the preview URL for this PR (updated for commit 81a3ac5):

https://gloo-edge--pr6926-ha-part-3-g0d9pb36.web.app

(expires Mon, 22 Aug 2022 18:36:04 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

github-actions[bot] avatar Aug 11 '22 04:08 github-actions[bot]

Issues linked to changelog: https://github.com/solo-io/gloo/issues/5795

solo-changelog-bot[bot] avatar Aug 11 '22 17:08 solo-changelog-bot[bot]

before we merge i'd recommend reading https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/ and following best practices here, in particular:

  • add metrics for leaders
  • reason about lease expiration, especially in the case of garbage collection (which is a real concern for us as a cpu intensive application)
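
To make the metrics suggestion concrete, here is a minimal sketch of a leader metric recorder. It is an assumption-laden illustration rather than the election library's API: the `leaderMetrics` type and its `On`/`Off` methods are hypothetical names, and a real implementation would export these values through the project's metrics library instead of holding them in memory.

```go
package main

import "sync"

// leaderMetrics is a hypothetical in-memory recorder of leadership state,
// keyed by election name.
type leaderMetrics struct {
	mu          sync.Mutex
	isLeader    map[string]bool
	transitions map[string]int
}

func newLeaderMetrics() *leaderMetrics {
	return &leaderMetrics{
		isLeader:    make(map[string]bool),
		transitions: make(map[string]int),
	}
}

// On records that this replica now leads the named election. A transition
// is counted only when the replica was not already the leader.
func (m *leaderMetrics) On(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.isLeader[name] {
		m.transitions[name]++
	}
	m.isLeader[name] = true
}

// Off records that this replica no longer leads the named election.
func (m *leaderMetrics) Off(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.isLeader[name] = false
}

// IsLeader returns the last recorded leadership state for name.
func (m *leaderMetrics) IsLeader(name string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.isLeader[name]
}

// Transitions returns how many times this replica has gained leadership.
func (m *leaderMetrics) Transitions(name string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.transitions[name]
}
```

A rising transition count on a single replica, or more than one replica reporting leadership at once, would be the signals worth alerting on.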

kdorosh avatar Aug 15 '22 16:08 kdorosh

Good call. I had looked too quickly and seen that the library exposes metrics, but the default is a noop. I'll update to include metrics.

Do you think lease expiration should just be configurable, given the differences between users' environments?

sam-heilbron avatar Aug 15 '22 17:08 sam-heilbron

I think the defaults are sane; i don't think we will hit multi-second issues unless we have deadlock, kernel errors, or almost total network failure: the kinds of things that should result in a new leader regardless

kdorosh avatar Aug 15 '22 17:08 kdorosh

I am content with current state.

nfuden avatar Aug 15 '22 19:08 nfuden