Description

Introduce leader election to the Gloo component.

Context

Leader election is introduced only when Kubernetes is used as the source of truth for resources.

Technical Debt

This is called out loudly in a code comment in https://github.com/solo-io/gloo/blob/master/projects/gloo/pkg/api/converters/kube/artifact_converter.go, where the debt is incurred.

We need to ignore the configmap (or whatever kube resource maintains the state of the leader) during translation. It is updated on an interval (every 2 seconds), so if Gloo controllers process it, we will continually resync the entire state of the world.

Ideally, we would ignore configmaps with a particular label, but that isn't supported in solo-kit. The faster solution is to ignore it explicitly in code for now and handle the more robust solution in a follow-up.
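As a rough illustration of the explicit ignore described above, the sketch below drops the leader election configmap by name before translation. It is not the code from this PR: the `ConfigMap` struct is a stand-in for the Kubernetes type, and the name `gloo-leader-election` and both helper functions are hypothetical.

```go
package main

import "fmt"

// ConfigMap is a simplified stand-in for the Kubernetes ConfigMap type.
// Only the fields needed for this illustration are included.
type ConfigMap struct {
	Namespace string
	Name      string
	Data      map[string]string
}

// leaderElectionName is a hypothetical name for the configmap that holds
// the leader lock. The real name is not stated in this PR description.
const leaderElectionName = "gloo-leader-election"

// isLeaderElectionResource reports whether cm is the lock resource that
// gets rewritten on every renewal interval.
func isLeaderElectionResource(cm ConfigMap) bool {
	return cm.Name == leaderElectionName
}

// filterForTranslation returns only the configmaps that should be
// translated, so that lock renewals do not trigger a full resync.
func filterForTranslation(cms []ConfigMap) []ConfigMap {
	kept := make([]ConfigMap, 0, len(cms))
	for _, cm := range cms {
		if isLeaderElectionResource(cm) {
			continue
		}
		kept = append(kept, cm)
	}
	return kept
}

func main() {
	all := []ConfigMap{
		{Namespace: "gloo-system", Name: leaderElectionName},
		{Namespace: "gloo-system", Name: "my-artifact"},
	}
	for _, cm := range filterForTranslation(all) {
		fmt.Println(cm.Namespace + "/" + cm.Name)
	}
}
```

A label-based filter would replace the name comparison in `isLeaderElectionResource` once resource filtering is available in solo-kit.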

Follow Up Work

Solo-kit enhancements to resource filtering
Helm hardening around other HA features

Checklist:

[x] I included a concise, user-facing changelog (for details, see https://github.com/solo-io/go-utils/tree/master/changelogutils) which references the issue that is resolved.
[ ] If I updated APIs (our protos) or helm values, I ran make -B install-go-tools generated-code to ensure there will be no code diff
[x] I followed guidelines laid out in the Gloo Edge contribution guide
[x] I opened a draft PR or added the work in progress label if my PR is not ready for review
[x] I have performed a self-review of my own code
[x] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[x] I have added tests that prove my fix is effective or that my feature works

Aug 10 '22 18:08 sam-heilbron

Visit the preview URL for this PR (updated for commit 81a3ac5):

https://gloo-edge--pr6926-ha-part-3-g0d9pb36.web.app

(expires Mon, 22 Aug 2022 18:36:04 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Aug 11 '22 04:08 github-actions[bot]

Issues linked to changelog: https://github.com/solo-io/gloo/issues/5795

Aug 11 '22 17:08 solo-changelog-bot[bot]

before we merge i'd recommend reading https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/ and following best practices here, in particular:

add metrics for leaders
reason about lease expiration, especially in the case of garbage collection (which is a real concern for us as a cpu intensive application)

Aug 15 '22 16:08 kdorosh

> before we merge i'd recommend reading https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/ and following best practices here, in particular:
>
> add metrics for leaders
>
> reason about lease expiration, especially in the case of garbage collection (which is a real concern for us as a cpu intensive application)

Good call. I had looked too quickly and seen that the library exposes metrics, but the default is a no-op. I'll update to include metrics.

Do you think lease expiration should just be configurable, given the differences between users' environments?
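To make the no-op versus real metrics distinction concrete, here is a minimal sketch using only the standard library. It assumes the election library calls a hook when leadership is gained or lost; the `LeaderMetric` interface, `noopMetric`, and `recordingMetric` are hypothetical names for illustration, not the library's API or the code in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// LeaderMetric is a hypothetical hook invoked on leadership changes.
type LeaderMetric interface {
	On(name string)  // called when this instance becomes leader of name
	Off(name string) // called when this instance stops leading name
}

// noopMetric mirrors a default that records nothing.
type noopMetric struct{}

func (noopMetric) On(string)  {}
func (noopMetric) Off(string) {}

// recordingMetric tracks the current leadership state per election and
// counts how many times that state changed.
type recordingMetric struct {
	mu          sync.Mutex
	leading     map[string]bool
	transitions int
}

func newRecordingMetric() *recordingMetric {
	return &recordingMetric{leading: map[string]bool{}}
}

// set records the new state and counts a transition only on a change.
func (m *recordingMetric) set(name string, leading bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.leading[name] != leading {
		m.leading[name] = leading
		m.transitions++
	}
}

func (m *recordingMetric) On(name string)  { m.set(name, true) }
func (m *recordingMetric) Off(name string) { m.set(name, false) }

// IsLeader reports the last recorded state for the named election.
func (m *recordingMetric) IsLeader(name string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.leading[name]
}

// Transitions returns the number of recorded state changes.
func (m *recordingMetric) Transitions() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.transitions
}

func main() {
	rec := newRecordingMetric()
	// Both implementations satisfy the same hook; only one records.
	for _, metric := range []LeaderMetric{noopMetric{}, rec} {
		metric.On("gloo")
	}
	fmt.Println("leader:", rec.IsLeader("gloo"), "transitions:", rec.Transitions())
}
```

In a real deployment the recorded values would be exported through whatever metrics system the component already uses, rather than being kept in memory.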

Aug 15 '22 17:08 sam-heilbron

I think the defaults are sane; I don't think we will hit multi-second issues unless we have a deadlock, kernel errors, or almost total network failure: the kinds of things that should result in a new leader regardless.
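The reasoning about lease timing can be sketched as below. This is an illustration under stated assumptions, not this PR's configuration: the `LeaseConfig` type and its methods are hypothetical, the 15s/10s values are assumed (only the 2 second interval appears in the PR description), and the ordering rule (lease duration > renew deadline > jittered retry period) is the kind of check a leader election library typically enforces.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// LeaseConfig is a hypothetical grouping of leader election timings.
type LeaseConfig struct {
	LeaseDuration time.Duration // how long followers wait before taking over
	RenewDeadline time.Duration // how long the leader keeps trying to renew
	RetryPeriod   time.Duration // interval between renewal attempts
}

// jitterFactor is an assumed multiplier applied to the retry period.
const jitterFactor = 1.2

// defaultLeaseConfig returns assumed values, not confirmed by this PR.
func defaultLeaseConfig() LeaseConfig {
	return LeaseConfig{
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
	}
}

// Validate checks that the three timings are ordered consistently.
func (c LeaseConfig) Validate() error {
	if c.RetryPeriod <= 0 {
		return errors.New("retry period must be positive")
	}
	if c.LeaseDuration <= c.RenewDeadline {
		return errors.New("lease duration must exceed renew deadline")
	}
	if c.RenewDeadline <= time.Duration(jitterFactor*float64(c.RetryPeriod)) {
		return errors.New("renew deadline must exceed jittered retry period")
	}
	return nil
}

// SurvivesPause is a rough approximation: a stall (for example a long
// garbage collection pause) shorter than the renew deadline should not
// by itself cost the leader its lease.
func (c LeaseConfig) SurvivesPause(pause time.Duration) bool {
	return pause < c.RenewDeadline
}

func main() {
	cfg := defaultLeaseConfig()
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("survives 3s pause:", cfg.SurvivesPause(3*time.Second))
	fmt.Println("survives 12s pause:", cfg.SurvivesPause(12*time.Second))
}
```

With timings on this scale, only a stall of many seconds would cause a leadership change, which matches the failure modes listed above.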

Aug 15 '22 17:08 kdorosh

I am content with current state.

Aug 15 '22 19:08 nfuden

Introduce Leader Election for Gloo
