gloo
Introduce Leader Election for Gloo
Description
Introduce leader election to Gloo component
Context
Introduce leader election if using Kubernetes as the source of truth for resources.
Technical Debt
This is called out loudly in a code comment in https://github.com/solo-io/gloo/blob/master/projects/gloo/pkg/api/converters/kube/artifact_converter.go, where the debt is incurred.
We need to ignore the configmap (or whatever kube resource maintains the state of the leader) during translation. It is updated on an interval (2 seconds), so if Gloo controllers process it, we will continually resync the entire state of the world.
Ideally, we would ignore configmaps with a particular label, but solo-kit doesn't support that. The faster solution is to ignore it explicitly in code for now and handle the more robust solution in a follow-up.
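A minimal sketch of the explicit-ignore approach, assuming the leader state lives in a ConfigMap with a known name. The `configMap` struct, the `leaderElectionConfigMapName` value, and `filterLeaderElection` are hypothetical names for illustration, not the identifiers used in the converter:

```go
package main

import "fmt"

// configMap is a simplified stand-in for a Kubernetes ConfigMap
// (hypothetical; the real converter works with the Kubernetes API types).
type configMap struct {
	Namespace string
	Name      string
}

// leaderElectionConfigMapName is an assumed name for the resource that
// stores the leader state; the actual name may differ.
const leaderElectionConfigMapName = "gloo-leader-election"

// filterLeaderElection returns the input without the leader-election
// ConfigMap, so its periodic renewals never reach translation.
func filterLeaderElection(in []configMap) []configMap {
	out := make([]configMap, 0, len(in))
	for _, cm := range in {
		if cm.Name == leaderElectionConfigMapName {
			continue // renewed on an interval; not user configuration
		}
		out = append(out, cm)
	}
	return out
}

func main() {
	cms := []configMap{
		{Namespace: "gloo-system", Name: "my-artifact"},
		{Namespace: "gloo-system", Name: leaderElectionConfigMapName},
	}
	for _, cm := range filterLeaderElection(cms) {
		fmt.Printf("translating %s/%s\n", cm.Namespace, cm.Name)
	}
}
```

Matching on a hard-coded name is brittle compared with a label selector, which is why the label-based filter remains the preferred long-term fix.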
Follow Up Work
- Solo-kit enhancements to resource filtering
- Helm hardening around other HA features
Checklist:
- [x] I included a concise, user-facing changelog (for details, see https://github.com/solo-io/go-utils/tree/master/changelogutils) which references the issue that is resolved.
- [ ] If I updated APIs (our protos) or helm values, I ran `make -B install-go-tools generated-code` to ensure there will be no code diff
- [x] I followed guidelines laid out in the Gloo Edge contribution guide
- [x] I opened a draft PR or added the work in progress label if my PR is not ready for review
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [x] I have added tests that prove my fix is effective or that my feature works
Visit the preview URL for this PR (updated for commit 81a3ac5):
https://gloo-edge--pr6926-ha-part-3-g0d9pb36.web.app
(expires Mon, 22 Aug 2022 18:36:04 GMT)
🔥 via Firebase Hosting GitHub Action 🌎
Issues linked to changelog: https://github.com/solo-io/gloo/issues/5795
Before we merge, I'd recommend reading https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/ and following the best practices there, in particular:
- add metrics for leaders
- reason about lease expiration, especially in the case of garbage collection (which is a real concern for us as a CPU-intensive application)
Good call. I had looked too quickly and seen that the library exposes metrics, but the default is a noop. I'll update to include metrics.
Do you think lease expiration should just be configurable, given the differences between users' environments?
I think the defaults are sane; I don't think we will hit multi-second issues unless we have deadlock, kernel errors, or almost total network failure, the kinds of things that should result in a new leader regardless.
I am content with current state.