Multi-cluster section
Follow-up after #608. See https://github.com/fluxcd/flux2/pull/608#pullrequestreview-553880064
Can include multiple environments, production setup, and promotion.
I agree -- there are several paths for multi-cluster topologies. We should help folks looking to do multi-cluster understand the concepts that can compose for their environment:
- single repo + multi-path vs. branches vs. multi-repo
- flux bootstrap independent clusters on different paths
- remote apply mechanism / kubeConfig constraints
- use a management cluster to reconcile child clusters
- bootstrap flux in a child cluster using a management cluster
- cluster API
- differentiating metrics / alerts
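To make the independent-clusters-on-different-paths option concrete, here is a sketch (repo and path names are illustrative) of the sync manifest that `flux bootstrap` generates when each cluster is bootstrapped with its own `--path`; each cluster then reconciles only its own directory of the shared repo:

```yaml
# clusters/staging/flux-system/gotk-sync.yaml (fragment)
# Generated by e.g.:
#   flux bootstrap github --owner=my-org --repository=fleet-infra \
#     --branch=main --path=clusters/staging
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./clusters/staging   # the production cluster uses --path=clusters/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

Bootstrapping the production cluster with `--path=clusters/production` against the same repo and branch yields an identical manifest apart from the path, which is what keeps the clusters independent.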
For promotion, there are several strategies:
- tagging
- paths
- branching
^ These can be combined and have significant impact on the way the platform is managed and software is released. Should also link out to the notification / alerts docs and observability
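As a sketch of the tagging strategy above (URL and range are illustrative), a production cluster can follow semver release tags while staging tracks the trunk, so cutting a tag is the promotion step:

```yaml
# GitRepository for a production cluster: reconcile the latest release tag.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/my-org/fleet-infra   # hypothetical repo
  ref:
    semver: ">=1.0.0"   # staging would instead use `branch: main`
```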
Will add more detail later
There are two examples in fluxcd org on multi-cluster that show different approaches:
- https://github.com/fluxcd/flux2-kustomize-helm-example
- https://github.com/fluxcd/flux2-multi-tenancy
As a starting point, we could write a doc page that introduces the 2 examples and links to the docs inside each repo.
Tagging the repo commits is a flux v1 feature that still needs to be ported to flux v2.
I think the docs could do a better job on advising how to manage multiple environments and phased promotions between them.
Particularly, should everything be in one branch? The monorepo has two obvious downsides:
- Changes to common components (kustomize base) will immediately deploy to production, rather than being tested first in the staging environment. (There's a workaround whereby shared changes are temporarily refactored into overlay patches for phased testing in a specific environment, but this is onerous and makes promotion error-prone.)
- The merge controls can not distinguish production from non-production. For example, using GitHub to enforce stringent review/approval processes for production will, as a side effect, cause unnecessary friction when making changes in the development environment.
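The overlay-patch workaround mentioned above can be sketched as follows (app and image names are illustrative): the shared change is first expressed as a patch in the staging overlay, and only merged into the base once validated:

```yaml
# staging/kustomization.yaml: trial a change to a shared component
# as an overlay patch before promoting it into ../base.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base
patches:
  - target:
      kind: Deployment
      name: my-app
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/image
        value: my-app:1.3.0-rc1   # candidate version under test in staging
```

Promotion then means moving the patched value into the base and deleting the patch, which is exactly the error-prone manual step the text warns about.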
Having two branches (e.g. production and staging) makes promotions more difficult to perform using git tools. If the branches are totally separate then divergence accumulates, making the repo generally difficult to manage. If the content is factored into a common base and overlays, it is easier to identify and remove divergence between the branch heads, but the histories nonetheless become unrelated (e.g., if a hotfix must be promoted to one app ahead of other already-staged changes for other apps, assuming multiple collaborators use the repo) which makes merges nontrivial.
Should there be a second repo where all of the applications are gathered? For applications represented as helm charts, this is viable (compare bitnami's chart repo) so long as the apps repo has CI to package and release updated charts into an OCI registry, and flux image tag automation is used to propagate releases into appropriate parts of the environments repo. It can also work for kustomize if using a scheme of app-specific release tags (as the flux gitrepo resource can refer to a specific subpath and a specific version pattern). This approach relies on refactoring the apps to minimise what config lives in the env overlay or helmrelease manifest, otherwise environments will still exhibit duplication (and be prone to unintended divergence). The downside is that this requires either intricate CI (for chart packaging) or is reliant on tags for promotions (note tags are less safely preserved than the commit history).
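One note on the mechanics: in flux v2 the version reference lives on the GitRepository, while the subpath lives on the Kustomization that consumes it. A sketch of the app-specific release-tag scheme (repo, app, and tag names are illustrative):

```yaml
# Source: a shared apps repo pinned to an app-specific release tag.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/my-org/apps   # hypothetical shared apps repo
  ref:
    tag: podinfo-1.4.2   # promotion = bumping this tag (e.g. via PR)
---
# Consumer: reconcile only this app's subpath from that source.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 10m
  path: ./podinfo/deploy
  prune: true
  sourceRef:
    kind: GitRepository
    name: apps
```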
Having a separate repo for each application is hardly different from having one separate repo for all apps.
Another option is trunk based development with no separate staging environment, either by:
- Instead relying on additional technologies like flagger to manage phased promotions for individual apps (so deployments are staged within the same cluster as production). An outline for using flagger would be useful, especially if it is the recommended approach?
- Alternatively, using CI to dynamically spin up (and tear down) test environments for feature branches.
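For the flagger option, a minimal Canary sketch (assuming Flagger and a supported mesh/ingress provider are already installed; names and thresholds are illustrative) shows how phased promotion happens inside the production cluster rather than across environments:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  service:
    port: 9898
  analysis:
    interval: 1m        # how often to step the rollout
    threshold: 5        # failed checks before rollback
    maxWeight: 50       # cap canary traffic at 50%
    stepWeight: 10      # shift traffic in 10% increments
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99       # roll back if success rate drops below 99%
        interval: 1m
```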
There are so many common pitfalls, this would be a great place to advise on best practice as in part, the problem originates from flux orchestrating an entire environment from a repo.
I've been working on a reference architecture for using Flux to manage the continuous delivery of Kubernetes infrastructure and applications in multi-cluster, multi-tenant environments.
https://github.com/controlplaneio-fluxcd/d1-fleet
The setup comprises multiple repositories and is designed so that:
- there is a clean separation between platform teams and dev teams, and between infra (cluster add-ons) and apps
- the infra and apps releases can be customised for each environment
- changes to the base overlay do not affect the production clusters
- the staging cluster runs the automation that updates the Helm charts to their latest versions
- promotions of infra and apps from staging to production are done via PRs (merging main into the production branch)
- dev teams get their own repos, which are reconciled under a restricted service account
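The restricted-service-account point can be sketched with a tenant Kustomization (tenant name is illustrative; the RBAC granted to the service account is defined separately): the controller impersonates the service account, so the tenant can only apply what its RBAC allows:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a
  namespace: team-a
spec:
  interval: 5m
  path: ./deploy
  prune: true
  serviceAccountName: team-a   # impersonated SA restricts what may be applied
  sourceRef:
    kind: GitRepository
    name: team-a
```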
@stefanprodan am I correct in understanding that your reference architecture for continuous delivery has the app repos each use separate branches for production and staging? In other words, it adopts the "git flow" branching model documented here and here. Note that both of those references discourage the use of that branching model for modern DevOps and continuous delivery. Is there anywhere that you've addressed those concerns or presented more discussion regarding the choice?
There is a big difference between code and infra: in my setup the main branch contains both overlays (which is never the case with code). Merging into another branch like production, or tagging the main branch with a semver tag, is just a way to say "promote this". The actual work is done in main, so this is trunk-based development, not git flow. One improvement I would like to make is to use the production branch as a canary, synced by a single production cluster, with promotion to the whole fleet happening only after tagging that branch with a semver tag.
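That canary idea could be sketched like this (URL is illustrative): the canary cluster tracks the production branch, while the rest of the fleet follows semver tags cut from that branch once the canary looks healthy:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/my-org/fleet   # hypothetical fleet repo
  ref:
    branch: production    # the single canary production cluster
    # semver: ">=1.0.0"   # the remaining fleet clusters use this instead
```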