[proposal] Support for installing Kubernetes apps using Omni

Open smira opened this issue 1 year ago • 12 comments

Rationale

Omni allows a cluster to be defined fully via cluster templates, which install machines, bring them into the cluster, and assert that they are ready and healthy. Cluster templates also allow configuring Talos Linux (and transitively, Kubernetes).

Sometimes there are additional requirements to get the cluster up and running, e.g.:

  • install Cilium CNI (supported way is via Helm)
  • bootstrap the initial Kubernetes app installer/updater (e.g. ArgoCD or Flux), which is also via Helm
  • install some additional applications if the installation is simple, and gitops flow via ArgoCD/Flux is not desired

Today the only way to install Kubernetes apps is by using the Talos machine config "extra bootstrap manifests" feature, but this feature is not based on Helm, so the installed manifests are not tracked as installed by Helm and can't easily be managed by Helm later. It also adds unnecessary bloat to the Talos machine configuration.

Omni is well positioned to manage Kubernetes apps in the cluster: it is a single instance (vs. Talos controlplane machines, of which a cluster can have several), it already has information about cluster health (so it knows when it's safe to install), and it already has a language to describe the cluster (cluster templates).

Proposed Solution

As much as we are not happy with Helm, Helm is the de-facto standard.

For the initial phase, in order to simplify things, let's limit ourselves to the initial installation of Helm charts (skipping upgrades, changing chart values, etc.), as this is simpler, less risky, and solves the immediate problem of fully bootstrapping the cluster. In future work, we might support updating charts as well.

As cluster templates are plain-text YAML files, we should try to preserve this simple approach, which is friendly to version control, expansion, templating, etc. The proposal is to use Helmfile as a language to describe what has to be installed.

We can add a field strategy and force it to be set to bootstrap-only to indicate that right now the charts are installed only once.

The initial scope is to support only charts available to Omni without auth or special setup; that is, Omni should be able to download the charts from public repositories.

Cluster templates should sync the Helm instructions to an Omni resource (per cluster) describing charts to be installed.

Omni should have a controller which watches cluster status and, as soon as the cluster is ready (Kubernetes API is available), performs the Helm installation. Omni keeps the status of the install, and if the install was done and the strategy is bootstrap-only, it skips any further work on this cluster/Helm chart.
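
A minimal sketch of that controller pass, just to make the bootstrap-only behavior concrete. Everything here is hypothetical: the names HelmInstallSpec, HelmInstallStatus and reconcile are made up for illustration and are not Omni resources or APIs, and the actual Helm call is passed in as a plain callable.

```python
from dataclasses import dataclass

BOOTSTRAP_ONLY = "bootstrap-only"  # the only strategy value the proposal allows for now


@dataclass
class HelmInstallSpec:
    """Hypothetical per-cluster resource synced from the cluster template."""
    cluster: str
    helmfile: str                  # raw Helmfile text taken from the template
    strategy: str = BOOTSTRAP_ONLY

    def __post_init__(self):
        # Force the field to bootstrap-only until other strategies exist.
        if self.strategy != BOOTSTRAP_ONLY:
            raise ValueError(f"unsupported strategy {self.strategy!r}")


@dataclass
class HelmInstallStatus:
    """Hypothetical status kept by Omni for one cluster."""
    installed: bool = False


def reconcile(spec, status, kube_api_available, install):
    """Run one controller pass and return what happened as a short string."""
    if status.installed and spec.strategy == BOOTSTRAP_ONLY:
        return "skipped"      # already bootstrapped, never install again
    if not kube_api_available:
        return "waiting"      # cluster not ready yet, try again on the next pass
    install(spec)             # if this raises, status stays unset and the next pass retries
    status.installed = True
    return "installed"


# Usage: three passes over the same cluster.
installs = []
demo_spec = HelmInstallSpec(cluster="demo", helmfile="releases: []")
demo_status = HelmInstallStatus()
print(reconcile(demo_spec, demo_status, False, installs.append))  # waiting
print(reconcile(demo_spec, demo_status, True, installs.append))   # installed
print(reconcile(demo_spec, demo_status, True, installs.append))   # skipped
```

Keeping the "installed" flag in a status that outlives the pass is what makes later passes (including ones triggered by cluster upgrades) a no-op.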

Omni might keep a cache of downloaded Helm charts.

Future Work

  • support updating/upgrading charts
  • support private Helm charts
  • support sharing Helmfile parts across clusters (i.e. enforcing a policy that e.g. cert-manager vX.Y should be installed for all clusters)
  • showing pending updates/scheduling updates, etc.

smira avatar Sep 11 '24 10:09 smira

I like it. My only concern would be Helmfile - I used it for a while on my homelab some years ago, but hit some issues and stopped using it. But I hope it's way better now, as it is actively developed.

My initial idea was to use Flux CD for it, but maybe we leave it to the cluster operators as it is way more complex and CRD based - Helmfile seems to give us the declarative language we need, without entering into the CRDs territory.

Another item in the future work could be, although it is loosely related, some sort of secrets management for these workloads.

utkuozdemir avatar Sep 11 '24 13:09 utkuozdemir

> Another item in the future work could be, although it is loosely related, some sort of secrets management for these workloads.

Yes, there's an issue: #572 .

I think we should support sops, includes and templating in cluster templates (but that deserves a separate issue)

smira avatar Sep 11 '24 14:09 smira

Potential problems:

  • helmfile might shell out to helm and other tools (we would need to implement our own helm integration)

To avoid upgrades for each iteration of helm, the helmfile executable delegates to helm - as a result, helm must be installed.

  • support only Helm charts in Helmfile
  • ask the helmfile devs about a PR to optionally use helm as a library

smira avatar Sep 11 '24 16:09 smira

One random thought I had earlier about the longer-term implementation here. We should take care to design how we'll sync all clusters if we decide to support ongoing rollouts. In the case of, say, 1000+ clusters using this feature, if we sync every 15m or so, we should have some random splay or batching or some other mechanism so that Omni isn't trying to update all 1000+ at once.
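
One way the splay could look, as a rough sketch only: each cluster gets a stable offset inside the sync interval, derived from a hash of its ID, so syncs spread out over the 15 minutes instead of firing together. The function names and the hash-based approach are assumptions for illustration, not anything Omni does today.

```python
import hashlib

SYNC_INTERVAL_SECONDS = 15 * 60  # "every 15m or so"


def splay_offset(cluster_id, interval=SYNC_INTERVAL_SECONDS):
    """Map a cluster ID to a stable offset in [0, interval) seconds."""
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval


def next_sync_time(cluster_id, now, interval=SYNC_INTERVAL_SECONDS):
    """Return the first time at or after `now` that lands on this cluster's slot."""
    offset = splay_offset(cluster_id, interval)
    window_start = (now // interval) * interval
    candidate = window_start + offset
    return candidate if candidate >= now else candidate + interval


# Usage: the same cluster always lands on the same slot within each window.
first = next_sync_time("cluster-a", now=1_000)
second = next_sync_time("cluster-a", now=first + 1)
print(second - first)  # 900
```

A hash-based offset needs no stored state and survives restarts; a purely random delay per pass would also work but would not give a fixed cadence per cluster.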

Totally not for the initial work here, but just wanted to capture it somewhere.

rsmitty avatar Sep 11 '24 19:09 rsmitty

> Totally not for the initial work here, but just wanted to capture it somewhere.

Good point. This should mostly work by design: the controller has a fixed set of worker slots, so the concurrency of the operation is bounded by the number of slots in the controller applying Helmfiles.
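
To illustrate the worker-slot idea (a sketch only, using a plain thread pool as a stand-in for the controller's slots; apply_all and fake_apply are hypothetical names): however many clusters are queued, no more than `slots` applies run at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def apply_all(clusters, apply, slots=4):
    """Apply to every cluster with at most `slots` applies running concurrently."""
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # pool.map keeps the input order in its results.
        return list(pool.map(apply, clusters))


# Usage: record the peak number of simultaneous applies.
lock = threading.Lock()
state = {"running": 0, "peak": 0}


def fake_apply(cluster):
    """Stand-in for applying a Helmfile; only tracks concurrency."""
    with lock:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
    time.sleep(0.01)  # pretend the apply takes a while
    with lock:
        state["running"] -= 1
    return cluster


results = apply_all([f"cluster-{i}" for i in range(20)], fake_apply, slots=4)
print(len(results), state["peak"] <= 4)  # 20 True
```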

smira avatar Sep 12 '24 14:09 smira

I can't say that I like it, but another idea might be to run something like helmfile-controller inside the workload cluster configured by a ConfigMap for example, and Omni simply pushes the ConfigMap, and waits for the controller to do its job.

This might simplify some requirements (e.g. having different versions of the controller), or having access to private helm charts, but it takes away some resources from the workload cluster, the controller has to run with host networking (to install CNI), etc.

smira avatar Sep 13 '24 07:09 smira

> I can't say that I like it, but another idea might be to run something like helmfile-controller inside the workload cluster configured by a ConfigMap for example, and Omni simply pushes the ConfigMap, and waits for the controller to do its job.
>
> This might simplify some requirements (e.g. having different versions of the controller), or having access to private helm charts, but it takes away some resources from the workload cluster, the controller has to run with host networking (to install CNI), etc.

If we decided to go that route, we could use flux instead. I'd rather leave those things to the cluster operator, and do the helmfile part completely from Omni, so the clusters would stay "vanilla".

utkuozdemir avatar Sep 13 '24 13:09 utkuozdemir

Another note regarding the strategy field with bootstrap-only. We should make sure that bootstrap-only only triggers on the initial deployment. It is my understanding that the current inlineManifests options are triggered upon upgrades as well, even though the docs imply it's only on bootstrap. This could be problematic.

kenlasko avatar Oct 21 '24 13:10 kenlasko

What about using flux-aio (also to have cilium there)? Timoni looks like a really good bundler for kubernetes apps.

ghost avatar Nov 10 '24 21:11 ghost

We use timoni charts (with helm) and flux-aio (cilium for talos), which works very well and is easy to deploy. Would love it if omni had a way to create templates (read/write) like devtron/coder.com, where one can create/update/delete the timoni charts (flux) (with SSO and groups integrated, so different users get different namespaces) and see the current state and logs. It would also be great if a solution like kubecost/opencost were integrated, so one can give users access to see their costs, plus an API interface.

https://devtron.ai/product/release-orchestration/ has, for example, a chart store for flux helm; combined with some templates like coder.com's, where one can parameterize the chart, it would be really great.

suse-coder avatar Dec 23 '24 16:12 suse-coder

This is what I use to upgrade cilium with timoni, and it works like a charm (cuelang is awesome).

bundle: {
	apiVersion: "v1alpha1"
	name:       "flux-aio"
	instances: {
		"flux": {
			module: {
				url:     "oci://ghcr.io/stefanprodan/modules/flux-aio"
				version: "2.4.0-2"
			}
			namespace: "flux-system"
			values: {
				controllers: {
					helm: enabled:         true
					kustomize: enabled:    true
					notification: enabled: true
				}
				hostNetwork:     true
				securityProfile: "privileged"
				env: {
					"KUBERNETES_SERVICE_HOST": "localhost"
					"KUBERNETES_SERVICE_PORT": "7445"
				}
			}
		}
		"cilium-hr": {
			module: {
				url: "oci://ghcr.io/stefanprodan/modules/flux-helm-release"
				version: "2.4.0-2"
			}
			namespace: "flux-system"
			values: {
				repository: url: "https://helm.cilium.io/"
				chart: {
					name:     "cilium"
					version:  "1.17.0-rc.2"
				}
				helmValues: {
					daemonSet: {
						updateStrategy: {
							type: "RollingUpdate"
							rollingUpdate: {
								maxUnavailable: 1
							}
						}
					}
					operator: {
						rollOutPods: true
						updateStrategy: {
							type: "RollingUpdate"
						}
					}
					rollOutCiliumPods: true

					ipam: mode:               "kubernetes"
					kubeProxyReplacement:     true
					securityContext: capabilities: {
						ciliumAgent: [
							"CHOWN",
							"KILL",
							"NET_ADMIN",
							"NET_RAW",
							"IPC_LOCK",
							"SYS_ADMIN",
							"SYS_RESOURCE",
							"DAC_OVERRIDE",
							"FOWNER",
							"SETGID",
							"SETUID",
						]
						cleanCiliumState: [
							"NET_ADMIN",
							"SYS_ADMIN",
							"SYS_RESOURCE",
						]
						envoy: [
							"NET_ADMIN",
							"NET_BIND_SERVICE",
							"PERFMON",
							"BPF",
						]
					}

					k8sClientRateLimit: {
						qps:   100
						burst: 200
					}
					cgroup: {
						autoMount: enabled: false
						hostRoot:           "/sys/fs/cgroup"
					}
					k8sServiceHost: "localhost"
					k8sServicePort: 7445
					l2announcements: {
						enabled: true
					}
					envoy: {
						enabled: true
						securityContext: capabilities: {
							keepCapNetBindService: true
						}
					}
					hubble: {
						relay: {
							enabled: true
						}
						ui: {
							enabled: true
						}
					}
					gatewayAPI: {
						enabled: true
					}
				}
				sync: targetNamespace: "kube-system"
			}
		}
	}
}

suse-coder avatar Feb 13 '25 19:02 suse-coder

Timoni Bundles are really well designed (with type support, which helm does not have), very well integrated with fluxcd, and very fast (way faster than argocd).

This is a very well designed template for timoni modules: https://github.com/stefanprodan/podinfo/tree/master/timoni/podinfo

Would be really awesome to have an app store in talos, so one does not have to reinvent the wheel every time, plus some user-based rbac in the web ui.

suse-coder avatar Feb 13 '25 19:02 suse-coder

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Aug 13 '25 02:08 github-actions[bot]

not stale

suse-coder avatar Aug 13 '25 06:08 suse-coder