☂️ Introduce Gardener Node Agent (a.k.a. replace `cloud-config-downloader`)
How to categorize this issue?
/area quality robustness /kind enhancement epic
What would you like to be added:
Replace the `cloud-config-downloader`/`executor`, which are written in Bash, with a binary written in Go which is a Kubernetes controller based on controller-runtime.
This was implemented up to a working state during the Gardener Hackathon 2023 in Leverkusen: https://github.com/gardener-community/hackathon
Why is this needed:
With the new architecture we gain a lot; the most important gains are described here.
Developer Productivity
Because we all develop in Go day by day, writing business logic in Bash is difficult, hard to maintain, and almost impossible to test. Getting rid of almost all Bash scripts currently in use for this very important part of the cluster creation process will speed up adding new features and fixing bugs.
Speed
Until now, the `cloud-config-downloader` runs in a loop every 60s to check whether something changed on the shoot which requires modifications on the worker node. This produces a lot of unneeded traffic on the API server and wastes time; it can take up to 60s until a desired modification is started on the worker node.
By using controller-runtime we can watch the node, the OSC in its secret, and the shoot access token in its secret. If any of these objects changes, and only then, the required action takes effect immediately. This speeds up operations and dramatically reduces the load on the shoot's API server.
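To illustrate the watch-based approach, here is a minimal, hypothetical sketch (not the actual `gardener-node-agent` code; the reconciler, secret names, and predicate are assumptions, and the controller-runtime v0.15+ builder API is assumed) of how a controller could react immediately to changes instead of polling:

```go
// Illustrative sketch only; reconciler and field names are made up for this example.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type nodeConfigReconciler struct {
	client client.Client
	// Names of the secrets containing the OSC and the shoot access token,
	// taken from the component configuration.
	oscSecretName, tokenSecretName string
}

func (r *nodeConfigReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Apply the OSC / refresh the access token as soon as a watched object changes.
	return reconcile.Result{}, nil
}

func addToManager(mgr manager.Manager, r *nodeConfigReconciler) error {
	return builder.ControllerManagedBy(mgr).
		Named("node-config").
		// Watch the node object itself ...
		For(&corev1.Node{}).
		// ... and the secrets with the OSC and the shoot access token, so that changes
		// take effect immediately instead of waiting for the next 60s polling interval.
		Watches(&corev1.Secret{}, &handler.EnqueueRequestForObject{}, builder.WithPredicates(
			predicate.NewPredicateFuncs(func(obj client.Object) bool {
				return obj.GetName() == r.oscSecretName || obj.GetName() == r.tokenSecretName
			}),
		)).
		Complete(r)
}
```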
Scalability
Currently, the `cloud-config-downloader` adds a random wait time before restarting the `kubelet` in case the `kubelet` was updated or its configuration was changed. This is required to reduce the load on the API server and the traffic on the internet uplink. It also reduces the overall downtime of the services in the cluster, because every `kubelet` restart puts a node into `NotReady` state for several seconds, which can interrupt service availability.
The decision was made to keep the existing jitter mechanism, which calculates the `kubelet-download-and-restart-delay-seconds` on the controller itself.
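As a rough illustration of the jitter idea (a hedged sketch, not the real implementation; the window size and names are assumptions):

```go
// Minimal sketch of the jitter idea: every node waits a random amount of time within a
// bounded window before restarting the kubelet, so restarts do not all happen at once.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// kubeletRestartDelay returns a random delay in [0, maxDelay).
func kubeletRestartDelay(maxDelay time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(maxDelay)))
}

func main() {
	delay := kubeletRestartDelay(5 * time.Minute)
	fmt.Printf("waiting %s before restarting the kubelet\n", delay.Round(time.Second))
	time.Sleep(delay)
	// ... then restart the kubelet systemd unit
}
```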
Correctness
The configuration of the `cloud-config-downloader` is currently done by placing a file for every configuration item on the disk of the worker node. This was done because parsing the content of a single file and using it as a value in Bash reduces to something like `VALUE=$(cat /the/path/to/the/file)`. Simple, but it lacks validation, type safety, and so on.
With the `gardener-node-agent` we introduce a new API which is stored in the `gardener-node-agent` secret and persisted on disk in a single YAML file for comparison with the previously known state. This brings all the benefits of type-safe configuration. Because the actual and the previous configuration are compared, files and units that were removed from the OSC are also removed and stopped on the worker.
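A minimal sketch of that comparison logic (illustrative only; the persisted state type and file paths are assumptions, not the real `gardener-node-agent` API):

```go
// Illustrative sketch: the current OSC state is compared with the last applied state
// persisted on disk as YAML, so that files/units removed from the OSC can be deleted
// and stopped on the node.
package example

import (
	"os"

	"sigs.k8s.io/yaml"
)

// appliedState stands in for the type-safe representation of the last applied OSC.
type appliedState struct {
	Files []string `json:"files"`
	Units []string `json:"units"`
}

func loadPreviousState(path string) (*appliedState, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return &appliedState{}, nil // nothing applied yet
	} else if err != nil {
		return nil, err
	}
	state := &appliedState{}
	return state, yaml.Unmarshal(data, state)
}

func persistState(path string, state *appliedState) error {
	data, err := yaml.Marshal(state)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

// removedItems returns everything present in the previous state but no longer in the
// current one (used for both files and units).
func removedItems(previous, current []string) []string {
	currentSet := make(map[string]struct{}, len(current))
	for _, c := range current {
		currentSet[c] = struct{}{}
	}
	var removed []string
	for _, p := range previous {
		if _, ok := currentSet[p]; !ok {
			removed = append(removed, p)
		}
	}
	return removed
}
```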
Availability
Previously, the `cloud-config-downloader` simply restarted the `systemd` units on every change to the OSC, regardless of which of the services changed. The `gardener-node-agent` first checks which `systemd` units were changed and only restarts those. This removes unneeded `kubelet` restarts.
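A small, hypothetical sketch of this "restart only what changed" behavior (unit names and the systemd abstraction are assumptions):

```go
// Illustrative sketch: compare the rendered unit contents of the previous and current
// OSC and restart only the units that actually changed.
package example

import (
	"context"
	"fmt"
)

// unitRestarter abstracts the systemd interaction (e.g. a dbus-based client).
type unitRestarter interface {
	Restart(ctx context.Context, unitName string) error
}

// changedUnits returns the names of units whose content differs between the two maps
// (unit name -> rendered unit file content).
func changedUnits(previous, current map[string]string) []string {
	var changed []string
	for name, content := range current {
		if previous[name] != content {
			changed = append(changed, name)
		}
	}
	return changed
}

func restartChanged(ctx context.Context, r unitRestarter, previous, current map[string]string) error {
	for _, unit := range changedUnits(previous, current) {
		fmt.Println("restarting changed unit", unit)
		if err := r.Restart(ctx, unit); err != nil {
			return err
		}
	}
	return nil
}
```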
Goals
- Create a new shoot which only uses the new `gardener-node-agent`.
- Update existing workers from `cloud-config-downloader` to `gardener-node-agent`.
- Keep the same behavior when applying the OSC as before.
Tasks (basic functionality)
- Initial scaffolding and introduction of the new component
- [x] https://github.com/gardener/gardener/pull/8249
- [x] https://github.com/gardener/gardener/pull/8627
- Business Logic
- [x] https://github.com/gardener/gardener/pull/8629
- [x] https://github.com/gardener/gardener/pull/8632
- [x] https://github.com/gardener/gardener/pull/8683
- [x] https://github.com/gardener/gardener/pull/8678
- [x] https://github.com/gardener/gardener/pull/8707
- [x] https://github.com/gardener/gardener/pull/8760
- [x] https://github.com/gardener/gardener/pull/8814
- [x] https://github.com/gardener/gardener/pull/8827
- [x] https://github.com/gardener/gardener/pull/8825
- [x] https://github.com/gardener/gardener/pull/8838
- [x] https://github.com/gardener/gardener/pull/8826
- [x] https://github.com/gardener/gardener/pull/8839
- [x] https://github.com/gardener/gardener/pull/8767
- [x] https://github.com/gardener/gardener/pull/8893
- [x] https://github.com/gardener/gardener/pull/8901
- [x] https://github.com/gardener/gardener/pull/8894
- [x] https://github.com/gardener/gardener/pull/8898
- [x] https://github.com/gardener/gardener/pull/8907
- [x] https://github.com/gardener/gardener/pull/8919
- [x] https://github.com/gardener/gardener/pull/9096
- Bootstrapping `gardener-node-agent`
  - [x] https://github.com/gardener/gardener/pull/8726
- `Seed`/`Shoot` Controllers Adaptation
  - [x] https://github.com/gardener/gardener/pull/8756
  - [x] https://github.com/gardener/gardener/pull/8836
  - [x] https://github.com/gardener/gardener/pull/8847
  - [x] https://github.com/gardener/gardener/pull/9073
- Adapt OS Extensions (check if they manipulate the `cloud-config-downloader` unit (sample) - if they do, then try to remove these or enable `gardener-node-init` instead): https://github.com/gardener/gardener/pull/8647
  - [x] `provider-local`: https://github.com/gardener/gardener/pull/8647 and https://github.com/gardener/gardener/pull/8834
  - [x] `os-gardenlinux`: https://github.com/gardener/gardener-extension-os-gardenlinux/pull/130, released with `v0.22.0`
  - [x] `os-suse-chost`: https://github.com/gardener/gardener-extension-os-suse-chost/pull/112, released with `v1.24.0`
  - [x] `os-coreos`: https://github.com/gardener/gardener-extension-os-coreos/pull/80, released with `v1.19.0`
  - [x] `os-ubuntu`: https://github.com/gardener/gardener-extension-os-ubuntu/pull/99, released with `v1.24.0`
  - [x] `os-metal`: https://github.com/metal-stack/os-metal-extension/pull/33, released with `v?` (PR is ready, but no need to wait for it as part of this umbrella issue, so let's consider it done)
  - [x] ~~`os-k3os`~~ (not needed since it was never used anywhere and is outdated anyways)
- Miscellaneous
  - [x] https://github.com/gardener/gardener/pull/7993 (this was found missing during the implementation of the `gardener-node-agent`)
  - [x] Documentation: https://github.com/gardener/gardener/pull/8249
  - [x] ~~Extra build job to publish `gardener-node-agent` and potentially `kubelet`/`kubectl` as OCI Image, see below section~~ (not needed since we decided to use native `containerd` functionality for extracting binaries from images, see https://github.com/gardener/gardener/pull/8678)
  - [x] ~~Build `gardener-node-agent` container image in Prow (multi-arch!)~~ (we decided to drop Prow builds, see https://github.com/gardener/ci-infra/pull/981 and https://github.com/gardener/gardener/pull/8740 for details)
  - [x] https://github.com/gardener/gardener/pull/8885
  - [x] https://github.com/gardener/gardener/pull/8857
  - [x] https://github.com/gardener/gardener/pull/8786
  - [x] https://github.com/gardener/gardener/pull/9048 (follow-up of https://github.com/gardener/gardener/pull/8905)
  - [x] https://github.com/gardener/gardener/pull/9077
  - [x] https://github.com/gardener/gardener/pull/9114
  - [x] https://github.com/gardener/gardener/pull/9280
- Graduation & Rollout
  - [x] Introduce new alpha `UseGardenerNodeAgent` feature gate in `gardenlet` controlling whether `gardener-node-{init,agent}` should be used: https://github.com/gardener/gardener/pull/8647
  - [x] https://github.com/gardener/gardener/pull/9161
  - [x] https://github.com/gardener/gardener/pull/9208
  - [ ] Remove `UseGardenerNodeAgent` feature gate (planned for `v1.92`)
    - Cleanup migration code
    - Remove `nerdctl` from local OS image
    - Announce that correct versions of OS extensions are required: `[email protected]+`, `[email protected]+`, `[email protected]+`, `[email protected]+`, `[email protected]+`
Fixes https://github.com/gardener/gardener/issues/4673
OCI Image Support
~~A decision was made to go ahead with the current approach by extracting the binaries (kubelet, gardener-node-agent) out of docker images as currently implemented. Down the way towards GA of gardener-node-agent, the OCI artifact support is implemented and right before GA, this support is merged. The existing docker images of all still supported kubernetes/kubelet versions need to be pushed as OCI artifact and of course the binary of the gardener-node-agent.~~
Update: Not needed since we decided to use native `containerd` functionality for extracting binaries from images, see https://github.com/gardener/gardener/pull/8678.
No modifications are made to cloud-config-downloader to support OCI artifacts.
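For reference, a heavily hedged sketch of what the native `containerd` approach could look like (this is not the code from the PR above; the client calls, namespace, socket path, and file paths are assumptions based on the public containerd Go client):

```go
// Sketch only: pull an image with containerd, mount a read-only view of its root
// filesystem, and copy a single binary out of it to the host.
package example

import (
	"context"
	"io"
	"os"
	"path/filepath"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/mount"
	"github.com/containerd/containerd/namespaces"
	"github.com/opencontainers/image-spec/identity"
)

// extractBinary pulls imageRef and copies pathInImage from its root filesystem to
// targetPath. Reference and paths are illustrative.
func extractBinary(ctx context.Context, imageRef, pathInImage, targetPath string) error {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		return err
	}
	defer client.Close()

	ctx = namespaces.WithNamespace(ctx, "k8s.io")

	// Pull and unpack the image so its layers are available in the snapshotter.
	image, err := client.Pull(ctx, imageRef, containerd.WithPullUnpack)
	if err != nil {
		return err
	}

	// Create a read-only view snapshot of the image's root filesystem.
	diffIDs, err := image.RootFS(ctx)
	if err != nil {
		return err
	}
	snapshotter := client.SnapshotService(containerd.DefaultSnapshotter)
	viewKey := "extract-" + filepath.Base(pathInImage)
	mounts, err := snapshotter.View(ctx, viewKey, identity.ChainID(diffIDs).String())
	if err != nil {
		return err
	}
	defer snapshotter.Remove(ctx, viewKey)

	// Mount the view to a temporary directory and copy the binary out.
	mountDir, err := os.MkdirTemp("", "image-root-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(mountDir)

	if err := mount.All(mounts, mountDir); err != nil {
		return err
	}
	defer mount.UnmountAll(mountDir, 0)

	src, err := os.Open(filepath.Join(mountDir, pathInImage))
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.OpenFile(targetPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	defer dst.Close()

	_, err = io.Copy(dst, src)
	return err
}
```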
:+1: for this stuff!
I would like to propose another future goal, which can be turned into its own issue after this one is resolved. The idea is to drop the OSC with the `-original` suffix (purpose `reconcile`) and migrate its contents into the `NodeAgentConfiguration`. The `-downloader` OSC can remain as is.
This way, we can move on to a more concise interface for gardener-node-agent, which has several advantages:
- The new interface enhances the visibility and testability for the features that are currently wrapped implicitly inside the OSC's file and unit definitions, making them easier to parametrize and to implement
- Gardener and Extensions should not need to struggle with OS-/node-internals, only the gardener-node-agent should implement this kind of code
- Extensions that want to hook into node-specific configuration do not need to agree on conventions (an example might be the registry-extension, which wants to add mirror configuration for containerd and has to put it to a specific path in order not to conflict with other extensions that want to modify the containerd configuration as well).
- The node-agent can directly read its configuration from the secret instead of the described approach where the configuration is wrapped inside the OSC, which the OSC controller writes to the disk, which then notifies other controllers to apply possible changes.
Here is an example of how the configuration could be extended:
```go
type NodeAgentConfiguration struct {
	metav1.TypeMeta
	// APIServer contains the connection configuration for the gardener-node-agent to
	// access the shoot api server.
	APIServer APIServer `json:"apiServer"`
	// ConfigurationSecretName defines the name of the secret in the shoot cluster, which contains
	// the configuration for the gardener-node-agent.
	ConfigurationSecretName string `json:"configSecretName"`
	// TokenSecretName defines the name of the secret in the shoot cluster, which contains
	// the projected shoot access token for the gardener-node-agent.
	TokenSecretName string `json:"tokenSecretName"`
	// Image is the docker image reference to the gardener-node-agent.
	Image string `json:"image"`
	// HyperkubeImage is the docker image reference to the hyperkube containing kubelet.
	HyperkubeImage string `json:"hyperkubeImage"`
	// KubernetesVersion contains the kubernetes version of the kubelet, used for annotating
	// the corresponding node resource with a kubernetes version annotation.
	KubernetesVersion string `json:"kubernetesVersion"`
	// ContainerdConfiguration configures the containerd runtime according to the given configuration.
	// +optional
	ContainerdConfiguration *ContainerdConfiguration `json:"containerdConfiguration,omitempty"`
	// ValitailConfiguration configures valitail according to the given configuration.
	// +optional
	ValitailConfiguration *ValitailConfiguration `json:"valitailConfiguration,omitempty"`
	// KubeletDataVolumeSize sets the data volume size of an unformatted disk on the worker node,
	// which is to be used for /var/lib on the worker.
	// +optional
	KubeletDataVolumeSize *int64 `json:"kubeletDataVolumeSize,omitempty"`
}

type ContainerdConfiguration struct {
	// RegistryMirrors is a list of registry mirrors to configure for containerd.
	RegistryMirrors []string `json:"registryMirrors,omitempty"`
	// MonitorEnabled deploys a service to monitor the containerd service.
	MonitorEnabled bool `json:"monitorEnabled"`
	// LogRotationPeriod is the cron schedule for when logs of containerd are rotated.
	LogRotationPeriod string `json:"logRotationPeriod"`
	// ... fields have to be worked out in the future
}

type ValitailConfiguration struct {
	// ... fields have to be worked out in the future
}
```
Do you think this is feasible?
As promised, I rebased all currently known future changes from the hackathon on top of the current PR. Here you can have a look at the additional changes. As I could not yet come up with an ideal order of upcoming PRs based on the controllers, I did not create isolated commits (although it would be easier to cherry-pick and rebase a controller for a PR).
Sadly the codebase has diverged and I couldn't fix all compile issues by myself.
- Compared to the current PR: https://github.com/metal-stack/gardener/compare/gardener-node-agent-lib...gardener-node-agent-complete
- `gardener-node-agent-complete` on metal-stack/gardener
- The original, non-rebased version is `gardener-node-agent-2`. But due to the unrelated history it displays all changes - even the original and unrevisited ones in the current initial PR: https://github.com/metal-stack/gardener/compare/gardener-node-agent-lib...gardener-node-agent-2
I hope this serves well as a starting point
/assign @oliver-goetz @rfranzke
It would be very helpful if a future node-agent could try to avoid restarting all kubelets more or less at once but do it in a jittered/delayed/consecutive way to prevent the effects of https://github.com/kubernetes/kubernetes/issues/100277. We regularly see customers having short outages due to the annoying restart behaviour that renders all containers temporarily unready.
Hi @dguendisch, it is currently under discussion whether we simply jitter the restarts uncoordinated across all nodes, or do it in an orchestrated way to achieve some sort of rolling update/restart of the kubelets. The latter is much more difficult.
Sounds reasonable, a jittered restart should be already way better (if spread across a sufficiently big time window) than what we have today.
@dguendisch We already jitter kubelet restarts today (actually, all cloud-config updates, but anyways). The time window is just `5m`, however; see https://github.com/gardener/gardener/blob/d2dcd7dce7c39f16e02cfe081d9e26ed7e751ee9/pkg/operation/shoot/shoot.go#L250-L260 and https://github.com/gardener/gardener/blob/d2dcd7dce7c39f16e02cfe081d9e26ed7e751ee9/pkg/component/extensions/operatingsystemconfig/executor/templates/scripts/execute-cloud-config.tpl.sh#L221-L242. Why isn't this good enough? Of course, with "bad luck", multiple nodes could compute a delay close to each other, which would lead to the impression of "restarts at once"...
I see, I was wrongly under the impression that it's not actively jittered beyond the (30sec?) interval of the cloud-config being executed. Then it indeed rather sounds like "bad luck" that some customers observe.
OK. I have added a task above for improving this anyways, but for now/for the initial version of `gardener-node-agent`, we will probably go with the current approach. We can also try to increase the `5m` if really necessary. Perhaps you can share some timestamps from clusters that suffered from it for better understanding/analysis :)
The last case I've noticed seems "somewhat" jittered (though not really randomly), just filtered the logs for one particular kubelet startup msg:
(also strange in the case above was that every kubelet actually restarted twice in a 10secs timewindow, but that's maybe something completely different)
I guess this depends on the entropy of `$RANDOM`, which is a Bash built-in.
Regarding this item:
> Restrict `Secret` cache to only relevant `Secret`s (access token secrets, OSC secrets) (use `SingleObject` cache)
This item tries to solve two existing problems:
- `gardener-node-agent` has RBAC permissions for listing all `Secret`s in the `kube-system` namespace.
- `gardener-node-agent` watches all `Secret`s in the `kube-system` namespace (this can drive up memory usage for some shoot clusters, especially when they have a lot of worker pools).
I don't think it's possible to use the single object cache for our use case. This cache only works well when you don't have controllers watching the respective object kind. Today, we only use it in `gardenlet`, and for all kinds put behind the single object cache, `gardenlet` indeed does not have any controller reconciling these objects. It only performs `GET` calls for them as part of other reconciliations.
`gardener-node-agent`, however, has two controllers watching `Secret`s: the token controller as well as the operatingsystemconfig controller. The secret names are well-known via its component config.
While we can configure label and field selectors via the manager's cache options, Kubernetes generally does not support multiple field selectors that are OR-ed. Unfortunately, this is what we need, something like `metadata.name=<osc-secret> OR metadata.name=<token-secret1> OR ...`. This disqualifies natively using the standard controller-runtime options.
An alternative could be to create individual caches for the respective secrets, something like:
```go
oscSecretCache, err := cache.New(restConfig, cache.Options{
ByObject: map[client.Object]cache.ByObject{&corev1.Secret{}: {
Namespaces: map[string]cache.Config{metav1.NamespaceSystem: {}},
Field: fields.SelectorFromSet(fields.Set{metav1.ObjectNameField: cfg.Controllers.OperatingSystemConfig.SecretName}),
}},
})
// repeat the same for the token secrets
```
The problem is that we would also need to aggregate them properly so that the two controllers work. This is actually quite hard since each of these caches is backed by a dedicated informer and a dedicated WATCH connection. So, we would need to build something like an aggregated informer which gets updated whenever any of these WATCHes updates. While this might be possible, I don't think it's reasonable to pursue this path any further.
Instead, I propose to accept the first "problem" (`gardener-node-agent` is a system component, so allowing it to watch `Secret`s in the `kube-system` namespace is not really problematic, and there are also other system components with this permission), and rather tackle the second problem by working with label selectors for `gardener-node-agent`'s cache/WATCHes:
`gardenlet` already adds labels to `OperatingSystemConfig` `Secret`s that could be used by `gardener-node-agent` for its label selector. Access token `Secret`s don't have labels yet; however, `gardenlet` could start adding them as well.
Still, there would be the problem that label selectors cannot be OR-ed (we want to list/watch all `Secret`s with labels `gardener.cloud/role=operating-systemconfig` AND `worker.gardener.cloud/pool=<pool-name>`, OR `gardener.cloud/role=access-token`). This does not work in Kubernetes, and we are back to the same issue from above (we would need aggregation for two caches).
What we could do, though, is to create dedicated caches for the operatingsystemconfig and token controllers, e.g. like this:
```diff
diff --git a/pkg/nodeagent/controller/operatingsystemconfig/add.go b/pkg/nodeagent/controller/operatingsystemconfig/add.go
index 0d13e256b..2702d377b 100644
--- a/pkg/nodeagent/controller/operatingsystemconfig/add.go
+++ b/pkg/nodeagent/controller/operatingsystemconfig/add.go
@@ -22,9 +22,12 @@ import (
"github.com/spf13/afero"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+ "k8s.io/apimachinery/pkg/fields"
+ "k8s.io/apimachinery/pkg/runtime/schema"
"k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/util/workqueue"
"sigs.k8s.io/controller-runtime/pkg/builder"
+ "sigs.k8s.io/controller-runtime/pkg/cache"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/controller"
"sigs.k8s.io/controller-runtime/pkg/event"
@@ -34,6 +37,7 @@ import (
"sigs.k8s.io/controller-runtime/pkg/reconcile"
"sigs.k8s.io/controller-runtime/pkg/source"
+ kubernetescache "github.com/gardener/gardener/pkg/client/kubernetes/cache"
predicateutils "github.com/gardener/gardener/pkg/controllerutils/predicate"
nodeagentv1alpha1 "github.com/gardener/gardener/pkg/nodeagent/apis/config/v1alpha1"
"github.com/gardener/gardener/pkg/nodeagent/dbus"
@@ -46,9 +50,6 @@ const ControllerName = "operatingsystemconfig"
// AddToManager adds Reconciler to the given manager.
func (r *Reconciler) AddToManager(mgr manager.Manager) error {
- if r.Client == nil {
- r.Client = mgr.GetClient()
- }
if r.Recorder == nil {
r.Recorder = mgr.GetEventRecorderFor(ControllerName)
}
@@ -62,16 +63,34 @@ func (r *Reconciler) AddToManager(mgr manager.Manager) error {
r.Extractor = registry.NewExtractor()
}
}
+ oscSecretCache, err := cache.New(mgr.GetConfig(), cache.Options{
+ ByObject: map[client.Object]cache.ByObject{&corev1.Secret{}: {
+ Namespaces: map[string]cache.Config{metav1.NamespaceSystem: {}},
+ Field: fields.SelectorFromSet(fields.Set{metav1.ObjectNameField: r.Config.SecretName}),
+ }},
+ })
+ if err != nil {
+ return err
+ }
+ if err := mgr.Add(oscSecretCache); err != nil {
+ return err
+ }
+ c := kubernetescache.NewAggregator(mgr.GetCache(), map[schema.GroupVersionKind]cache.Cache{corev1.SchemeGroupVersion.WithKind("Secret"): oscSecretCache}, mgr.GetScheme())
+
+ if r.Client == nil {
+ r.Client, err = client.New(mgr.GetConfig(), client.Options{Cache: &client.CacheOptions{Reader: c}})
+ if err != nil {
+ return err
+ }
+ }
+
return builder.
ControllerManagedBy(mgr).
Named(ControllerName).
WatchesRawSource(
- source.Kind(mgr.GetCache(), &corev1.Secret{}),
+ source.Kind(c, &corev1.Secret{}),
r.EnqueueWithJitterDelay(mgr.GetLogger().WithValues("controller", ControllerName).WithName("jitterEventHandler")),
builder.WithPredicates(
- predicate.NewPredicateFuncs(func(obj client.Object) bool {
- return obj.GetNamespace() == metav1.NamespaceSystem && obj.GetName() == r.Config.SecretName
- }),
r.SecretPredicate(),
predicateutils.ForEventTypes(predicateutils.Create, predicateutils.Update),
				),
```
(similarly for the token controller)
This would start two WATCHes for `Secret`s, one with the field selector for the OSC secret name, and one with the label selector for the access token secrets. While this approach is not the most straight-forward one, it should help improve `gardener-node-agent`'s memory usage for large clusters. It still requires permissions to read all `Secret`s in `kube-system`, but as discussed above, I think this is acceptable.
In a sync with @timebertt we discussed more options compared to the above (which is also okay, but a bit cumbersome since the manager's client would not be allowed to be used when reading `Secret`s, so it increases the development complexity and maintenance effort):
1. Add the access tokens as `extensionsv1alpha1.File`s to the OSC secret. This would allow us to delete the `token` controller and restrict GNA to exactly the OSC secret (via field selector). The challenge here is that `gardener-resource-manager`'s token requestor would either need to inject this file into the OSC secret which is managed by a `ManagedResource` (abandoned due to complexity), or the `gardenlet` would need to inject the tokens during `Shoot` reconciliation (abandoned as well since it might break clusters which hang in earlier reconciliation steps before reaching the OSC secret reconciliation).
2. Add the access tokens as separate data keys to the OSC secret. This would again allow us to restrict GNA to exactly the OSC secret (via field selector). The challenge here is that we would need to adapt GRM to not revert these changes (since the OSC secret is managed via a `ManagedResource`), and that we probably should create a dedicated access token for GNA/valitail for each worker pool. We also abandoned this approach because the implementation effort is a bit high compared to the goal/benefits.
3. Drop the `Secret` watch in GNA's `token` controller and replace it with a periodic resync which uses the API reader instead. Hence, access tokens would no longer be watched, but this is somewhat "over-engineered" anyways, since they only change roughly every `12h` and are valid for `90d`, so even if not synced immediately, the old token is still usable. It would be enough to just read them once every ~`12h` and sync them again to the node. This would allow us to restrict the GNA secret cache to the OSC secret only (via field selector).
After comparing the implementation effort, complexity, and maintenance concerns with the goals and benefits, we decided to pursue approach 3.
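A minimal sketch of approach 3, with assumed names (this is not the final implementation): the token controller drops its `Secret` watch and instead re-reads the token with the uncached API reader on a fixed resync period, writing it to the node.

```go
// Illustrative sketch of a watch-less token sync: the reconciler reads the access token
// secret via the API reader (bypassing the cache) and requeues itself periodically.
package example

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type tokenReconciler struct {
	// apiReader reads directly from the API server, so no Secret watch (and no Secret
	// cache memory) is needed for the token at all.
	apiReader     client.Reader
	secretName    string
	tokenFilePath string
	resyncPeriod  time.Duration // e.g. roughly the 12h token rotation interval
}

func (r *tokenReconciler) Reconcile(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {
	secret := &corev1.Secret{}
	if err := r.apiReader.Get(ctx, types.NamespacedName{Namespace: metav1.NamespaceSystem, Name: r.secretName}, secret); err != nil {
		return reconcile.Result{}, err
	}

	// Sync the token to the node; tokens are valid long enough that a periodic resync
	// (instead of an immediate watch-based update) is sufficient.
	if err := os.WriteFile(r.tokenFilePath, secret.Data["token"], 0o600); err != nil {
		return reconcile.Result{}, err
	}

	return reconcile.Result{RequeueAfter: r.resyncPeriod}, nil
}
```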
/area ipcei
All tasks have been completed. /close
@rfranzke: Closing this issue.
In response to this:
> All tasks have been completed. /close