☂️ Introduce Gardener Node Agent (a.k.a. replace `cloud-config-downloader`)
How to categorize this issue?
/area quality robustness /kind enhancement epic
What would you like to be added:
Replace the `cloud-config-downloader`/`executor`, which are written in Bash, with a binary written in Go which is a Kubernetes controller based on controller-runtime.
This was implemented up to a working state during the Gardener Hackathon 2023 in Leverkusen: https://github.com/gardener-community/hackathon
Why is this needed:
With the new architecture we gain a lot; the most important gains are described here.
Developer Productivity
Because we all develop in Go day by day, writing business logic in Bash is difficult, hard to maintain, and almost impossible to test. Getting rid of almost all Bash scripts currently in use for this very important part of the cluster creation process will speed up adding new features and fixing bugs.
Speed
Until now, the `cloud-config-downloader` runs in a loop every 60s to check whether something changed on the shoot which requires modifications on the worker node. This produces a lot of unneeded traffic on the API server and wastes time; it can take up to 60s until a desired modification is started on the worker node.
By using controller-runtime we can watch the node, the OSC in its secret, and the shoot access token in its secret. If any of these objects changes, and only then, the required action takes effect immediately. This speeds up operations and dramatically reduces the load on the shoot's API server.
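To illustrate the watch-based approach, here is a minimal, hypothetical sketch (not the actual `gardener-node-agent` code; the reconciler, secret names, and predicate are assumptions, and the controller-runtime v0.15+ builder API is assumed) of how a controller could react immediately to changes instead of polling:

```go
// Illustrative sketch only; reconciler and field names are made up for this example.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type nodeConfigReconciler struct {
	client client.Client
	// Names of the secrets containing the OSC and the shoot access token,
	// taken from the component configuration.
	oscSecretName, tokenSecretName string
}

func (r *nodeConfigReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Apply the OSC / refresh the access token as soon as a watched object changes.
	return reconcile.Result{}, nil
}

func addToManager(mgr manager.Manager, r *nodeConfigReconciler) error {
	return builder.ControllerManagedBy(mgr).
		Named("node-config").
		// Watch the node object itself ...
		For(&corev1.Node{}).
		// ... and the secrets with the OSC and the shoot access token, so that changes
		// take effect immediately instead of waiting for the next 60s polling interval.
		Watches(&corev1.Secret{}, &handler.EnqueueRequestForObject{}, builder.WithPredicates(
			predicate.NewPredicateFuncs(func(obj client.Object) bool {
				return obj.GetName() == r.oscSecretName || obj.GetName() == r.tokenSecretName
			}),
		)).
		Complete(r)
}
```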
Scalability
Currently, the `cloud-config-downloader` adds a random wait time before restarting the `kubelet` in case the `kubelet` was updated or its configuration was changed. This is required to reduce the load on the API server and the traffic on the internet uplink. It also reduces the overall downtime of the services in the cluster, because every `kubelet` restart puts a node into `NotReady` state for several seconds, which can interrupt service availability.
The decision was made to keep the existing jitter mechanism, which calculates the `kubelet-download-and-restart-delay-seconds` on the controller itself.
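As a rough illustration of the jitter idea (a hedged sketch, not the real implementation; the window size and names are assumptions):

```go
// Minimal sketch of the jitter idea: every node waits a random amount of time within a
// bounded window before restarting the kubelet, so restarts do not all happen at once.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// kubeletRestartDelay returns a random delay in [0, maxDelay).
func kubeletRestartDelay(maxDelay time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(maxDelay)))
}

func main() {
	delay := kubeletRestartDelay(5 * time.Minute)
	fmt.Printf("waiting %s before restarting the kubelet\n", delay.Round(time.Second))
	time.Sleep(delay)
	// ... then restart the kubelet systemd unit
}
```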
Correctness
The configuration of the `cloud-config-downloader` is currently done by placing a file for every configuration item on the disk of the worker node. This was done because parsing the content of a single file and using it as a value in Bash reduces to something like `VALUE=$(cat /the/path/to/the/file)`. Simple, but it lacks validation, type safety, and so on.
With the `gardener-node-agent` we introduce a new API which is stored in the `gardener-node-agent` secret and persisted on disk in a single YAML file for comparison with the previously known state. This brings all the benefits of type-safe configuration. Because the actual and the previous configuration are compared, files and units that were removed from the OSC are also removed and stopped on the worker.
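A minimal sketch of that comparison logic (illustrative only; the persisted state type and file paths are assumptions, not the real `gardener-node-agent` API):

```go
// Illustrative sketch: the current OSC state is compared with the last applied state
// persisted on disk as YAML, so that files/units removed from the OSC can be deleted
// and stopped on the node.
package example

import (
	"os"

	"sigs.k8s.io/yaml"
)

// appliedState stands in for the type-safe representation of the last applied OSC.
type appliedState struct {
	Files []string `json:"files"`
	Units []string `json:"units"`
}

func loadPreviousState(path string) (*appliedState, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return &appliedState{}, nil // nothing applied yet
	} else if err != nil {
		return nil, err
	}
	state := &appliedState{}
	return state, yaml.Unmarshal(data, state)
}

func persistState(path string, state *appliedState) error {
	data, err := yaml.Marshal(state)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

// removedItems returns everything present in the previous state but no longer in the
// current one (used for both files and units).
func removedItems(previous, current []string) []string {
	currentSet := make(map[string]struct{}, len(current))
	for _, c := range current {
		currentSet[c] = struct{}{}
	}
	var removed []string
	for _, p := range previous {
		if _, ok := currentSet[p]; !ok {
			removed = append(removed, p)
		}
	}
	return removed
}
```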
Availability
Previously, the `cloud-config-downloader` simply restarted the `systemd` units on every change to the OSC, regardless of which of the services changed. The `gardener-node-agent` first checks which `systemd` units were changed and only restarts those. This removes unneeded `kubelet` restarts.
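A small, hypothetical sketch of this "restart only what changed" behavior (unit names and the systemd abstraction are assumptions):

```go
// Illustrative sketch: compare the rendered unit contents of the previous and current
// OSC and restart only the units that actually changed.
package example

import (
	"context"
	"fmt"
)

// unitRestarter abstracts the systemd interaction (e.g. a dbus-based client).
type unitRestarter interface {
	Restart(ctx context.Context, unitName string) error
}

// changedUnits returns the names of units whose content differs between the two maps
// (unit name -> rendered unit file content).
func changedUnits(previous, current map[string]string) []string {
	var changed []string
	for name, content := range current {
		if previous[name] != content {
			changed = append(changed, name)
		}
	}
	return changed
}

func restartChanged(ctx context.Context, r unitRestarter, previous, current map[string]string) error {
	for _, unit := range changedUnits(previous, current) {
		fmt.Println("restarting changed unit", unit)
		if err := r.Restart(ctx, unit); err != nil {
			return err
		}
	}
	return nil
}
```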
Goals
- Create a new shoot which only uses the new `gardener-node-agent`.
- Update existing workers from `cloud-config-downloader` to `gardener-node-agent`.
- Keep the same behavior when applying the OSC as before.
Tasks (basic functionality)
- Initial scaffolding and introduction of the new component
- [x] https://github.com/gardener/gardener/pull/8249
- [x] https://github.com/gardener/gardener/pull/8627
- Business Logic
- [x] https://github.com/gardener/gardener/pull/8629
- [x] https://github.com/gardener/gardener/pull/8632
- [x] https://github.com/gardener/gardener/pull/8683
- [x] https://github.com/gardener/gardener/pull/8678
- [x] https://github.com/gardener/gardener/pull/8707
- [x] https://github.com/gardener/gardener/pull/8760
- [x] https://github.com/gardener/gardener/pull/8814
- [x] https://github.com/gardener/gardener/pull/8827
- [x] https://github.com/gardener/gardener/pull/8825
- [x] https://github.com/gardener/gardener/pull/8838
- [x] https://github.com/gardener/gardener/pull/8826
- [x] https://github.com/gardener/gardener/pull/8839
- [x] https://github.com/gardener/gardener/pull/8767
- [x] https://github.com/gardener/gardener/pull/8893
- [x] https://github.com/gardener/gardener/pull/8901
- [x] https://github.com/gardener/gardener/pull/8894
- [x] https://github.com/gardener/gardener/pull/8898
- [x] https://github.com/gardener/gardener/pull/8907
- [x] https://github.com/gardener/gardener/pull/8919
- [x] https://github.com/gardener/gardener/pull/9096
- Bootstrapping `gardener-node-agent`
  - [x] https://github.com/gardener/gardener/pull/8726
- `Seed`/`Shoot` Controllers Adaptation
  - [x] https://github.com/gardener/gardener/pull/8756
  - [x] https://github.com/gardener/gardener/pull/8836
  - [x] https://github.com/gardener/gardener/pull/8847
  - [x] https://github.com/gardener/gardener/pull/9073
- Adapt OS Extensions (check if they manipulate the `cloud-config-downloader` unit (sample) - if they do, then try to remove these or enable `gardener-node-init` instead): https://github.com/gardener/gardener/pull/8647
  - [x] `provider-local`: https://github.com/gardener/gardener/pull/8647 and https://github.com/gardener/gardener/pull/8834
  - [x] `os-gardenlinux`: https://github.com/gardener/gardener-extension-os-gardenlinux/pull/130, released with `v0.22.0`
  - [x] `os-suse-chost`: https://github.com/gardener/gardener-extension-os-suse-chost/pull/112, released with `v1.24.0`
  - [x] `os-coreos`: https://github.com/gardener/gardener-extension-os-coreos/pull/80, released with `v1.19.0`
  - [x] `os-ubuntu`: https://github.com/gardener/gardener-extension-os-ubuntu/pull/99, released with `v1.24.0`
  - [x] `os-metal`: https://github.com/metal-stack/os-metal-extension/pull/33, released with `v?` (PR is ready, but no need to wait for it as part of this umbrella issue, so let's consider it done)
  - [x] ~~`os-k3os`~~ (not needed since it was never used anywhere and is outdated anyways)
- Miscellaneous
  - [x] https://github.com/gardener/gardener/pull/7993 (this was found missing during the implementation of the `gardener-node-agent`)
  - [x] Documentation: https://github.com/gardener/gardener/pull/8249
  - [x] ~~Extra build job to publish `gardener-node-agent` and potentially `kubelet`/`kubectl` as OCI Image, see below section~~ (not needed since we decided to use native `containerd` functionality for extracting binaries from images, see https://github.com/gardener/gardener/pull/8678)
  - [x] ~~Build `gardener-node-agent` container image in Prow (multi-arch!)~~ (we decided to drop Prow builds, see https://github.com/gardener/ci-infra/pull/981 and https://github.com/gardener/gardener/pull/8740 for details)
  - [x] https://github.com/gardener/gardener/pull/8885
  - [x] https://github.com/gardener/gardener/pull/8857
  - [x] https://github.com/gardener/gardener/pull/8786
  - [x] https://github.com/gardener/gardener/pull/9048 (follow-up of https://github.com/gardener/gardener/pull/8905)
  - [x] https://github.com/gardener/gardener/pull/9077
  - [x] https://github.com/gardener/gardener/pull/9114
  - [x] https://github.com/gardener/gardener/pull/9280
- Graduation & Rollout
  - [x] Introduce new alpha `UseGardenerNodeAgent` feature gate in `gardenlet` controlling whether `gardener-node-{init,agent}` should be used: https://github.com/gardener/gardener/pull/8647
  - [x] https://github.com/gardener/gardener/pull/9161
  - [x] https://github.com/gardener/gardener/pull/9208
  - [ ] Remove `UseGardenerNodeAgent` feature gate (planned for `v1.92`)
    - Cleanup migration code
    - Remove `nerdctl` from local OS image
    - Announce that correct versions of OS extensions are required: `[email protected]+`, `[email protected]+`, `[email protected]+`, `[email protected]+`, `[email protected]+`
Fixes https://github.com/gardener/gardener/issues/4673
OCI Image Support
~~A decision was made to go ahead with the current approach by extracting the binaries (kubelet, gardener-node-agent) out of docker images as currently implemented. Down the way towards GA of gardener-node-agent, the OCI artifact support is implemented and right before GA, this support is merged. The existing docker images of all still supported kubernetes/kubelet versions need to be pushed as OCI artifact and of course the binary of the gardener-node-agent.~~
Update: Not needed since we decided to use native `containerd` functionality for extracting binaries from images, see https://github.com/gardener/gardener/pull/8678.
No modifications are made to cloud-config-downloader to support OCI artifacts.
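For reference, a heavily hedged sketch of what the native `containerd` approach could look like (this is not the code from the PR above; the client calls, namespace, socket path, and file paths are assumptions based on the public containerd Go client):

```go
// Sketch only: pull an image with containerd, mount a read-only view of its root
// filesystem, and copy a single binary out of it to the host.
package example

import (
	"context"
	"io"
	"os"
	"path/filepath"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/mount"
	"github.com/containerd/containerd/namespaces"
	"github.com/opencontainers/image-spec/identity"
)

// extractBinary pulls imageRef and copies pathInImage from its root filesystem to
// targetPath. Reference and paths are illustrative.
func extractBinary(ctx context.Context, imageRef, pathInImage, targetPath string) error {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		return err
	}
	defer client.Close()

	ctx = namespaces.WithNamespace(ctx, "k8s.io")

	// Pull and unpack the image so its layers are available in the snapshotter.
	image, err := client.Pull(ctx, imageRef, containerd.WithPullUnpack)
	if err != nil {
		return err
	}

	// Create a read-only view snapshot of the image's root filesystem.
	diffIDs, err := image.RootFS(ctx)
	if err != nil {
		return err
	}
	snapshotter := client.SnapshotService(containerd.DefaultSnapshotter)
	viewKey := "extract-" + filepath.Base(pathInImage)
	mounts, err := snapshotter.View(ctx, viewKey, identity.ChainID(diffIDs).String())
	if err != nil {
		return err
	}
	defer snapshotter.Remove(ctx, viewKey)

	// Mount the view to a temporary directory and copy the binary out.
	mountDir, err := os.MkdirTemp("", "image-root-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(mountDir)

	if err := mount.All(mounts, mountDir); err != nil {
		return err
	}
	defer mount.UnmountAll(mountDir, 0)

	src, err := os.Open(filepath.Join(mountDir, pathInImage))
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.OpenFile(targetPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	defer dst.Close()

	_, err = io.Copy(dst, src)
	return err
}
```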
:+1: for this stuff!
I would like to propose another future goal, which can be turned into its own issue after this one is resolved. The idea is to drop the OSC with the `-original` suffix (purpose `reconcile`) and migrate its contents into the `NodeAgentConfiguration`. The `-downloader` OSC can remain as is.
This way, we can move on to a more concise interface for gardener-node-agent, which has several advantages:
- The new interface enhances the visibility and testability for the features that are currently wrapped implicitly inside the OSC's file and unit definitions, making them easier to parametrize and to implement
- Gardener and Extensions should not need to struggle with OS-/node-internals, only the gardener-node-agent should implement this kind of code
- Extensions that want to hook into node-specific configuration do not need to agree on conventions (an example might be the registry-extension, which wants to add mirror configuration for containerd and has to put it to a specific path in order not to conflict with other extensions that want to modify the containerd configuration as well).
- The node-agent can directly read its configuration from the secret instead of the described approach where the configuration is wrapped inside the OSC, which the OSC controller writes to the disk, which then notifies other controllers to apply possible changes.
Here is an example of how the configuration could be extended:
```go
type NodeAgentConfiguration struct {
	metav1.TypeMeta
	// APIServer contains the connection configuration for the gardener-node-agent to
	// access the shoot api server.
	APIServer APIServer `json:"apiServer"`
	// ConfigurationSecretName defines the name of the secret in the shoot cluster, which contains
	// the configuration for the gardener-node-agent.
	ConfigurationSecretName string `json:"configSecretName"`
	// TokenSecretName defines the name of the secret in the shoot cluster, which contains
	// the projected shoot access token for the gardener-node-agent.
	TokenSecretName string `json:"tokenSecretName"`
	// Image is the docker image reference to the gardener-node-agent.
	Image string `json:"image"`
	// HyperkubeImage is the docker image reference to the hyperkube containing kubelet.
	HyperkubeImage string `json:"hyperkubeImage"`
	// KubernetesVersion contains the kubernetes version of the kubelet, used for annotating
	// the corresponding node resource with a kubernetes version annotation.
	KubernetesVersion string `json:"kubernetesVersion"`
	// ContainerdConfiguration configures the containerd runtime according to the given configuration.
	// +optional
	ContainerdConfiguration *ContainerdConfiguration `json:"containerdConfiguration,omitempty"`
	// ValitailConfiguration configures valitail according to the given configuration.
	// +optional
	ValitailConfiguration *ValitailConfiguration `json:"valitailConfiguration,omitempty"`
	// KubeletDataVolumeSize sets the data volume size of an unformatted disk on the worker node,
	// which is to be used for /var/lib on the worker.
	// +optional
	KubeletDataVolumeSize *int64 `json:"kubeletDataVolumeSize,omitempty"`
}

type ContainerdConfiguration struct {
	// RegistryMirrors is a list of registry mirrors to configure for containerd.
	RegistryMirrors []string `json:"registryMirrors,omitempty"`
	// MonitorEnabled deploys a service to monitor the containerd service.
	MonitorEnabled bool `json:"monitorEnabled"`
	// LogRotationPeriod is the cron schedule for when logs of containerd are rotated.
	LogRotationPeriod string `json:"logRotationPeriod"`
	// ... fields have to be worked out in the future
}

type ValitailConfiguration struct {
	// ... fields have to be worked out in the future
}
```
Do you think this is feasible?
As promised, I rebased all currently known future changes from the hackathon on top of the current PR. Here you can have a look at the additional changes. As I could not yet come up with an ideal order of upcoming PRs based on the controllers, I did not create isolated commits (although it would be easier to cherry-pick and rebase a controller for a PR).
Sadly the codebase has diverged and I couldn't fix all compile issues by myself.
- Compared to the current PR: https://github.com/metal-stack/gardener/compare/gardener-node-agent-lib...gardener-node-agent-complete
- `gardener-node-agent-complete` on metal-stack/gardener
- The original, non-rebased version is `gardener-node-agent-2`. But due to the unrelated history it displays all changes - even the original and unrevisited ones in the current initial PR: https://github.com/metal-stack/gardener/compare/gardener-node-agent-lib...gardener-node-agent-2
I hope this serves well as a starting point
/assign @oliver-goetz @rfranzke
It would be very helpful if a future node-agent could try to avoid restarting all kubelets more or less at once but do it in a jittered/delayed/consecutive way to prevent the effects of https://github.com/kubernetes/kubernetes/issues/100277. We regularly see customers having short outages due to the annoying restart behaviour that renders all containers temporarily unready.
Hi @dguendisch, it is currently under discussion whether we simply jitter the restarts uncoordinated across all nodes, or do it in an orchestrated way to achieve some sort of rolling update/restart of the kubelets. The latter is much more difficult.
Sounds reasonable, a jittered restart should be already way better (if spread across a sufficiently big time window) than what we have today.
@dguendisch We already jitter kubelet restarts today (actually, all cloud-config updates, but anyways). The time window is just `5m`, however; see https://github.com/gardener/gardener/blob/d2dcd7dce7c39f16e02cfe081d9e26ed7e751ee9/pkg/operation/shoot/shoot.go#L250-L260 and https://github.com/gardener/gardener/blob/d2dcd7dce7c39f16e02cfe081d9e26ed7e751ee9/pkg/component/extensions/operatingsystemconfig/executor/templates/scripts/execute-cloud-config.tpl.sh#L221-L242. Why isn't this good enough? Of course, with "bad luck", multiple nodes could compute a delay close to each other, which would lead to the impression of "restarts at once"...
I see, I was wrongly under the impression that it's not actively jittered beyond the (30sec?) interval of the cloud-config being executed. Then it indeed rather sounds like "bad luck" that some customers observe.
OK. I have added a task above for improving this anyways, but for now/for the initial version of `gardener-node-agent`, we will probably go with the current approach. We can also try to increase the `5m` if really necessary. Perhaps you can share some timestamps from clusters that suffered from it for better understanding/analysis :)
The last case I've noticed seems "somewhat" jittered (though not really randomly), just filtered the logs for one particular kubelet startup msg:
(also strange in the case above was that every kubelet actually restarted twice in a 10secs timewindow, but that's maybe something completely different)
I guess this depends on the entropy of `$RANDOM`, which is a Bash built-in.
Regarding this item:
> Restrict `Secret` cache to only relevant `Secret`s (access token secrets, OSC secrets) (use `SingleObject` cache)
This item tries to solve two existing problems:
- `gardener-node-agent` has RBAC permissions for listing all `Secret`s in the `kube-system` namespace.
- `gardener-node-agent` watches all `Secret`s in the `kube-system` namespace (this can drive up memory usage for some shoot clusters, especially when they have a lot of worker pools).
I don't think it's possible to use the single object cache for our use case. This cache only works well when you don't have controllers watching the respective object kind. Today, we only use it in `gardenlet`, and for all kinds put behind the single object cache, `gardenlet` indeed does not have any controller reconciling these objects. It only performs `GET` calls for them as part of other reconciliations.
`gardener-node-agent`, however, has two controllers watching `Secret`s: the token controller as well as the operatingsystemconfig controller. The secret names are well-known via its component config.
While we can configure label and field selectors via the manager's cache options, Kubernetes generally does not support multiple field selectors that are OR-ed. Unfortunately, this is what we need, something like `metadata.name=<osc-secret> OR metadata.name=<token-secret1> OR ...`. This disqualifies natively using the standard controller-runtime options.
An alternative could be to create individual caches for the respective secrets, something like:
```go
oscSecretCache, err := cache.New(restConfig, cache.Options{
ByObject: map[client.Object]cache.ByObject{&corev1.Secret{}: {
Namespaces: map[string]cache.Config{metav1.NamespaceSystem: {}},
Field: fields.SelectorFromSet(fields.Set{metav1.ObjectNameField: cfg.Controllers.OperatingSystemConfig.SecretName}),
}},
})
// repeat the same for the token secrets
```
The problem is that we would also need to aggregate them properly so that the two controllers work. This is actually quite hard since each of these caches is backed by a dedicated informer and a dedicated WATCH connection. So, we would need to build something like an aggregated informer which gets updated whenever any of these WATCHes updates. While this might be possible, I don't think it's reasonable to pursue this path any further.
Instead, I propose to accept the first "problem" (`gardener-node-agent` is a system component, so allowing it to watch `Secret`s in the `kube-system` namespace is not really problematic, and there are also other system components with this permission), and rather tackle the second problem by working with label selectors for `gardener-node-agent`'s cache/WATCHes:
`gardenlet` already adds labels to `OperatingSystemConfig` `Secret`s that could be used by `gardener-node-agent` for its label selector. Access token `Secret`s don't have labels yet; however, `gardenlet` could start adding them as well.
Still, there would be the problem that label selectors cannot be OR-ed (we want to list/watch all `Secret`s with labels `gardener.cloud/role=operating-systemconfig` AND `worker.gardener.cloud/pool=<pool-name>`, OR `gardener.cloud/role=access-token`). This does not work in Kubernetes, and we are back to the same issue from above (we would need aggregation for two caches).
What we could do, though, is to create dedicated caches for the operatingsystemconfig and token controllers, e.g. like this:
```diff
diff --git a/pkg/nodeagent/controller/operatingsystemconfig/add.go b/pkg/nodeagent/controller/operatingsystemconfig/add.go
index 0d13e256b..2702d377b 100644
--- a/pkg/nodeagent/controller/operatingsystemconfig/add.go
+++ b/pkg/nodeagent/controller/operatingsystemconfig/add.go
@@ -22,9 +22,12 @@ import (
"github.com/spf13/afero"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+ "k8s.io/apimachinery/pkg/fields"
+ "k8s.io/apimachinery/pkg/runtime/schema"
"k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/util/workqueue"
"sigs.k8s.io/controller-runtime/pkg/builder"
+ "sigs.k8s.io/controller-runtime/pkg/cache"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/controller"
"sigs.k8s.io/controller-runtime/pkg/event"
@@ -34,6 +37,7 @@ import (
"sigs.k8s.io/controller-runtime/pkg/reconcile"
"sigs.k8s.io/controller-runtime/pkg/source"
+ kubernetescache "github.com/gardener/gardener/pkg/client/kubernetes/cache"
predicateutils "github.com/gardener/gardener/pkg/controllerutils/predicate"
nodeagentv1alpha1 "github.com/gardener/gardener/pkg/nodeagent/apis/config/v1alpha1"
"github.com/gardener/gardener/pkg/nodeagent/dbus"
@@ -46,9 +50,6 @@ const ControllerName = "operatingsystemconfig"
// AddToManager adds Reconciler to the given manager.
func (r *Reconciler) AddToManager(mgr manager.Manager) error {
- if r.Client == nil {
- r.Client = mgr.GetClient()
- }
if r.Recorder == nil {
r.Recorder = mgr.GetEventRecorderFor(ControllerName)
}
@@ -62,16 +63,34 @@ func (r *Reconciler) AddToManager(mgr manager.Manager) error {
r.Extractor = registry.NewExtractor()
}
}
+ oscSecretCache, err := cache.New(mgr.GetConfig(), cache.Options{
+ ByObject: map[client.Object]cache.ByObject{&corev1.Secret{}: {
+ Namespaces: map[string]cache.Config{metav1.NamespaceSystem: {}},
+ Field: fields.SelectorFromSet(fields.Set{metav1.ObjectNameField: r.Config.SecretName}),
+ }},
+ })
+ if err != nil {
+ return err
+ }
+ if err := mgr.Add(oscSecretCache); err != nil {
+ return err
+ }
+ c := kubernetescache.NewAggregator(mgr.GetCache(), map[schema.GroupVersionKind]cache.Cache{corev1.SchemeGroupVersion.WithKind("Secret"): oscSecretCache}, mgr.GetScheme())
+
+ if r.Client == nil {
+ r.Client, err = client.New(mgr.GetConfig(), client.Options{Cache: &client.CacheOptions{Reader: c}})
+ if err != nil {
+ return err
+ }
+ }
+
return builder.
ControllerManagedBy(mgr).
Named(ControllerName).
WatchesRawSource(
- source.Kind(mgr.GetCache(), &corev1.Secret{}),
+ source.Kind(c, &corev1.Secret{}),
r.EnqueueWithJitterDelay(mgr.GetLogger().WithValues("controller", ControllerName).WithName("jitterEventHandler")),
builder.WithPredicates(
- predicate.NewPredicateFuncs(func(obj client.Object) bool {
- return obj.GetNamespace() == metav1.NamespaceSystem && obj.GetName() == r.Config.SecretName
- }),
r.SecretPredicate(),
predicateutils.ForEventTypes(predicateutils.Create, predicateutils.Update),
				),
```
(similarly for the token controller)
This would start two WATCHes for `Secret`s, one with the field selector for the OSC secret name, and one with the label selector for the access token secrets. While this approach is not the most straight-forward one, it should help improve `gardener-node-agent`'s memory usage for large clusters. It still requires permissions to read all `Secret`s in `kube-system`, but as discussed above, I think this is acceptable.
In a sync with @timebertt we discussed more options compared to the above (which is also okay, but a bit cumbersome since the manager's client would not be allowed to be used when reading `Secret`s, so it increases the development complexity and maintenance effort):
1. Add the access tokens as `extensionsv1alpha1.File`s to the OSC secret. This would allow us to delete the `token` controller and restrict GNA to exactly the OSC secret (via field selector). The challenge here is that `gardener-resource-manager`'s token requestor would either need to inject this file into the OSC secret which is managed by a `ManagedResource` (abandoned due to complexity), or the `gardenlet` would need to inject the tokens during `Shoot` reconciliation (abandoned as well since it might break clusters which hang in earlier reconciliation steps before reaching the OSC secret reconciliation).
2. Add the access tokens as separate data keys to the OSC secret. This would again allow us to restrict GNA to exactly the OSC secret (via field selector). The challenge here is that we would need to adapt GRM to not revert these changes (since the OSC secret is managed via a `ManagedResource`), and that we probably should create a dedicated access token for GNA/valitail for each worker pool. We also abandoned this approach because the implementation effort is a bit high compared to the goal/benefits.
3. Drop the `Secret` watch in GNA's `token` controller and replace it with a periodic resync which uses the API reader instead. Hence, access tokens would no longer be watched, but this is somewhat "over-engineered" anyways, since they only change roughly every `12h` and are valid for `90d`, so even if not synced immediately, the old token is still usable. It would be enough to just read them once every ~`12h` and sync them again to the node. This would allow us to restrict the GNA secret cache to the OSC secret only (via field selector).
After comparing the implementation effort, complexity, and maintenance concerns with the goals and benefits, we decided to pursue approach 3.
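A minimal sketch of approach 3, with assumed names (this is not the final implementation): the token controller drops its `Secret` watch and instead re-reads the token with the uncached API reader on a fixed resync period, writing it to the node.

```go
// Illustrative sketch of a watch-less token sync: the reconciler reads the access token
// secret via the API reader (bypassing the cache) and requeues itself periodically.
package example

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type tokenReconciler struct {
	// apiReader reads directly from the API server, so no Secret watch (and no Secret
	// cache memory) is needed for the token at all.
	apiReader     client.Reader
	secretName    string
	tokenFilePath string
	resyncPeriod  time.Duration // e.g. roughly the 12h token rotation interval
}

func (r *tokenReconciler) Reconcile(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {
	secret := &corev1.Secret{}
	if err := r.apiReader.Get(ctx, types.NamespacedName{Namespace: metav1.NamespaceSystem, Name: r.secretName}, secret); err != nil {
		return reconcile.Result{}, err
	}

	// Sync the token to the node; tokens are valid long enough that a periodic resync
	// (instead of an immediate watch-based update) is sufficient.
	if err := os.WriteFile(r.tokenFilePath, secret.Data["token"], 0o600); err != nil {
		return reconcile.Result{}, err
	}

	return reconcile.Result{RequeueAfter: r.resyncPeriod}, nil
}
```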
/area ipcei
All tasks have been completed. /close
@rfranzke: Closing this issue.
In response to this:
> All tasks have been completed. /close