gardener icon indicating copy to clipboard operation
gardener copied to clipboard

Add containerd config in `OperatingSystemConfig` (I)

Open timuthy opened this issue 1 year ago • 5 comments

How to categorize this PR?

/area robustness /kind enhancement

What this PR does / why we need it: This PR adds containerd config to the OperatingSystemConfig API, mainly to configure registries and the sandbox image. Gardener, as well as extensions (e.g. gardener-extension-registry-cache), may contribute to this section of the API and gardener-node-agent takes necessary steps on the node. With this enhancement, it is now more structured and easy to add or remove OCI registries while earlier every extension needed to bring their own logic for this (e.g. via additional systemd units).

While working on this topic, we noticed that the container-initializer systemd service can entirely be replaced by logic in the gardener-node-agent, which is performed in this PR along the way.

Which issue(s) this PR fixes: Part of #8929 Also see https://github.com/gardener-community/hackathon/tree/main/2024-05_Schelklingen#-type-safe-configurability-in-operatingsystemconfig-for-containerd-dns-ntp-etc.

Special notes for your reviewer: The groundwork happened at the last Gardener Hackathon (ref). As opposed to the procedure proposed in #8929, we decided against generic containerd patches mainly because:

  • It's easier for Gardener and extensions to contribute to a well structured API.
  • We needed to copy internal patching procedures from kind to do the same.
  • We don't see a need for generic patches because the changes in this PR and the following (see description below) cover most of the use-cases.

It is planned to add more configuration options related to container runtime (see types_operatingsystemconfig.go diff here).

/cc @MrBatschner @Gerrit91 @ialidzhikov

Release note:

Additional containerd registries can directly be added to the `OperatingSystemConfig`. Once specified, `gardener-node-agent` takes care about necessary steps to configure the container runtime. The generic control plane ensurer was enhanced along the way to simplify mutating the relevant part of the `OperatingSystemConfig` ((ref)[https://github.com/timuthy/gardener/blob/74390628e336b0954b3983ea8f4e0519b1ee5c0d/extensions/pkg/webhook/controlplane/genericmutator/mutator.go#L88]).
The `containerd-intializer` systemd unit has been deprecated and will remain functionless on existing nodes for compatibility reasons.

timuthy avatar Jul 02 '24 07:07 timuthy

@timuthy: GitHub didn't allow me to request PR reviews from the following users: Gerrit91.

Note that only gardener members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

How to categorize this PR?

/area robustness /kind enhancement

What this PR does / why we need it: This PR adds containerd config to the OperatingSystemConfig API, mainly to configure registries and the sandbox image. Gardener, as well as extensions (e.g. gardener-extension-registry-cache), may contribute to this section of the API and gardener-node-agent takes necessary steps on the node. With this enhancement, it is now more structured and easy to add or remove OCI registries while earlier every extension needed to bring their own logic for this (e.g. via additional systemd units).

While working on this topic, we noticed that the container-initializer systemd service can entirely be replaced by logic in the gardener-node-agent, which is performed in this PR along the way.

Which issue(s) this PR fixes: Part of #8929 Also see https://github.com/gardener-community/hackathon/tree/main/2024-05_Schelklingen#-type-safe-configurability-in-operatingsystemconfig-for-containerd-dns-ntp-etc.

Special notes for your reviewer: The groundwork happened at the last Gardener Hackathon (ref). As opposed to the procedure proposed in #8929, we decided against generic containerd patches mainly because:

  • It's easier for Gardener and extensions to contribute to a well structured API.
  • We needed to copy internal patching procedures from kind to do the same.
  • We don't see a need for generic patches because the changes in this PR and the following (see description below) cover most of the use-cases.

It is planned to add more configuration options related to container runtime (see types_operatingsystemconfig.go diff here).

/cc @MrBatschner @Gerrit91 @ialidzhikov

Release note:

Additional containerd registries can directly be added to the `OperatingSystemConfig`. Once specified, `gardener-node-agent` takes care about necessary steps to configure the container runtime. The generic control plane ensurer was enhanced along the way to simplify mutating the relevant part of the `OperatingSystemConfig` ((ref)[https://github.com/timuthy/gardener/blob/74390628e336b0954b3983ea8f4e0519b1ee5c0d/extensions/pkg/webhook/controlplane/genericmutator/mutator.go#L88]).
The `containerd-intializer` systemd unit has been deprecated and will remain functionless on existing nodes for compatibility reasons.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

gardener-prow[bot] avatar Jul 02 '24 07:07 gardener-prow[bot]

/test pull-gardener-e2e-kind

timuthy avatar Jul 02 '24 12:07 timuthy

/assign

ialidzhikov avatar Jul 02 '24 12:07 ialidzhikov

/test pull-gardener-e2e-kind

timuthy avatar Jul 02 '24 19:07 timuthy

/test pull-gardener-e2e-kind-ha-single-zone-upgrade

timuthy avatar Jul 02 '24 20:07 timuthy

/test pull-gardener-e2e-kind-ipv6

timuthy avatar Jul 05 '24 15:07 timuthy

/test pull-gardener-e2e-kind

timuthy avatar Jul 09 '24 12:07 timuthy

/lgtm

ScheererJ avatar Jul 09 '24 12:07 ScheererJ

LGTM label has been added.

Git tree hash: a9c7c26d2cfbc4f29ced0ff0f85f1feb4c52cec3

gardener-prow[bot] avatar Jul 09 '24 12:07 gardener-prow[bot]

/retest-required

timuthy avatar Jul 10 '24 06:07 timuthy

@dimitar-kostadinov, as you already have the required adaptations for the registry-cache extension, can you double check that the latest revision of the PR satisfies the requirements from the registry-cache side? Maybe running our e2e tests locally would be enough.

The e2e test should create Shoot with registry-cache extension enabled with caches for Shoot system components, delete Shoot fails with following node-agent error:

Jul 10 11:02:04 machine-shoot--local--e2e-cache-ssc-local-68ccb-bdf6w gardener-node-agent[333]: {"level":"error","ts":"2024-07-10T11:02:04.594Z","msg":"Reconciler error","controller":"operatingsystemconfig","namespace":"kube-system","name":"gardener-node-agent-local-ee46034b8269353b","reconcileID":"a9021fe9-e4a8-47d6-8a4f-c996eb660c22","error":"failed reconciling containerd registries: retry failed with context deadline exceeded, last error: failed to reach registry http://10.4.219.36:5000 for upstream registry.k8s.io: Get \"http://10.4.219.36:5000\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

The pods in kube-system namespace are Pending status and the machine pod is not ready:

% k -n kube-system get po -A
NAMESPACE     NAME                                                  READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-56f86cc77b-92dtc              0/1     Pending   0          29m
kube-system   calico-node-vertical-autoscaler-6cf9c8bf8d-w6qjr      0/1     Pending   0          29m
kube-system   calico-typha-deploy-5f6c5b7db-hdljm                   0/1     Pending   0          29m
kube-system   calico-typha-horizontal-autoscaler-78ff9f78d8-hftwn   0/1     Pending   0          29m
kube-system   calico-typha-vertical-autoscaler-7c87c474d6-2swcq     0/1     Pending   0          29m
...

% k -n shoot--local--e2e-cache-ssc get po machine-shoot--local--e2e-cache-ssc-local-68ccb-bdf6w
NAME                                                    READY   STATUS    RESTARTS   AGE
machine-shoot--local--e2e-cache-ssc-local-68ccb-bdf6w   0/1     Running   0          10m

All other e2e tests pass OK.

dimitar-kostadinov avatar Jul 10 '24 11:07 dimitar-kostadinov

/test pull-gardener-integration

timuthy avatar Jul 12 '24 06:07 timuthy

Thanks again for the thorough review and feedback @ialidzhikov and @dimitar-kostadinov. PTAL.

timuthy avatar Jul 12 '24 07:07 timuthy

LGTM label has been added.

Git tree hash: c23d629292b7102d7bb7a7b6e0fd04d363f23333

gardener-prow[bot] avatar Jul 12 '24 10:07 gardener-prow[bot]

/approve

timuthy avatar Jul 15 '24 13:07 timuthy

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dimitar-kostadinov, timuthy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

gardener-prow[bot] avatar Jul 15 '24 13:07 gardener-prow[bot]

/test pull-gardener-e2e-kind

timuthy avatar Jul 16 '24 05:07 timuthy