cluster-api
📖 Add In-place updates proposal
What this PR does / why we need it: Proposal doc for In-place updates written by the In-place updates feature group.
Starting this as a draft to collect early feedback on the main ideas and high-level flow. APIs and some other lower-level details are purposefully left as TODOs to focus the conversation on the rest of the doc, speed up consensus, and avoid rework.
Fixes #9489
/area documentation
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
Hey folks :wave:
@g-gaston Dropping by from the Flatcar Container Linux project - we're a container optimised Linux distro; we joined the CNCF a few weeks ago (incubating).
We've been driving implementation spikes of in-place OS and Kubernetes updates in ClusterAPI for some time, approaching the problem from the OS level. Your proposal looks great from our point of view.
While progress has been slower in recent months due to project resource constraints, Flatcar has working proof-of-concept implementations for both in-place updating the OS and Kubernetes - independently. Our implementation is near production-ready at the OS level, update activation can be coordinated via kured, and the worker cluster control plane picks up the correct versions. We still lack any signalling to the management cluster, as well as more advanced features like coordinated roll-backs (though these would be easy to implement at the OS level).
In theory, our approach of in-place Kubernetes updates is distro agnostic (given the "mutable sysext" changes in recent versions of systemd starting with release 256).
We presented our work in a CAPZ office hours call earlier this year: https://youtu.be/Fpn-E9832UQ?feature=shared&t=164 (slide deck: https://drive.google.com/file/d/1MfBQcRvGHsb-xNU3g_MqvY4haNJl-WY2/view).
We hope our work can provide some insights that help to further flesh out this proposal. Happy to chat if folks are interested.
(CC: @tormath1 for visibility)
EDIT after initial feedback from @neolit123 : in-place updates of Kubernetes in CAPI are at the "proof of concept" stage. Just using sysexts to ship Kubernetes (with and without CAPI) has been in production on (at least) Flatcar for quite some time. Several CAPI providers (OpenStack, Linode) use sysexts as the preferred mechanism for Flatcar worker nodes.
systemd-sysext
i don't think i've seen usage of sysext with k8s. its provisioning of image extensions seems like something users can do, but they might as well stick to the vanilla way of using the k8s package registries and employing update scripts for e.g. containerd.
the kubeadm upgrade docs just leverage the package manager upgrade path: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
one concern that i have with systemd-sysext is that you still have an intermediate build process for the extension, while the k8s package build process is already done by the k8s release folks.
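for context, that package manager flow from the linked docs looks roughly like this on a debian-based control-plane node (a rough sketch: version strings are illustrative, and the drain/uncordon steps are omitted):

```bash
# 1. upgrade kubeadm first (debian/apt flavor shown)
apt-mark unhold kubeadm && \
  apt-get update && apt-get install -y kubeadm='1.31.1-*' && \
  apt-mark hold kubeadm

# 2. plan and apply the control plane upgrade
kubeadm upgrade plan
kubeadm upgrade apply v1.31.1

# 3. then upgrade kubelet/kubectl and restart the kubelet
apt-mark unhold kubelet kubectl && \
  apt-get install -y kubelet='1.31.1-*' kubectl='1.31.1-*' && \
  apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet
```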
On Flatcar, sysexts are the preferred way to run Kubernetes. "Packaging" is straightforward - create a filesystem from a subdirectory - and does not require any distro specific information. The resulting sysext can be used across many distros.
I'd argue that the overhead is negligible: download release binaries into a sub-directory and run mksquashfs. We might even evangelise sysext releases with k8s upstream if this is a continued concern.
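To make that concrete, here is a minimal sketch of the packaging step (the staging directory, image name and version are illustrative; this is not the exact sysext-bakery recipe):

```bash
#!/usr/bin/env bash
set -euo pipefail

KUBE_VERSION="v1.31.1"          # illustrative
ROOT="kubernetes-sysext"        # staging sub-directory

# 1. download the release binaries into a /usr prefix inside the sub-directory
mkdir -p "${ROOT}/usr/bin" "${ROOT}/usr/lib/extension-release.d"
for bin in kubeadm kubelet kubectl; do
  curl -fsSLo "${ROOT}/usr/bin/${bin}" \
    "https://dl.k8s.io/release/${KUBE_VERSION}/bin/linux/amd64/${bin}"
  chmod +x "${ROOT}/usr/bin/${bin}"
done

# 2. sysext metadata; ID=_any keeps the image distro-agnostic
cat > "${ROOT}/usr/lib/extension-release.d/extension-release.kubernetes" <<EOF
ID=_any
ARCHITECTURE=x86-64
EOF

# 3. turn the sub-directory into a sysext image
mksquashfs "${ROOT}" kubernetes.raw -all-root
```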
Drawbacks of the package-manager based update process, by contrast, are:
- intermediate state: no atomic updates, recovery required if update process fails
- distro specific: needs to be re-implemented for every distro
- no easy roll-back: going back to a previous version (e.g. because a new release causes issues with user workloads) is complicated and risky (again, intermediate state)
Sysexts are already used by the ClusterAPI OpenStack and the Linode providers with Flatcar (though without in-place updates).
> On Flatcar, sysexts are the preferred way to run Kubernetes. "Packaging" is straightforward - create a filesystem from a subdirectory - and does not require any distro specific information. The resulting sysext can be used across many distros.
the kubeadm and kubelet systemd drop-in files (in the official k8s packages) have some distro specific nuances like Debian vs RedHat paths. is sysext capable of managing different drop-in files if the target distro is different, perhaps even detecting that automatically?
> the kubeadm and kubelet systemd drop-in files (in the official k8s packages) have some distro specific nuances like Debian vs RedHat paths. is sysext capable of managing different drop-in files if the target distro is different, perhaps even detecting that automatically?
Sysexts focus on shipping application bits (Kubernetes in the case at hand); configuration is usually supplied by separate means. That said, a complementary image-based configuration mechanism ("confext") exists for /etc. Both approaches have their pros and cons; I'd say it depends on the specifics (I'm not very familiar with kubeadm on Debian vs. Red Hat, I'm more of an OS person :) ). But this should by no means be a blocker.
(Sorry for the sysext nerd sniping. I think we should stick to the topic of this PR - I merely wanted to raise that we have a working PoC of in-place Kubernetes updates. Happy to discuss Kubernetes sysexts elsewhere)
> Sysexts focus on shipping application bits (Kubernetes in the case at hand); configuration is usually supplied by separate means. That said, a complementary image-based configuration mechanism ("confext") exists for /etc. Both approaches have their pros and cons; I'd say it depends on the specifics (I'm not very familiar with kubeadm on Debian vs. Red Hat, I'm more of an OS person :) ). But this should by no means be a blocker.
while the nuances between distros are subtle in the k8s packages, the drop-in files are critical. i won't argue if they are config or not, but if kubeadm and systemd are used, e.g. without /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf the kubelet / kubeadm integration breaks:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/kubelet-integration/#the-kubelet-drop-in-file-for-systemd
> (Sorry for the sysext nerd sniping. I think we should stick to the topic of this PR - I merely wanted to raise that we have a working PoC of in-place Kubernetes updates. Happy to discuss Kubernetes sysexts elsewhere)
i think it's a useful POV. perhaps @g-gaston has comments on the sysext topic. although, this proposal is more about the CAPI integration of the in-place upgrade concept.
> while the nuances between distros are subtle in the k8s packages, the drop-in files are critical. i won't argue if they are config or not, but if kubeadm and systemd are used, e.g. without /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf the kubelet / kubeadm integration breaks: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/kubelet-integration/#the-kubelet-drop-in-file-for-systemd
Shipping this file in a sysext is straightforward. In fact, the kubernetes sysexts we publish in our "sysext bakery" include it.
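For illustration, placing that drop-in inside a sysext tree looks roughly like this (a sketch: the drop-in content is the Debian-flavor file from the official packages, and ${ROOT} is whatever staging directory the image build uses):

```bash
# ship the kubeadm drop-in inside the sysext so it overlays /usr when the image is merged
# (the kubelet.service unit itself still has to come from the OS or from the same sysext)
mkdir -p "${ROOT}/usr/lib/systemd/system/kubelet.service.d"
cat > "${ROOT}/usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf" <<'EOF'
# Debian-flavor drop-in from the official packages; the RPM variant reads /etc/sysconfig/kubelet instead
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
EOF
```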
> i think it's a useful POV. perhaps @g-gaston has comments on the sysext topic. although, this proposal is more about the CAPI integration of the in-place upgrade concept.
That's what originally motivated me to speak up: the proposal appears to discuss the control plane "upper half" that our proof-of-concept implementation lacks. As stated, we're OS folks :) And we're very happy to see this getting some traction.
@t-lo thanks for reaching out! Really appreciated. I agree with your comment that this proposal is tackling one layer of the problem and your work another.
+1 from me to keep the discussion on this PR focused on the first layer.
But it's great to see things are moving for the Flatcar Container Linux project; let's make sure the design work happening here does not prevent using Flatcar's in-place upgrade capabilities (while at the same time making sure it can work with other OSes as well, even the less "cloud native" ones).
It would also be nice to ensure the process is compatible with, or at least gels well with, talos.dev, which is managed completely by a set of controllers that expose just an API. That is useful for single-node, long-lived clusters. As far as I've read, I see no complications for it yet.
Hello folks,
We've briefly discussed systemd-sysext and its potential uses for ClusterAPI in the September 25, 2024 ClusterAPI meeting (https://docs.google.com/document/d/1GgFbaYs-H6J5HSQ6a7n4aKpk0nDLE2hgG2NSOM9YIRw/edit#heading=h.s6d5g3hqxxzt).
Summarising the points made in the meeting so you don't need to watch the recording :wink:. Let's wrap up the sysext discussion in this PR so we can get the focus back to in-place updates. If there's more interest in this technology from ClusterAPI folks, I'm happy to have a separate discussion (here: https://github.com/kubernetes-sigs/cluster-api/discussions/11227).
- systemd-sysext is a distro-independent and vendor-independent way of shipping Kubernetes for ClusterAPI. While it doesn't have much traction with CAPI providers at this time, it is supported by a wide range of distros and, with recent changes (systemd 256 and above), has become feasible for general purpose distros like Ubuntu. Sysexts allow using stock distro images on vendor clouds, reducing CAPI operators' maintenance load (no custom-built, self-hosted images required).
- Sysexts are easy to adapt to non-systemd distros as they use basic Linux mechanisms ("glorified overlayfs mounts").
- systemd-sysupdate is a complementary service that enables atomic in-place updates of Kubernetes. It is supported on a wide range of distros and likewise relies on basic mechanisms such as HTTPS endpoints, index files, and semver matching. It uses symlinks for staging / applying updates; roll-back is possible by simply sym-linking the previous release. Sysupdate is very easy to integrate with Kubernetes reboot managers like kured (see the sketch after this list).
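A rough sketch of that staging / activation flow (the component name is illustrative, and the roll-back symlink location depends on the sysupdate.d configuration):

```bash
# check for and stage the newest release matching the "kubernetes" sysupdate component
systemd-sysupdate --component=kubernetes check-new
systemd-sysupdate --component=kubernetes update

# activate the staged extension by refreshing the merged /usr overlay
# (on busy nodes the drain/reboot around this is typically coordinated, e.g. via kured)
systemd-sysext refresh

# roll back by pointing the "current" symlink at the previous image and refreshing again
ln -sf kubernetes-v1.30.5.raw /etc/extensions/kubernetes.raw
systemd-sysext refresh
```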
Would this mechanism as proposed allow me to do a node rebuild on clouds that support that, instead of a create/delete? I think from reading the proposal that the answer is yes, but I am not 100% certain...
Mainly, I am thinking about nodes in a bare-metal cloud using OpenStack Ironic (via Nova). We don't want to keep "spare" bare-metal nodes hanging around in order to be able to do an upgrade, and even if we did have a spare node the create/delete cycle would involve "cleaning" each node which can take a while - O(30m) - before it can be reprovisioned into the cluster. Cleaning is intended to make the node suitable for use with another tenant, so can include operations such as secure erase that are totally unnecessary when the node is being recycled back into the same tenant.
OpenStack supports a REBUILD operation on these hosts that basically re-images the node without having to do a delete/create, and I am hoping to use that in the future for these clusters potentially. The plan in this case would not necessarily be to update the Kubernetes components in place, but to trigger a rebuild of the node using a new image with updated Kubernetes components, and having the node rejoin the cluster without having to go through a cleaning cycle.
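For concreteness, the rebuild I have in mind is the one exposed by the OpenStack CLI, roughly (server and image names are illustrative):

```bash
# re-image the node in place with an image carrying the new Kubernetes components,
# avoiding the delete / clean / create cycle of a full replacement
openstack server rebuild --image ubuntu-2204-kube-v1.31 worker-node-3
```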
> Would this mechanism as proposed allow me to do a node rebuild on clouds that support that, instead of a create/delete? I think from reading the proposal that the answer is yes, but I am not 100% certain...
> Mainly, I am thinking about nodes in a bare-metal cloud using OpenStack Ironic (via Nova). We don't want to keep "spare" bare-metal nodes hanging around in order to be able to do an upgrade, and even if we did have a spare node the create/delete cycle would involve "cleaning" each node which can take a while - O(30m) - before it can be reprovisioned into the cluster. Cleaning is intended to make the node suitable for use with another tenant, so can include operations such as secure erase that are totally unnecessary when the node is being recycled back into the same tenant.
> OpenStack supports a REBUILD operation on these hosts that basically re-images the node without having to do a delete/create, and I am hoping to use that in the future for these clusters potentially. The plan in this case would not necessarily be to update the Kubernetes components in place, but to trigger a rebuild of the node using a new image with updated Kubernetes components, and having the node rejoin the cluster without having to go through a cleaning cycle.
Yes, that should be doable. That said, although I'm not familiar with the rebuild functionality, it sounds like something the infra provider could implement today, without the in-place update functionality.
/lgtm
@anmazzotti: changing LGTM is restricted to collaborators
In response to this:
> /lgtm
Would it make sense to move this PR out of draft status?
I think this goes in the right direction
We discussed this today at the office hours; the plan is to merge by lazy consensus 1 or 2 weeks after KubeCon.
FYI we discussed the topic at KubeCon and all the people present confirmed that there are no blockers in merging in 1 or 2 weeks.
@g-gaston please close as many comments as possible
> FYI we discussed the topic at KubeCon and all the people present confirmed that there are no blockers in merging in 1 or 2 weeks.
> @g-gaston please close as many comments as possible
addressed all comments :)
Please feel free to add me as reviewer. From the OS / plumbing level side this looks great! Can't wait to have it in CAPI.
Awesome work team! Designing a solution and reaching consensus on next steps for such a complex topic is a great achievement!
Looking forward to the implementation of what is in scope for this first iteration of the proposal.
/lgtm
/approve
/hold
As per the office hours discussion, I will lift the hold before EOW.
LGTM label has been added.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini
/hold cancel
As a maintainer of CloudNativePG (a CNCF Sandbox project), I want to note that in-place upgrades are critical for database workloads that use local storage.