cluster-api
📖 Add In-place updates proposal
What this PR does / why we need it: Proposal doc for In-place updates written by the In-place updates feature group.
Starting this as a draft to collect early feedback on the main ideas and high-level flow. APIs and some other lower-level details are purposefully left as TODOs to focus the conversation on the rest of the doc, speed up consensus, and avoid rework.
Fixes #9489
/area documentation
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
Hey folks :wave:
@g-gaston Dropping by from the Flatcar Container Linux project - we're a container optimised Linux distro; we joined the CNCF a few weeks ago (incubating).
We've been driving implementation spikes of in-place OS and Kubernetes updates in ClusterAPI for some time, approaching the problem from the OS level. Your proposal looks great from our point of view.
While progress has been slower in recent months due to project resource constraints, Flatcar has working proof-of-concept implementations for both in-place updating the OS and Kubernetes - independently. Our implementation is near production-ready at the OS level, update activation can be coordinated via kured, and the worker cluster control plane picks up the correct versions. We still lack any signalling to the management cluster, as well as more advanced features like coordinated roll-backs (though these would be easy to implement at the OS level).
In theory, our approach of in-place Kubernetes updates is distro agnostic (given the "mutable sysext" changes in recent versions of systemd starting with release 256).
We presented our work in a CAPZ office hours call earlier this year: https://youtu.be/Fpn-E9832UQ?feature=shared&t=164 (slide deck: https://drive.google.com/file/d/1MfBQcRvGHsb-xNU3g_MqvY4haNJl-WY2/view).
We hope our work can provide some insights that help to further flesh out this proposal. Happy to chat if folks are interested.
(CC: @tormath1 for visibility)
EDIT after initial feedback from @neolit123 : in-place updates of Kubernetes in CAPI are at the "proof of concept" stage. Just using sysexts to ship Kubernetes (with and without CAPI) has been in production on (at least) Flatcar for quite some time. Several CAPI providers (OpenStack, Linode) use sysexts as the preferred mechanism for Flatcar worker nodes.
systemd-sysext
i don't think i've seen usage of sysext with k8s. its provisioning of image extensions seems like something users can do, but they might as well stick to the vanilla way of using the k8s package registries and employing update scripts for e.g. containerd.
the kubeadm upgrade docs just leverage the package manager upgrade path: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
one concern that i have with systemd-sysext is that you still have an intermediate build process for the extension, while the k8s package build process is already done by the k8s release folks.
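for context, that package manager flow from the linked docs looks roughly like this on a debian-based control-plane node (a rough sketch: version strings are illustrative, and the drain/uncordon steps are omitted):

```bash
# 1. upgrade kubeadm first (debian/apt flavor shown)
apt-mark unhold kubeadm && \
  apt-get update && apt-get install -y kubeadm='1.31.1-*' && \
  apt-mark hold kubeadm

# 2. plan and apply the control plane upgrade
kubeadm upgrade plan
kubeadm upgrade apply v1.31.1

# 3. then upgrade kubelet/kubectl and restart the kubelet
apt-mark unhold kubelet kubectl && \
  apt-get install -y kubelet='1.31.1-*' kubectl='1.31.1-*' && \
  apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet
```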
On Flatcar, sysexts are the preferred way to run Kubernetes. "Packaging" is straightforward - create a filesystem from a subdirectory - and does not require any distro specific information. The resulting sysext can be used across many distros.
I'd argue that the overhead is negligible: download release binaries into a sub-directory and run mksquashfs. We might even evangelise sysext releases with k8s upstream if this is a continued concern.
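To make that concrete, here is a minimal sketch of the packaging step (the staging directory, image name and version are illustrative; this is not the exact sysext-bakery recipe):

```bash
#!/usr/bin/env bash
set -euo pipefail

KUBE_VERSION="v1.31.1"          # illustrative
ROOT="kubernetes-sysext"        # staging sub-directory

# 1. download the release binaries into a /usr prefix inside the sub-directory
mkdir -p "${ROOT}/usr/bin" "${ROOT}/usr/lib/extension-release.d"
for bin in kubeadm kubelet kubectl; do
  curl -fsSLo "${ROOT}/usr/bin/${bin}" \
    "https://dl.k8s.io/release/${KUBE_VERSION}/bin/linux/amd64/${bin}"
  chmod +x "${ROOT}/usr/bin/${bin}"
done

# 2. sysext metadata; ID=_any keeps the image distro-agnostic
cat > "${ROOT}/usr/lib/extension-release.d/extension-release.kubernetes" <<EOF
ID=_any
ARCHITECTURE=x86-64
EOF

# 3. turn the sub-directory into a sysext image
mksquashfs "${ROOT}" kubernetes.raw -all-root
```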
Drawbacks of the package-manager based update process, by contrast, are:
- intermediate state: no atomic updates, recovery required if update process fails
- distro specific: needs to be re-implemented for every distro
- no easy roll-back: going back to a previous version (e.g. because a new release causes issues with user workloads) is complicated and risky (again, intermediate state)
Sysexts are already used by the ClusterAPI OpenStack and the Linode providers with Flatcar (though without in-place updates).
> On Flatcar, sysexts are the preferred way to run Kubernetes. "Packaging" is straightforward - create a filesystem from a subdirectory - and does not require any distro specific information. The resulting sysext can be used across many distros.
the kubeadm and kubelet systemd drop-in files (in the official k8s packages) have some distro specific nuances like Debian vs RedHat paths. is sysext capable of managing different drop-in files if the target distro is different, perhaps even detecting that automatically?
> the kubeadm and kubelet systemd drop-in files (in the official k8s packages) have some distro specific nuances like Debian vs RedHat paths. is sysext capable of managing different drop-in files if the target distro is different, perhaps even detecting that automatically?
Sysexts focus on shipping application bits (Kubernetes in the case at hand); configuration is usually supplied by separate means. That said, a complementary image-based configuration mechanism ("confext") exists for /etc. Both approaches have their pros and cons; I'd say it depends on the specifics (I'm not very familiar with kubeadm on Debian vs. Red Hat, I'm more of an OS person :) ). But this should by no means be a blocker.
(Sorry for the sysext nerd sniping. I think we should stick to the topic of this PR - I merely wanted to raise that we have a working PoC of in-place Kubernetes updates. Happy to discuss Kubernetes sysexts elsewhere)
> Sysexts focus on shipping application bits (Kubernetes in the case at hand); configuration is usually supplied by separate means. That said, a complementary image-based configuration mechanism ("confext") exists for /etc. Both approaches have their pros and cons; I'd say it depends on the specifics (I'm not very familiar with kubeadm on Debian vs. Red Hat, I'm more of an OS person :) ). But this should by no means be a blocker.
while the nuances between distros are subtle in the k8s packages, the drop-in files are critical. i won't argue if they are config or not, but if kubeadm and systemd are used, e.g. without /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf the kubelet / kubeadm integration breaks:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/kubelet-integration/#the-kubelet-drop-in-file-for-systemd
> (Sorry for the sysext nerd sniping. I think we should stick to the topic of this PR - I merely wanted to raise that we have a working PoC of in-place Kubernetes updates. Happy to discuss Kubernetes sysexts elsewhere)
i think it's a useful POV. perhaps @g-gaston has comments on the sysext topic. although, this proposal is more about the CAPI integration of the in-place upgrade concept.
> while the nuances between distros are subtle in the k8s packages, the drop-in files are critical. i won't argue if they are config or not, but if kubeadm and systemd are used, e.g. without /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf the kubelet / kubeadm integration breaks: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/kubelet-integration/#the-kubelet-drop-in-file-for-systemd
Shipping this file in a sysext is straightforward. In fact, the kubernetes sysexts we publish in our "sysext bakery" include it.
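For illustration, placing that drop-in inside a sysext tree looks roughly like this (a sketch: the drop-in content is the Debian-flavor file from the official packages, and ${ROOT} is whatever staging directory the image build uses):

```bash
# ship the kubeadm drop-in inside the sysext so it overlays /usr when the image is merged
# (the kubelet.service unit itself still has to come from the OS or from the same sysext)
mkdir -p "${ROOT}/usr/lib/systemd/system/kubelet.service.d"
cat > "${ROOT}/usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf" <<'EOF'
# Debian-flavor drop-in from the official packages; the RPM variant reads /etc/sysconfig/kubelet instead
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
EOF
```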
> i think it's a useful POV. perhaps @g-gaston has comments on the sysext topic. although, this proposal is more about the CAPI integration of the in-place upgrade concept.
That's what originally motivated me to speak up: the proposal appears to discuss the control plane "upper half" that our proof-of-concept implementation lacks. As stated, we're OS folks :) And we're very happy to see this getting some traction.
@t-lo thanks for reaching out! Really appreciated. I agree with your comment that this proposal is tackling one layer of the problem and your work another.
+1 from me to keep the discussion on this PR focused on the first layer.
But it's great to see things are moving for the Flatcar Container Linux project; let's make sure the design work happening here does not prevent using Flatcar's in-place upgrade capabilities (while at the same time making sure it can work with other OSes as well, even the less "cloud native" ones).
It would also be nice to ensure the process is compatible with, or at least gels well with, talos.dev, which is managed completely by a set of controllers that expose just an API. That is useful for single-node, long-lived clusters. As far as I've read, I see no complications for it yet.
Hello folks,
We've briefly discussed systemd-sysext and its potential uses for ClusterAPI in the September 25, 2024 ClusterAPI meeting (https://docs.google.com/document/d/1GgFbaYs-H6J5HSQ6a7n4aKpk0nDLE2hgG2NSOM9YIRw/edit#heading=h.s6d5g3hqxxzt).
Summarising the points made in the meeting so you don't need to watch the recording :wink:. Let's wrap up the sysext discussion in this PR so we can get the focus back to in-place updates. If there's more interest in this technology from ClusterAPI folks, I'm happy to have a separate discussion (here: https://github.com/kubernetes-sigs/cluster-api/discussions/11227).
- systemd-sysext is a distro-independent and vendor-independent way of shipping Kubernetes for ClusterAPI. While it doesn't have much traction with CAPI providers at this time, it is supported by a wide range of distros and, with recent changes (systemd 256 and above), has become feasible for general purpose distros like Ubuntu. Sysexts allow using stock distro images on vendor clouds, reducing CAPI operators' maintenance load (no custom-built, self-hosted images required).
- Sysexts are easy to adapt to non-systemd distros as they use basic Linux mechanisms ("glorified overlayfs mounts").
- systemd-sysupdate is a complementary service that enables atomic in-place updates of Kubernetes. It is supported on a wide range of distros and likewise relies on basic mechanisms such as HTTPS endpoints, index files, and semver matching. It uses symlinks for staging / applying updates; roll-back is possible by simply sym-linking the previous release. Sysupdate is very easy to integrate with Kubernetes reboot managers like kured (see the sketch after this list).
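A rough sketch of that staging / activation flow (the component name is illustrative, and the roll-back symlink location depends on the sysupdate.d configuration):

```bash
# check for and stage the newest release matching the "kubernetes" sysupdate component
systemd-sysupdate --component=kubernetes check-new
systemd-sysupdate --component=kubernetes update

# activate the staged extension by refreshing the merged /usr overlay
# (on busy nodes the drain/reboot around this is typically coordinated, e.g. via kured)
systemd-sysext refresh

# roll back by pointing the "current" symlink at the previous image and refreshing again
ln -sf kubernetes-v1.30.5.raw /etc/extensions/kubernetes.raw
systemd-sysext refresh
```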
Would this mechanism as proposed allow me to do a node rebuild on clouds that support that, instead of a create/delete? I think from reading the proposal that the answer is yes, but I am not 100% certain...
Mainly, I am thinking about nodes in a bare-metal cloud using OpenStack Ironic (via Nova). We don't want to keep "spare" bare-metal nodes hanging around in order to be able to do an upgrade, and even if we did have a spare node the create/delete cycle would involve "cleaning" each node which can take a while - O(30m) - before it can be reprovisioned into the cluster. Cleaning is intended to make the node suitable for use with another tenant, so can include operations such as secure erase that are totally unnecessary when the node is being recycled back into the same tenant.
OpenStack supports a REBUILD operation on these hosts that basically re-images the node without having to do a delete/create, and I am hoping to use that in the future for these clusters potentially. The plan in this case would not necessarily be to update the Kubernetes components in place, but to trigger a rebuild of the node using a new image with updated Kubernetes components, and having the node rejoin the cluster without having to go through a cleaning cycle.
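For concreteness, the rebuild I have in mind is the one exposed by the OpenStack CLI, roughly (server and image names are illustrative):

```bash
# re-image the node in place with an image carrying the new Kubernetes components,
# avoiding the delete / clean / create cycle of a full replacement
openstack server rebuild --image ubuntu-2204-kube-v1.31 worker-node-3
```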
> Would this mechanism as proposed allow me to do a node rebuild on clouds that support that, instead of a create/delete? I think from reading the proposal that the answer is yes, but I am not 100% certain...
> Mainly, I am thinking about nodes in a bare-metal cloud using OpenStack Ironic (via Nova). We don't want to keep "spare" bare-metal nodes hanging around in order to be able to do an upgrade, and even if we did have a spare node the create/delete cycle would involve "cleaning" each node which can take a while - O(30m) - before it can be reprovisioned into the cluster. Cleaning is intended to make the node suitable for use with another tenant, so can include operations such as secure erase that are totally unnecessary when the node is being recycled back into the same tenant.
> OpenStack supports a REBUILD operation on these hosts that basically re-images the node without having to do a delete/create, and I am hoping to use that in the future for these clusters potentially. The plan in this case would not necessarily be to update the Kubernetes components in place, but to trigger a rebuild of the node using a new image with updated Kubernetes components, and having the node rejoin the cluster without having to go through a cleaning cycle.
Yes, that should be doable. That said, although I'm not familiar with the rebuild functionality, it sounds like something the infra provider could implement today, without the in-place update functionality.
/lgtm
@anmazzotti: changing LGTM is restricted to collaborators
In response to this:
> /lgtm
Would it make sense to move this PR out of draft status?
I think this goes in the right direction
We discussed this today at the office hours; the plan is to merge by lazy consensus 1 or 2 weeks after KubeCon.
FYI we discussed the topic at KubeCon and all the people present confirmed that there are no blockers in merging in 1 or 2 weeks.
@g-gaston please close as many comments as possible
> FYI we discussed the topic at KubeCon and all the people present confirmed that there are no blockers in merging in 1 or 2 weeks.
> @g-gaston please close as many comments as possible
addressed all comments :)
Please feel free to add me as reviewer. From the OS / plumbing level side this looks great! Can't wait to have it in CAPI.
Awesome work team! Designing a solution and reaching consensus on next steps for such a complex topic is a great achievement!
Looking forward to the implementation of what is in scope for this first iteration of the proposal.
/lgtm
/approve
/hold
As per the office hours discussion, I will lift the hold before EOW.
LGTM label has been added.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini
/hold cancel
As a maintainer of CloudNativePG (a CNCF Sandbox project), I want to note that in-place upgrades are critical for database workloads that use local storage.