Tracking issue for In-place Updates implementation

Open alexander-demicev opened this issue 6 months ago • 9 comments

This is a tracking issue for the implementation of in-place updates. At the moment, it only covers the work required for the initial phase of the project to reach the experimental (alpha) stage.

The design and approach are described in the in-place updates proposal.

  • [ ] Add InPlaceUpdates feature gate

  • [ ] Introduce Runtime Hook API changes (see the examples section of the proposal for details)

    • [ ] Add ExternalUpdate runtime hook
    • [ ] Define API contract for CanUpdateMachineRequest and CanUpdateMachineResponse
    • [ ] Define API contract for UpdateMachineRequest and UpdateMachineResponse
  • [ ] Create reference external updater (CAPD Kubeadm Updater)

    • [ ] Set up the project structure
    • [ ] Implement handlers for both runtime endpoints
    • [ ] Implement container commands for upgrading Kubernetes components
    • [ ] Provide config samples for the updater
  • [ ] Modify core controllers

  • [ ] Set up E2E testing

    • [ ] Implement E2E suite that uses the CAPD Kubeadm Updater
  • [ ] Create documentation

    • [ ] Feature flag configuration
    • [ ] New runtime hook and its API contracts
    • [ ] Updater structure and logic
    • [ ] Guide to implementing extensions
    • [ ] Explanation of the CAPD Kubeadm Updater
    • [ ] Tutorials for usage

alexander-demicev avatar May 26 '25 11:05 alexander-demicev

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 26 '25 11:05 k8s-ci-robot

@alexander-demicev: The label(s) /label kind/design cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor, ci-short, ci-extended, ci-full. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label kind/design

k8s-ci-robot avatar May 26 '25 11:05 k8s-ci-robot

/kind design

alexander-demicev avatar May 26 '25 11:05 alexander-demicev

/help

chrischdi avatar May 28 '25 14:05 chrischdi

@chrischdi: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

k8s-ci-robot avatar May 28 '25 14:05 k8s-ci-robot

Hi @chrischdi, I am interested in working on part of the implementation. Could you please assign me something from the implementation tasks? Thank you.

arshadd-b avatar Jun 02 '25 06:06 arshadd-b

@arshadd-b thanks for offering to help.

The topic discussed in this issue is very complex. Even though we have a proposal defining the high-level direction we want to take, plus this tracking issue, my impression is that we're not yet at a stage where we can provide straightforward suggestions on exactly what to do. The help we need here is to figure out this next level of understanding and, most probably, to do some experiments (sorry if we did not make that clear when adding /help during our periodic triage session).

Also, considering that the work on v1beta2 will take most of the maintainers' and contributors' focus for this release cycle, my personal assumption is that we will do most of this work in the next release cycle, but I have not discussed this with the rest of the team.

If you are looking for something to start with, it may be worth considering other low-hanging fruit in our backlog.

fabriziopandini avatar Jun 02 '25 08:06 fabriziopandini

@fabriziopandini Should we wait before starting the implementation? Or can we already start opening PRs? It's fine if they stay open for some time if maintainers don't have resources.

alexander-demicev avatar Jun 02 '25 15:06 alexander-demicev

@alexander-demicev feel free to open PRs. I'm trying to free up some of my time for this work, but completing v1beta2 is the priority now.

fabriziopandini avatar Jun 05 '25 17:06 fabriziopandini

Is there perhaps any time estimate when this will be ready for testing?

I am asking because our use case for in-place updates would be to avoid re-imaging bare-metal machines that act as CAPI KubeVirt hypervisors for child clusters. To clarify: we want to run a management cluster that will spin up an HA control plane and manage a set of bare-metal machines that will act as our CAPI backbone. Using RollingUpgrade would be very disruptive to the CAPI-managed VMs (workers and CPs), as we don't intend to invest in RWX storage.

broboa avatar Sep 01 '25 14:09 broboa

Is there perhaps any time estimate when this will be ready for testing?

The current idea is to have something that can be used with CAPI v1.12.0 in December (the exact scope of the feature for this release is TBD; we will try to get as much done as possible by then).

sbueringer avatar Sep 17 '25 17:09 sbueringer

@stmcginnis We talked a bit about e2e tests in the sync today. Here are some initial ideas.

The main goal is to verify that CAPI is capable of orchestrating in-place updates. What kind of in-place update we are doing, and how the in-place update extension implements it, is not that important.

For a first iteration we think we should just test a simple in-place update, i.e.:

  • Create Cluster
    • with the upgrades-runtimesdk flavor that is using the clusterclass-quick-start-runtimesdk ClusterClass
    • with maxSurge: 0 on KCP
    • with maxUnavailable: 1 on MD
  • Trigger an in-place update that changes files in KCP / KubeadmConfigTemplate by modifying Cluster.spec.topology.variables
  • Check that all Machines have been updated in-place, i.e.:
    • we have to check that the Machines are still the "same" (same Machine "names" as before the update was triggered)
    • we have to check that all KubeadmConfigs of all Machines contain the change
    • check that the in-place update is completed by checking conditions
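
The "same Machine names" check from the list above could be sketched in Go roughly as follows. This is only a minimal sketch, not code from the CAPI e2e framework: machineNamesUnchanged and replacedMachines are hypothetical helper names, and the sketch assumes the test has already collected the Machine names into plain string slices before and after the update was triggered.

```go
package main

// machineNamesUnchanged reports whether the Machines observed after the
// update carry exactly the same names as before it (order is ignored).
// Hypothetical helper: a rollout that replaces Machines would produce new
// names, so equal name sets are used here as a proxy for "updated in place".
func machineNamesUnchanged(before, after []string) bool {
	if len(before) != len(after) {
		return false
	}
	// Count names seen before the update, then consume them with the names
	// seen afterwards; any name without a remaining count is a new Machine.
	counts := make(map[string]int, len(before))
	for _, name := range before {
		counts[name]++
	}
	for _, name := range after {
		if counts[name] == 0 {
			return false
		}
		counts[name]--
	}
	return true
}

// replacedMachines returns the names that existed before the update but are
// gone afterwards, which is useful for a readable test failure message.
func replacedMachines(before, after []string) []string {
	present := make(map[string]struct{}, len(after))
	for _, name := range after {
		present[name] = struct{}{}
	}
	var missing []string
	for _, name := range before {
		if _, ok := present[name]; !ok {
			missing = append(missing, name)
		}
	}
	return missing
}
```

In a test, the names would be captured once before modifying Cluster.spec.topology.variables and again once the update is reported as complete; the remaining checks (KubeadmConfig content and conditions) are not covered by this sketch.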

Tasks:

  • Extend clusterclass-quick-start-runtimesdk ClusterClass by adding a variable and a corresponding patch to the RuntimeExtension (test/extension/handlers/topologymutation/handler.go)
  • Extend Runtime Extension to implement the following hooks: CanUpdateMachine, CanUpdateMachineSet, UpdateMachine
    • CanUpdateMachine/CanUpdateMachineSet should respond with a patch that expresses that the Runtime Extension can change the files
    • UpdateMachine: initially return that the update is in progress, then return success after some time (we don't actually have to do the in-place update on the Machines)
    • For some inspiration see: https://github.com/sbueringer/cluster-api/tree/pr-kcp-in-place
  • Implement e2e test (as described above)
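
The UpdateMachine behaviour described in the tasks above (report "in progress" first, then success after some time, without touching the Machine) could be simulated along these lines. This is a sketch under stated assumptions and does not use the real Runtime SDK types: updateState, fakeUpdater, newFakeUpdater and manualClock are all hypothetical names, and the actual request and response contracts are the ones defined in the proposal.

```go
package main

import (
	"sync"
	"time"
)

// updateState is a hypothetical stand-in for the status carried in the real
// hook response.
type updateState string

const (
	stateInProgress updateState = "InProgress"
	stateDone       updateState = "Done"
)

// manualClock is a clock that only moves when Advance is called, so tests
// do not have to sleep.
type manualClock struct {
	mu  sync.Mutex
	now time.Time
}

func (c *manualClock) Now() time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.now
}

func (c *manualClock) Advance(seconds int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.now = c.now.Add(time.Duration(seconds) * time.Second)
}

// fakeUpdater pretends to update Machines: it records when it first saw a
// Machine and reports completion once the configured duration has elapsed.
type fakeUpdater struct {
	mu       sync.Mutex
	started  map[string]time.Time
	duration time.Duration
	now      func() time.Time
}

func newFakeUpdater(durationSeconds int, now func() time.Time) *fakeUpdater {
	return &fakeUpdater{
		started:  map[string]time.Time{},
		duration: time.Duration(durationSeconds) * time.Second,
		now:      now,
	}
}

// UpdateMachine is meant to be called repeatedly for the same Machine.
// The first call starts the simulated update; later calls report progress.
func (u *fakeUpdater) UpdateMachine(machineName string) updateState {
	u.mu.Lock()
	defer u.mu.Unlock()

	start, seen := u.started[machineName]
	if !seen {
		u.started[machineName] = u.now()
		return stateInProgress
	}
	if u.now().Sub(start) < u.duration {
		return stateInProgress
	}
	return stateDone
}
```

Injecting the clock keeps the handler deterministic in tests; a real extension would pass time.Now and would also need to decide when to forget completed Machines, which this sketch leaves out.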

I think all of this can already be implemented today; we just have to comment out some validation steps in the e2e test until the in-place functionality is implemented.

sbueringer avatar Oct 14 '25 14:10 sbueringer