cluster-api

Support distinguishing control plane machines from different rolling updates

Open haijianyang opened this issue 2 years ago • 9 comments

User Story

As a developer and end-user, I want cluster-api to be able to distinguish control plane machines from different rolling updates so that they can be managed in groups.

Detailed Description

Currently, when KCP triggers a rolling update to create new machines, there is no clear distinction between the new machines and the old machines (created before the rolling update).

I want to do some extra work for machines created by a rolling update, but I don't know how to tell which machines are newly created.

// cluster-api/controlplane/kubeadm/internal/controllers/controller.go
// Control plane machines rollout due to configuration changes (e.g. upgrades) takes precedence over other operations.
needRollout := controlPlane.MachinesNeedingRollout()
switch {
case len(needRollout) > 0:
	log.Info("Rolling out Control Plane machines", "needRollout", needRollout.Names())
	conditions.MarkFalse(controlPlane.KCP, controlplanev1.MachinesSpecUpToDateCondition, controlplanev1.RollingUpdateInProgressReason, clusterv1.ConditionSeverityWarning, "Rolling %d replicas with outdated spec (%d replicas up to date)", len(needRollout), len(controlPlane.Machines)-len(needRollout))
	return r.upgradeControlPlane(ctx, cluster, kcp, controlPlane, needRollout)
default:
	// make sure last upgrade operation is marked as completed.
	// NOTE: we are checking the condition already exists in order to avoid to set this condition at the first
	// reconciliation/before a rolling upgrade actually starts.
	if conditions.Has(controlPlane.KCP, controlplanev1.MachinesSpecUpToDateCondition) {
		conditions.MarkTrue(controlPlane.KCP, controlplanev1.MachinesSpecUpToDateCondition)
	}
}

The conditions that trigger a rolling update:

// cluster-api/controlplane/kubeadm/internal/control_plane.go
// MachinesNeedingRollout return a list of machines that need to be rolled out.
func (c *ControlPlane) MachinesNeedingRollout() collections.Machines {
	// Ignore machines to be deleted.
	machines := c.Machines.Filter(collections.Not(collections.HasDeletionTimestamp))

	// Return machines if they are scheduled for rollout or if with an outdated configuration.
	return machines.AnyFilter(
		// Machines whose certificates are about to expire.
		collections.ShouldRolloutBefore(&c.reconciliationTime, c.KCP.Spec.RolloutBefore),
		// Machines that are scheduled for rollout (KCP.Spec.RolloutAfter set, the RolloutAfter deadline is expired, and the machine was created before the deadline).
		collections.ShouldRolloutAfter(&c.reconciliationTime, c.KCP.Spec.RolloutAfter),
		// Machines that do not match with KCP config.
		collections.Not(MatchesMachineSpec(c.infraResources, c.kubeadmConfigs, c.KCP)),
	)
}

// MatchesMachineSpec returns a filter to find all machines that matches with KCP config and do not require any rollout.
// Kubernetes version, infrastructure template, and KubeadmConfig field need to be equivalent.
func MatchesMachineSpec(infraConfigs map[string]*unstructured.Unstructured, machineConfigs map[string]*bootstrapv1.KubeadmConfig, kcp *controlplanev1.KubeadmControlPlane) func(machine *clusterv1.Machine) bool {
	return collections.And(
		func(machine *clusterv1.Machine) bool {
			return matchMachineTemplateMetadata(kcp, machine)
		},
		collections.MatchesKubernetesVersion(kcp.Spec.Version),
		MatchesKubeadmBootstrapConfig(machineConfigs, kcp),
		MatchesTemplateClonedFrom(infraConfigs, kcp),
	)
}
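To make the filter composition above easier to follow, here is a minimal, self-contained sketch of how predicates combined with Not, And, and AnyFilter select the machines that need a rollout. The Machine struct, the local Filter type, and needingRollout are simplified stand-ins written for this illustration; they are not the cluster-api types, and only the version and template checks are modeled.

```go
package main

import "fmt"

// Machine is a simplified stand-in for clusterv1.Machine (assumption: only
// the fields needed for this illustration are modeled).
type Machine struct {
	Name         string
	Version      string
	TemplateName string
	Deleting     bool
}

// Filter is a predicate over a machine.
type Filter func(m Machine) bool

// Not inverts a filter.
func Not(f Filter) Filter {
	return func(m Machine) bool { return !f(m) }
}

// And passes only if every filter passes.
func And(fs ...Filter) Filter {
	return func(m Machine) bool {
		for _, f := range fs {
			if !f(m) {
				return false
			}
		}
		return true
	}
}

// AnyFilter keeps the machines for which at least one filter passes.
func AnyFilter(ms []Machine, fs ...Filter) []Machine {
	var out []Machine
	for _, m := range ms {
		for _, f := range fs {
			if f(m) {
				out = append(out, m)
				break
			}
		}
	}
	return out
}

// needingRollout returns the names of machines that are not being deleted
// and do not match the desired version and template.
func needingRollout(ms []Machine, wantVersion, wantTemplate string) []string {
	matchesSpec := And(
		func(m Machine) bool { return m.Version == wantVersion },
		func(m Machine) bool { return m.TemplateName == wantTemplate },
	)
	// Ignore machines that are already being deleted.
	live := AnyFilter(ms, Not(func(m Machine) bool { return m.Deleting }))
	var names []string
	for _, m := range AnyFilter(live, Not(matchesSpec)) {
		names = append(names, m.Name)
	}
	return names
}

func main() {
	ms := []Machine{
		{Name: "cp-0", Version: "v1.25.0", TemplateName: "tmpl-a"},
		{Name: "cp-1", Version: "v1.26.0", TemplateName: "tmpl-a"},
		{Name: "cp-2", Version: "v1.26.0", TemplateName: "tmpl-b"},
		{Name: "cp-3", Version: "v1.25.0", TemplateName: "tmpl-a", Deleting: true},
	}
	fmt.Println(needingRollout(ms, "v1.26.0", "tmpl-a"))
}
```

Note that the result only says which machines still need replacing. It carries no information about which rollout created the machines that already match.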

When a rolling update is triggered by a Kubernetes version change, we can tell which machines were created by the rolling update by comparing kcp.Spec.Version with machine.Spec.Version (collections.MatchesKubernetesVersion(kcp.Spec.Version)). Rolling updates triggered by the other conditions cannot be distinguished this way.
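As a rough sketch of the version comparison described above: the Machine struct and partitionByVersion are hypothetical names for this example, with Version standing in for machine.Spec.Version and kcpVersion for kcp.Spec.Version. It assumes the rollout was triggered by a version change only.

```go
package main

import "fmt"

// Machine is a simplified stand-in for clusterv1.Machine; Version plays the
// role of machine.Spec.Version (assumption made for this sketch).
type Machine struct {
	Name    string
	Version string
}

// partitionByVersion splits machines into those already at kcpVersion
// (treated as created by the current rollout) and those on any other version.
func partitionByVersion(ms []Machine, kcpVersion string) (rolled, pending []Machine) {
	for _, m := range ms {
		if m.Version == kcpVersion {
			rolled = append(rolled, m)
		} else {
			pending = append(pending, m)
		}
	}
	return rolled, pending
}

func main() {
	ms := []Machine{
		{Name: "cp-old-0", Version: "v1.25.0"},
		{Name: "cp-old-1", Version: "v1.25.0"},
		{Name: "cp-new-0", Version: "v1.26.0"},
	}
	rolled, pending := partitionByVersion(ms, "v1.26.0")
	fmt.Println("rolled:", rolled)
	fmt.Println("pending:", pending)
}
```

If the version is unchanged and the rollout was triggered by something else, every machine lands in the same bucket, which is exactly the gap this issue describes.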

/kind feature

haijianyang Feb 02 '23 08:02

Interesting ideas @haijianyang! Do you have some detail on what use case you're trying to solve with this?

killianmuldoon Feb 02 '23 11:02

For the HA of the cluster, we need to deploy nodes to different physical hosts.

For example, when we deploy a cluster with three control plane nodes in an environment with three physical hosts, we deploy one control plane node on each physical host. We use a placement group (similar to AWS placement groups) to strictly ensure that all control plane nodes are on different physical hosts. Let's assume this placement group is called placement-group-1 and all three control plane nodes belong to it.

However, a KCP rolling update first creates a new control plane node and then deletes an old one. If the new control plane node also uses placement-group-1, an error is reported: there are only three physical hosts, so four control plane nodes cannot all be placed on different hosts. So we want to manage the new control plane nodes with a new placement group, placement-group-2 (the placement groups are independent of each other), so that the KCP rolling update can complete with three physical hosts.
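The constraint can be illustrated with a small model. This is only a sketch under assumptions: PlacementGroup, Place, and ErrNoFreeHost are hypothetical names for this example and do not correspond to any provider API. The model enforces strict spread, meaning no two members of one group may share a host.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoFreeHost is returned when every host already runs a member of the group.
var ErrNoFreeHost = errors.New("no host left that satisfies strict spread")

// PlacementGroup is a hypothetical model of a strict-spread placement group:
// no two members may share a physical host.
type PlacementGroup struct {
	Name    string
	members map[string]string // node name -> host
}

// NewPlacementGroup returns an empty group.
func NewPlacementGroup(name string) *PlacementGroup {
	return &PlacementGroup{Name: name, members: map[string]string{}}
}

// Place assigns node to the first host not already used by this group.
func (g *PlacementGroup) Place(node string, hosts []string) (string, error) {
	used := make(map[string]bool, len(g.members))
	for _, h := range g.members {
		used[h] = true
	}
	for _, h := range hosts {
		if !used[h] {
			g.members[node] = h
			return h, nil
		}
	}
	return "", ErrNoFreeHost
}

func main() {
	hosts := []string{"host-a", "host-b", "host-c"}

	pg1 := NewPlacementGroup("placement-group-1")
	for _, n := range []string{"cp-0", "cp-1", "cp-2"} {
		if _, err := pg1.Place(n, hosts); err != nil {
			fmt.Println("unexpected:", err)
		}
	}

	// The fourth node created by the rolling update does not fit in group 1.
	if _, err := pg1.Place("cp-3", hosts); err != nil {
		fmt.Println("placement-group-1:", err)
	}

	// An independent second group has no members yet, so any host is allowed.
	pg2 := NewPlacementGroup("placement-group-2")
	host, err := pg2.Place("cp-3", hosts)
	fmt.Println("placement-group-2:", host, err)
}
```

Choosing placement-group-2 for cp-3 is the step that requires knowing the machine belongs to the new rollout.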

But currently we have no way to distinguish which machines are new control plane nodes.

haijianyang Feb 03 '23 03:02

Can you distinguish the newer nodes by looking at the template that they reference after doing template rotation?

killianmuldoon Feb 03 '23 13:02

Template rotation belongs to the third type of rolling update (MatchesMachineSpec), where machines can be distinguished by the template they reference.

But how can we distinguish machines in the ShouldRolloutBefore and ShouldRolloutAfter cases?

// 1. Machines whose certificates are about to expire.
collections.ShouldRolloutBefore(&c.reconciliationTime, c.KCP.Spec.RolloutBefore),
// 2. Machines that are scheduled for rollout (KCP.Spec.RolloutAfter set, the RolloutAfter deadline is expired, and the machine was created before the deadline).
collections.ShouldRolloutAfter(&c.reconciliationTime, c.KCP.Spec.RolloutAfter),
// 3. Machines that do not match with KCP config.
collections.Not(MatchesMachineSpec(c.infraResources, c.kubeadmConfigs, c.KCP)),
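For case 2, the comment above says a machine is rolled out only if it was created before the RolloutAfter deadline. One possible heuristic, sketched here with hypothetical names (Machine, day, splitByRolloutAfter) and a CreatedAt field standing in for the machine's creation timestamp, is to treat machines created at or after the deadline as the replacements. This is an assumption-based illustration, not an existing cluster-api helper, and nothing in the snippets above suggests a comparable signal for case 1.

```go
package main

import (
	"fmt"
	"time"
)

// Machine is a simplified stand-in for clusterv1.Machine; CreatedAt plays the
// role of the object's creation timestamp (assumption made for this sketch).
type Machine struct {
	Name      string
	CreatedAt time.Time
}

// day builds a UTC date in February 2023; it only exists to keep the example short.
func day(d int) time.Time {
	return time.Date(2023, time.February, d, 0, 0, 0, 0, time.UTC)
}

// splitByRolloutAfter separates machines created before the RolloutAfter
// deadline (still to be replaced) from those created at or after it.
func splitByRolloutAfter(ms []Machine, rolloutAfter time.Time) (old, replacements []Machine) {
	for _, m := range ms {
		if m.CreatedAt.Before(rolloutAfter) {
			old = append(old, m)
		} else {
			replacements = append(replacements, m)
		}
	}
	return old, replacements
}

func main() {
	deadline := day(10)
	ms := []Machine{
		{Name: "cp-0", CreatedAt: day(1)},
		{Name: "cp-1", CreatedAt: day(2)},
		{Name: "cp-2", CreatedAt: day(11)},
	}
	old, replacements := splitByRolloutAfter(ms, deadline)
	fmt.Println("old:", len(old), "replacements:", len(replacements))
}
```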

haijianyang Feb 04 '23 07:02

Hi @killianmuldoon, do you have any other ideas?

haijianyang Feb 14 '23 08:02

/triage accepted

There is no clear way forward yet, but IMO it makes sense to keep the discussion going.

fabriziopandini Mar 21 '23 09:03

/help

fabriziopandini Mar 21 '23 09:03

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot Mar 21 '23 09:03

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot Mar 20 '24 10:03

/priority backlog

fabriziopandini Apr 11 '24 19:04

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.

There has been no update for a year, and no one else is showing interest in this feature. Also, the use case described above seems to be already covered by placement in failure domains.

/close

fabriziopandini May 02 '24 13:05

@fabriziopandini: Closing this issue.

k8s-ci-robot May 02 '24 13:05