cluster-api icon indicating copy to clipboard operation
cluster-api copied to clipboard

The description of cluster paused and skipping remediation of MHC should be clearer

Open haijianyang opened this issue 2 years ago • 13 comments

What steps did you take and what happened?

Original link https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#skipping-remediation

What did you expect to happen?

Original description: `Implicit skipping when the resource is paused (using cluster.x-k8s.io/paused annotation):

When a cluster is paused, none of the machines in that cluster are considered for remediation. When a machine is paused, only that machine is not considered for remediation. A cluster or a machine is usually paused automatically by Cluster API when it detects a migration.`

MHC Controller

// Return early if the object or Cluster is paused.
if annotations.IsPaused(cluster, m) {
	log.Info("Reconciliation is paused for this object")
	return ctrl.Result{}, nil
}

// IsPaused returns true if the Cluster is paused or the object has the `paused` annotation.
func IsPaused(cluster *clusterv1.Cluster, o metav1.Object) bool {
	if cluster.Spec.Paused {
		return true
	}
	return HasPaused(o)
}

According to the MHC code, only when cluster.Spec.Paused is true, none of the machines in that cluster are considered for remediation.

Cluster with cluster.x-k8s.io/paused annotation will only pause the reconciliation of the cluster, not the reconciliation of other resources including MHC.

Cluster API version

v1.4.4

Kubernetes version

No response

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

haijianyang avatar Jul 21 '23 05:07 haijianyang

/triage accepted

You're right that this is confusing - would be a good idea to clean up the documentation around this.

/help

killianmuldoon avatar Jul 21 '23 09:07 killianmuldoon

@killianmuldoon: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/triage accepted

You're right that this is confusing - would be a good idea to clean up the documentation around this.

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jul 21 '23 09:07 k8s-ci-robot

/assign

whtssub avatar Jul 25 '23 04:07 whtssub

@killianmuldoon the controller spec definition has to be modified or just the documentation needs more clarity?

whtssub avatar Jul 25 '23 05:07 whtssub

For this we just need to clear up the documentation.

killianmuldoon avatar Jul 25 '23 07:07 killianmuldoon

cool, thanks @killianmuldoon

whtssub avatar Jul 25 '23 10:07 whtssub

Should IsPaused check for Cluster annotation as well? It is checking for cluster.spec.paused, but it is not even documented. Only annotations are documented, and MHC explicitly calling func named annotations.IsPaused func, not spec.IsPaused, hinting that it will check the annotations.

Something like:

// util/annotations/helpers.go

// IsPaused returns true if the Cluster or object is paused.
func IsPaused(cluster *clusterv1.Cluster, o metav1.Object) bool {
	if cluster.Spec.Paused {
		return true
	}

        // Cluster or object  has the `paused` annotation
	return HasPaused(cluster) || HasPaused(o)
}

antonblr avatar Dec 06 '23 18:12 antonblr

Not sure if we should introduce a second way to achieve the same thing (i.e. pausing a Cluster). If there is documentation missing we can update the documentation instead of changing the implementation.

sbueringer avatar Dec 08 '23 13:12 sbueringer

/priority backlog

fabriziopandini avatar Apr 11 '24 18:04 fabriziopandini

@fabriziopandini shall we close this in favour of https://github.com/kubernetes-sigs/cluster-api/issues/10130?

enxebre avatar May 29 '24 13:05 enxebre

According to the MHC code, only when cluster.Spec.Paused is true, none of the machines in that cluster are considered for remediation.

AFAIK this is not true, or at least it is only partially true.

  • MHC does not remediate any machine in the cluster if the cluster is paused (pausing the cluster is like pausing all the MHC resources belonging to it).
  • A specific MHC resource does not remediate any target machine if the MHC instance is paused.
  • Paused machines are not remediated by MHC.

Given that, I suggest improving the note in https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#skipping-remediation to

There are scenarios where remediation for a machine may be undesirable (eg. during cluster migration using clusterctl move). For such cases, MachineHealthCheck provides the following mechanisms to skip remediation.

- Users can skip remediation for a specific machine by setting the cluster.x-k8s.io/skip-remediation annotation on it.
- Paused Machines (Machines with the cluster.x-k8s.io/paused annotation) are not considered for remediation.
- If a specific MHC resource is paused (using cluster.x-k8s.io/paused annotation), it will stop to remediate the corresponding target machines.
- If the Cluster is paused (using the cluster.x-k8s.io/paused annotation or by setting cluster.spec.paused to true), all the MHC resources belonging to the Cluster will be implicitly paused, and thus stop remediating target machines.

Note: the last option (pausing the Cluster) is the one used by clusterctl move.

/good-first-issue

@fabriziopandini shall we close this in favour of https://github.com/kubernetes-sigs/cluster-api/issues/10130?

This issue is about improving the description of the MHC behaviuor in the book. The linked issue instead is about surfacing what is paused, so I don't think they overlap.

fabriziopandini avatar May 30 '24 15:05 fabriziopandini

@fabriziopandini: This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

According to the MHC code, only when cluster.Spec.Paused is true, none of the machines in that cluster are considered for remediation.

AFAIK this is not true, or at least it is only partially true.

  • MHC does not remediate any machine in the cluster if the cluster is paused (pausing the cluster is like pausing all the MHC resources belonging to it).
  • A specific MHC resource does not remediate any target machine if the MHC instance is paused.
  • Paused machines are not remediated by MHC.

Given that, I suggest improving the note in https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#skipping-remediation to

There are scenarios where remediation for a machine may be undesirable (eg. during cluster migration using clusterctl move). For such cases, MachineHealthCheck provides the following mechanisms to skip remediation.

- Users can skip remediation for a specific machine by setting the cluster.x-k8s.io/skip-remediation annotation on it.
- Paused Machines (Machines with the cluster.x-k8s.io/paused annotation) are not considered for remediation.
- If a specific MHC resource is paused (using cluster.x-k8s.io/paused annotation), it will stop to remediate the corresponding target machines.
- If the Cluster is paused (using the cluster.x-k8s.io/paused annotation or by setting cluster.spec.paused to true), all the MHC resources belonging to the Cluster will be implicitly paused, and thus stop remediating target machines.

Note: the last option (pausing the Cluster) is the one used by clusterctl move.

/good-first-issue

@fabriziopandini shall we close this in favour of https://github.com/kubernetes-sigs/cluster-api/issues/10130?

This issue is about improving the description of the MHC behaviuor in the book. The linked issue instead is about surfacing what is paused, so I don't think they overlap.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 30 '24 15:05 k8s-ci-robot

/assign

SD-13 avatar Jun 29 '24 17:06 SD-13