The description of cluster paused and skipping remediation of MHC should be clearer
What steps did you take and what happened?
Original link https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#skipping-remediation
What did you expect to happen?
Original description: `Implicit skipping when the resource is paused (using cluster.x-k8s.io/paused annotation):
When a cluster is paused, none of the machines in that cluster are considered for remediation. When a machine is paused, only that machine is not considered for remediation. A cluster or a machine is usually paused automatically by Cluster API when it detects a migration.`
MHC Controller
// Return early if the object or Cluster is paused.
if annotations.IsPaused(cluster, m) {
log.Info("Reconciliation is paused for this object")
return ctrl.Result{}, nil
}
// IsPaused returns true if the Cluster is paused or the object has the `paused` annotation.
func IsPaused(cluster *clusterv1.Cluster, o metav1.Object) bool {
if cluster.Spec.Paused {
return true
}
return HasPaused(o)
}
According to the MHC code, only when cluster.Spec.Paused is true, none of the machines in that cluster are considered for remediation.
Cluster with cluster.x-k8s.io/paused annotation will only pause the reconciliation of the cluster, not the reconciliation of other resources including MHC.
Cluster API version
v1.4.4
Kubernetes version
No response
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.
/triage accepted
You're right that this is confusing - would be a good idea to clean up the documentation around this.
/help
@killianmuldoon: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/triage accepted
You're right that this is confusing - would be a good idea to clean up the documentation around this.
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
@killianmuldoon the controller spec definition has to be modified or just the documentation needs more clarity?
For this we just need to clear up the documentation.
cool, thanks @killianmuldoon
Should IsPaused check for Cluster annotation as well? It is checking for cluster.spec.paused, but it is not even documented. Only annotations are documented, and MHC explicitly calling func named annotations.IsPaused func, not spec.IsPaused, hinting that it will check the annotations.
Something like:
// util/annotations/helpers.go
// IsPaused returns true if the Cluster or object is paused.
func IsPaused(cluster *clusterv1.Cluster, o metav1.Object) bool {
if cluster.Spec.Paused {
return true
}
// Cluster or object has the `paused` annotation
return HasPaused(cluster) || HasPaused(o)
}
Not sure if we should introduce a second way to achieve the same thing (i.e. pausing a Cluster). If there is documentation missing we can update the documentation instead of changing the implementation.
/priority backlog
@fabriziopandini shall we close this in favour of https://github.com/kubernetes-sigs/cluster-api/issues/10130?
According to the MHC code, only when cluster.Spec.Paused is true, none of the machines in that cluster are considered for remediation.
AFAIK this is not true, or at least it is only partially true.
- MHC does not remediate any machine in the cluster if the cluster is paused (pausing the cluster is like pausing all the MHC resources belonging to it).
- A specific MHC resource does not remediate any target machine if the MHC instance is paused.
- Paused machines are not remediated by MHC.
Given that, I suggest improving the note in https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#skipping-remediation to
There are scenarios where remediation for a machine may be undesirable (eg. during cluster migration using clusterctl move). For such cases, MachineHealthCheck provides the following mechanisms to skip remediation.
- Users can skip remediation for a specific machine by setting the cluster.x-k8s.io/skip-remediation annotation on it.
- Paused Machines (Machines with the cluster.x-k8s.io/paused annotation) are not considered for remediation.
- If a specific MHC resource is paused (using cluster.x-k8s.io/paused annotation), it will stop to remediate the corresponding target machines.
- If the Cluster is paused (using the cluster.x-k8s.io/paused annotation or by setting cluster.spec.paused to true), all the MHC resources belonging to the Cluster will be implicitly paused, and thus stop remediating target machines.
Note: the last option (pausing the Cluster) is the one used by clusterctl move.
/good-first-issue
@fabriziopandini shall we close this in favour of https://github.com/kubernetes-sigs/cluster-api/issues/10130?
This issue is about improving the description of the MHC behaviuor in the book. The linked issue instead is about surfacing what is paused, so I don't think they overlap.
@fabriziopandini: This request has been marked as suitable for new contributors.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
According to the MHC code, only when cluster.Spec.Paused is true, none of the machines in that cluster are considered for remediation.
AFAIK this is not true, or at least it is only partially true.
- MHC does not remediate any machine in the cluster if the cluster is paused (pausing the cluster is like pausing all the MHC resources belonging to it).
- A specific MHC resource does not remediate any target machine if the MHC instance is paused.
- Paused machines are not remediated by MHC.
Given that, I suggest improving the note in https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#skipping-remediation to
There are scenarios where remediation for a machine may be undesirable (eg. during cluster migration using clusterctl move). For such cases, MachineHealthCheck provides the following mechanisms to skip remediation. - Users can skip remediation for a specific machine by setting the cluster.x-k8s.io/skip-remediation annotation on it. - Paused Machines (Machines with the cluster.x-k8s.io/paused annotation) are not considered for remediation. - If a specific MHC resource is paused (using cluster.x-k8s.io/paused annotation), it will stop to remediate the corresponding target machines. - If the Cluster is paused (using the cluster.x-k8s.io/paused annotation or by setting cluster.spec.paused to true), all the MHC resources belonging to the Cluster will be implicitly paused, and thus stop remediating target machines. Note: the last option (pausing the Cluster) is the one used by clusterctl move./good-first-issue
@fabriziopandini shall we close this in favour of https://github.com/kubernetes-sigs/cluster-api/issues/10130?
This issue is about improving the description of the MHC behaviuor in the book. The linked issue instead is about surfacing what is paused, so I don't think they overlap.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign