
MachineDrainRules WaitCompleted behavior still waits for completed Pods

Open dgmorales opened this issue 2 months ago • 7 comments

What steps did you take and what happened?

  1. Apply this Pod manifest to a cluster managed by Cluster API:
apiVersion: v1
kind: Pod
metadata:
  name: test-wait-completed
  labels:
    cluster.x-k8s.io/drain: wait-completed
spec:
  # nodeName: <node-name> # set this if you want to pick a specific node to drain
  containers:
  - name: sleeper
    image: busybox:latest
    command: ["sleep"]
    args: ["300"]
  restartPolicy: Never
  2. Check which node that Pod landed on. On your Cluster API management cluster, delete the Machine object associated with that node to trigger the drain process on it.
  3. Watch the drain logs from the capi-controller-manager Pod in the capi-system namespace.
  4. Note that Cluster API identifies the test-wait-completed Pod and waits for it.
  5. After the 300 seconds, the Pod shifts to a Completed state.
  6. Note that Cluster API is still waiting on that Pod, and will keep waiting forever (or until nodeDrainTimeout, if set).
  7. Delete the Pod, and Cluster API will proceed to delete the Machine.

What did you expect to happen?

I expected the drain/delete to proceed when the pod shifts to a Completed state.

I believe the "WaitCompleted" wording suggests quite strongly that a matching Pod becoming Completed would unlock the wait. And I think of it as a sensible (maybe optional) behavior for WaitCompleted.

One use case I have for that behavior is to make node rollouts wait for some important long-running jobs to finish. I don't want those jobs to be canceled on eviction by a cluster-api driven node drain.
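For illustration, a minimal sketch of that use case, assuming a hypothetical Job named important-batch-job (image and command are placeholders): Pods created by the Job inherit the labels from its pod template, so the same drain label used in the reproduction steps above can be set there.

apiVersion: batch/v1
kind: Job
metadata:
  name: important-batch-job # hypothetical name
spec:
  template:
    metadata:
      labels:
        # Pods created by this Job carry the drain label, so a Cluster API
        # drain should treat them with the WaitCompleted behavior.
        cluster.x-k8s.io/drain: wait-completed
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:latest # placeholder standing in for the real workload
        command: ["sleep"]
        args: ["3600"]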

The argument for making the expected behavior optional is stronger if we think about failed Jobs. Proceeding with the drain in those cases might remove the opportunity to debug why the Job failed, or even to notice that it failed. So one might prefer to keep waiting when Jobs end in a failed state, but proceed on successful ones, for example.

Cluster API version

1.9.6 (a bit outdated, I know, but I checked the most recent code and all PRs mentioning MachineDrainRules to make sure the behavior is still as reported here)

Kubernetes version

Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.32.8
Warning: version difference between client (1.34) and server (1.32) exceeds the supported minor version skew of +/-1

Anything else you would like to add?

This might be considered a feature request instead of a bug. I categorized it as a bug because, as I said above, I think the WaitCompleted wording leads the user to expect a different behavior than the one available.

As for the optionality discussed above, I believe we could add a new field like this:

spec:
  drain:
    behavior: WaitCompleted
    waitUntil: Success|Removal
    # Removal is the current behavior. Could be the default option 
    # at least initially to not change the semantics for existing users.
    # Success would stop waiting if the pod transitions to a success 
    # completed state.

But such a waitUntil field would only make sense for the WaitCompleted behavior. I believe this is acceptable, since the order field also only makes sense for the Drain behavior.

I have read most of the code and have some, though limited, experience with Kubernetes operators. I will check whether I feel capable of providing a PR for this.

Label(s) to be applied

/kind bug /area machine

dgmorales avatar Oct 30 '25 20:10 dgmorales

I think I can put together a PR to address this and I am working on it (it seems I don't have permissions to assign this to myself, though).

dgmorales avatar Nov 03 '25 10:11 dgmorales

I think what you are describing as wanted behavior is what we intended it to be.

	// MachineDrainRuleDrainBehaviorWaitCompleted means the Pod should not be evicted,
	// but overall drain should wait until the Pod completes.
	MachineDrainRuleDrainBehaviorWaitCompleted MachineDrainRuleDrainBehavior = "WaitCompleted"

Or am I misunderstanding your issue?

There is some e2e test coverage around wait-completed, but it looks like it does not cover the case of the Pod reaching Succeeded; it force-deletes the Pod later on.

chrischdi avatar Nov 04 '25 17:11 chrischdi

I think what you are describing as wanted behavior is what we intended it to be.

Yeah, it could be so, but I can't be sure. I think there can be valid use cases for small differences in the behavior, such as whether it should keep waiting when a matching Pod is in the Failed phase.

That's why I am thinking of adding a waitUntil field to tune that behavior.

Something like this:

	// waitUntil defines when to stop waiting for a Pod when behavior is "WaitCompleted".
	// Can be either "PodSucceeded", "PodNotRunning", or "PodRemoved".

	// "PodSucceeded" will stop waiting if the Pod is removed OR if pod.Status.Phase is
	// either corev1.PodSucceeded or corev1.PodUnknown.

	// "PodNotRunning" will stop waiting if pod.Status.Phase != corev1.PodRunning.

	// "PodRemoved" will keep waiting while pod exists. This is the default behavior if
	// waitUntil is not specified.

	// waitUntil can only be set if behavior is set to "WaitCompleted".
	// +optional
	WaitUntil *MachineDrainRuleWaitUntil `json:"waitUntil,omitempty"`
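
To make the proposal concrete, this is roughly how the field might look in a MachineDrainRule spec (a sketch only; the placement mirrors the existing behavior field from the earlier example):

spec:
  drain:
    behavior: WaitCompleted
    # waitUntil would only apply when behavior is WaitCompleted.
    # PodRemoved would preserve today's semantics; PodSucceeded would stop
    # waiting once the Pod succeeds (or is removed).
    waitUntil: PodSucceeded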

I am not sure whether we should strive for backwards compatibility by keeping PodRemoved as the default behavior.

Early feedback is welcome on that.

dgmorales avatar Nov 05 '25 18:11 dgmorales

That feels like a bad API though; the field only makes sense for WaitCompleted, as you noted.

Also: having both PodRemoved and PodSucceeded IMHO does not make sense. If a Pod is removed, that should for us lead to the same result as the Pod having completed.

One question: how does kubectl drain behave for such a pod?

chrischdi avatar Nov 06 '25 11:11 chrischdi

That feels like a bad API though; the field only makes sense for WaitCompleted, as you noted.

Yeah, I agree. But it's also not new here: the order field only makes sense with behavior set to Drain. Its doc comment says that order can only be set if behavior is set to "Drain". That's why I think this is acceptable. Even if I come up with a better API idea (I haven't so far), it would diverge from the current API style for order.

Also: having both PodRemoved and PodSucceeded IMHO does not make sense. If a Pod is removed, that should for us lead to the same result as the Pod having completed.

Indeed, PodRemoved is the option that preserves the current WaitCompleted behavior; that's why I proposed it as the default for waitUntil. I can try to come up with alternatives for the enum options (naming and/or behavior); I considered these the most explicit and accurate.

One question: how does kubectl drain behave for such a pod?

kubectl drain refuses to drain if it sees any Pod that is not owned by some controller (unless you set --force). The current Cluster API code only adds a warning in that case (is this a bug?).

Our Jenkins setup creates Pods not owned by any controller, and I am sure we can find other cases of that. Pods created by Jobs and CronJobs are evicted (not sure if there is any twist or variant behavior there; I don't think so), because they are owned by the Job.

I actually prefer the current Cluster API take on this (with the better-working WaitCompleted behavior proposed here). But it is different from kubectl drain, and I know you try to keep them similar.

dgmorales avatar Nov 06 '25 18:11 dgmorales

/triage accepted

I need some time to take another look, and I am sure @sbueringer has some opinion on it (but just has not yet had the time to take a look).

chrischdi avatar Nov 12 '25 13:11 chrischdi

Looks like this topic is more complicated than I initially thought. Considering the upcoming code freeze and PTO, I'll need some time to get back to this and take a closer look.

sbueringer avatar Nov 24 '25 07:11 sbueringer