jobset
Handling cases where a pod is stuck in the Terminating state
Hi, I was wondering how to properly handle cases where a worker pod is stuck in the Terminating state. In my experience, this can happen in various cases:
- The node was shut down during pod deletion
- Kernel hang on node
- GPU problems
From my quick experiments with JobSet, if a worker pod gets stuck in the Terminating state, JobSet will not trigger a restart, because it waits for the underlying pods to be terminated. A quick workaround might be something like a CronJob that periodically force-deletes JobSet-controlled pods that have been stuck in the Terminating state for more than N minutes. This is suboptimal, though: afterwards you can no longer manually investigate what actually happened to the pod and why it got stuck.
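For reference, the selection step of such a cleanup job could look roughly like the sketch below. It is only an illustration, not part of JobSet: `stuck_terminating` is a hypothetical helper, the pods are assumed to be plain dicts in the shape returned by `kubectl get pods -o json`, and it assumes `metadata.deletionTimestamp` is the time by which the pod was supposed to be gone (deletion request time plus grace period). The force delete itself is left out.

```python
from datetime import datetime, timedelta, timezone


def stuck_terminating(pods, max_overdue, now=None):
    """Return names of pods that are overdue for deletion by more than max_overdue.

    pods: list of pod dicts (hypothetical input, e.g. parsed kubectl JSON).
    max_overdue: timedelta, the "N minutes" threshold from the workaround.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in pods:
        meta = pod.get("metadata", {})
        ts = meta.get("deletionTimestamp")
        if ts is None:
            continue  # pod is not being deleted at all
        # Timestamps are serialized as RFC 3339 in UTC, e.g. 2024-01-01T00:00:00Z
        deadline = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )
        if now - deadline > max_overdue:
            stuck.append(meta["name"])
    return stuck
```

A CronJob would then force-delete only the returned pods, ideally after dumping their YAML somewhere so that at least some evidence survives.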
It would be great if I could specify something like "podTerminationTimeout", after which JobSet would create a new Job without waiting for the previous pods to be terminated.
We created PodReplacementPolicy in the Job API for this reason.
It's a beta feature in 1.29 and will only recreate a pod once it is fully terminated.
https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated/README.md
Rereading this, I'm not sure this KEP would help here. It sounds like you want the Job to be marked as failed if a pod goes to Terminating.
Yeah, I don't see how this KEP would help. Actually, after a quick reading of this KEP, I think it would introduce the same problem I'm experiencing with JobSet in the vanilla k8s Job – it will never restart if a pod is stuck in the Terminating state.
Yeah I have experienced pods staying in terminating state for a long time when doing training on TPUs as well. One way we could get around this is setting some timeout on the Job foreground deletion call, and then forcibly delete all pods once we hit that timeout.
However, this is not great since forcibly deleting the pod objects from etcd doesn't guarantee the underlying container process has been cleaned up - a problematic container process could still be holding a GPU/TPU resource for example, preventing a newly scheduled pod from using it.
Totally agree with you. I'm currently using a hand-crafted Argo workflow for launching multi-node training, which also requires force-deleting pods stuck in the Terminating state. That only removes them from etcd and often leaves nodes silently misbehaving. I ended up tainting nodes before force-deleting pods, which kind of works but is a really dirty hack.
That was actually the main reason I wanted to find an alternative (like JobSet) for synchronous jobs, hoping that this problem would already be solved :)
One possible implementation that comes to mind (without needing to forcefully delete workers) is to name the Jobs created by JobSet with an attempt number, like
pytorch-workers-0-attempt-0/pytorch-workers-0-attempt-1/pytorch-workers-0-attempt-2/... (instead of pytorch-workers-0 for each attempt), and to provide a way to set a timeout in the JobSet spec for the Job's workers to terminate (defaulting to infinity for backward compatibility). If time runs out, we just create a new Job with the attempt count increased by one and leave the previous Job hanging for further investigation, while the new workers can be scheduled onto free nodes and continue training.
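To make the proposal a bit more concrete, here is a minimal sketch of the naming and timeout decision. None of this exists in JobSet: `attempt_job_name`, `next_action`, and the timeout parameter are hypothetical names used only to illustrate the idea.

```python
def attempt_job_name(base, attempt):
    """Build a per-attempt Job name, e.g. pytorch-workers-0-attempt-1."""
    return f"{base}-attempt-{attempt}"


def next_action(terminating_for_s, timeout_s=None):
    """Decide what to do while the previous attempt's pods are still terminating.

    timeout_s=None models the proposed default of "infinity": keep waiting,
    which matches the current behavior.
    """
    if timeout_s is None or terminating_for_s < timeout_s:
        return "wait"
    # Timeout hit: leave the old Job around for debugging and move on.
    return "create-next-attempt"
```

With a 60 second timeout, pods that have been terminating for 100 seconds would lead to `attempt_job_name("pytorch-workers-0", 1)` being created while attempt 0 is left in place.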
But at least one important problem I see here is the headless service. As the pods for each attempt will be named differently, we would have to force users to handle this in user code.
One possible approach would be to add an env var, similar to rank, looking like this:
- name: MASTER_NAME
  value: "pytorch-workers-0-0"
- name: MASTER_CONTAINER
  value: "pytorch"
- name: ATTEMPT
  valueFrom:
    fieldRef:
      fieldPath: metadata.annotations['jobset.sigs.k8s.io/restart-attempt']
and setting torchrun --master_addr=$MASTER_NAME-attempt-$ATTEMPT.$MASTER_CONTAINER as the command.
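As a rough illustration of what user code would end up computing, the sketch below assembles the address from those variables. It assumes the MASTER_NAME variable defined above is the one meant in the command, and `master_addr` is a hypothetical helper, not something JobSet or torchrun provides.

```python
import os


def master_addr(env=None):
    """Compose <MASTER_NAME>-attempt-<ATTEMPT>.<MASTER_CONTAINER> from env vars."""
    env = os.environ if env is None else env
    return f"{env['MASTER_NAME']}-attempt-{env['ATTEMPT']}.{env['MASTER_CONTAINER']}"


# Example with the values from the snippet above and a made-up attempt number.
example = master_addr(
    {"MASTER_NAME": "pytorch-workers-0-0", "MASTER_CONTAINER": "pytorch", "ATTEMPT": "2"}
)
print(example)  # pytorch-workers-0-0-attempt-2.pytorch
```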
If I understand this correctly, it sounds like you want the Job to be failed as soon as a pod goes into terminating. I see that we could implement recreation in Jobset or we could allow a way to mark a job as failed as soon as a pod goes to terminating.
@mimowo @alculquicondor any ideas here? Jobset only recreates jobs once they are failed.
I think a Pod stuck in terminating is something we should eliminate in the first place. Or, at least, we need to understand what the scenario is in order to propose the best approach.
Underneath JobSet the Pod is managed by the batch/Job controller, and there have been some fixes in recent k8s versions. For example, when the node is gone, the pod phase should be transitioned from Running to Failed by PodGC in k8s 1.26+.
What is your k8s version? Also, can you share your JobSet yaml, and the yaml for the stuck pod?
Also, what does terminating actually mean in this case? Is the pod in phase Running and unable to transition to Failed, or is it already Failed, with a finalizer blocking the final deletion from the API server?
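A small sketch of how one might tell these cases apart when looking at a stuck pod. It assumes the pod is available as a plain dict (for example from `kubectl get pod -o json`); `classify_terminating` and its return labels are hypothetical names, not an existing API.

```python
def classify_terminating(pod):
    """Classify a pod according to the two scenarios asked about above."""
    meta = pod.get("metadata", {})
    if not meta.get("deletionTimestamp"):
        return "not-terminating"
    phase = pod.get("status", {}).get("phase")
    if phase in ("Failed", "Succeeded"):
        # Pod already reached a terminal phase; only the object removal is pending.
        if meta.get("finalizers"):
            return "terminal-blocked-by-finalizer"
        return "terminal-awaiting-removal"
    # Deletion was requested, but the phase never moved to a terminal state.
    return "deleting-but-not-terminal"
```

Posting the result together with the pod YAML should make it easier to route the problem to the right component.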
It would be better to understand the exact cases where the Pods are getting stuck and bring this up to kubernetes/kubernetes, instead of adding hacks in Jobset.
cc @SergeyKanzhelev
> It would be better to understand the exact cases where the Pods are getting stuck and bring this up to kubernetes/kubernetes, instead of adding hacks in Jobset.
Agreed. I want to prioritize this because it is actually particularly problematic for large scale distributed ML training workloads, as it can substantially increase e2e failure recovery latency. We use foreground deletion when deleting failed Jobs, to prevent exponential backoff of pod creation attempts when the pods from the previous Job iteration still exist. So when pods stay in terminating state, this prevents the JobSet controller from creating a new replacement Job until all pods are finally cleaned up, and only then can the rescheduling of all the new pods begin.
For the cases I've seen, I think it may be due to SIGTERM signal handlers in the training code which trigger auto-checkpointing logic on graceful shutdown, so up to terminationGracePeriodSeconds seconds can pass before the pod objects are actually deleted from etcd.
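For anyone unfamiliar with the pattern, a minimal sketch of such a handler is below. It is not taken from any specific training framework: `ShutdownFlag`, `train`, and `save_checkpoint` are hypothetical names. The point is that the process keeps running after SIGTERM until the checkpoint is written, and the pod stays in Terminating for that whole time.

```python
import signal


class ShutdownFlag:
    """Records that SIGTERM was received so the training loop can react."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Keep the handler tiny: just set a flag, do the real work in the loop.
        self.requested = True

    def install(self):
        # Must be called from the main thread of the training process.
        signal.signal(signal.SIGTERM, self.handle)


def train(steps, flag, save_checkpoint):
    """Run up to `steps` steps; checkpoint and stop early on shutdown request."""
    for step in range(steps):
        if flag.requested:
            save_checkpoint(step)  # this is the slow part during termination
            return step
        # ... one training step would run here ...
    return steps
```

If the checkpoint write is slow, the kubelet only sends SIGKILL once the grace period is over, so the termination latency is bounded by terminationGracePeriodSeconds rather than by the handler.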
I also wonder if the container process is not releasing the accelerator chip cleanly/quickly for some reason.
I will talk with some folks in SIG Node to get their take on this and try to drive a long-term solution for it.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
This is an important problem to solve. I did some benchmarking and found that, for a 6k-pod JobSet being restarted, the majority of the e2e restart latency was due to waiting for pods in the Terminating state to be completely deleted so that the JobSet controller could recreate the Jobs (foreground cascading deletion policy).
Please share repro cases. It’s really hard to follow this without those
/lifecycle stale
/remove-lifecycle stale
I am doing some experiments using the spec below:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: sleep
              image: busybox
              command:
              - sleep
              args:
              - infinity
Then I performed the following actions:
- SSH into the node that hosts one of the pods, e.g. example-workers-0-0-XXXX, and stop the kubelet with systemctl stop kubelet (the node will become NotReady in 40 seconds).
- Delete the other pod, example-workers-0-1-XXXX.
- The -0-0 pod will be stuck in the Terminating state, and since JobSet uses the foreground deletion policy, a new Job will not be created.
Just to close the loop: ideally there should be a control plane component that taints the node object as unavailable, and that taint should trigger deletion of the pods.
/lifecycle stale
If there are more specific problem descriptions that we should handle in SIG Node, please let me know.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
I reported an issue in Kueue, as our users are hitting this: https://github.com/kubernetes-sigs/kueue/issues/6757