Handling cases where pod is stuck in Terminating state

qGentry opened this issue 1 year ago • 22 comments

Hi, I was wondering how to properly handle cases where a worker pod is stuck in the Terminating state. In my experience, this may happen in various cases:

  • Node got shut down during pod deletion
  • Kernel hang on node
  • GPU problems

From my quick experiments with JobSet, if a worker pod gets stuck in the Terminating state, JobSet will not trigger a restart, as it waits for the underlying pods to be terminated. A quick workaround might be something like a CronJob that periodically force deletes jobset-controlled pods that have been stuck in Terminating for more than N minutes (sketched below), but this is suboptimal because you can no longer investigate afterwards what actually happened to the pod and why it got stuck in Terminating.
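
To illustrate, a rough sketch of that workaround might look like this (assuming an image that ships kubectl and GNU date, and a hypothetical "pod-reaper" ServiceAccount allowed to list and delete pods in the namespace; the 10-minute threshold and 5-minute schedule are arbitrary):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: stuck-pod-reaper
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-reaper   # hypothetical SA with list/get/delete on pods
          restartPolicy: Never
          containers:
          - name: reaper
            image: bitnami/kubectl         # any image with kubectl and GNU date works
            command: ["/bin/sh", "-c"]
            args:
            - |
              # Force delete jobset-managed pods (label set by the JobSet controller)
              # whose deletionTimestamp is older than 10 minutes.
              now=$(date +%s)
              for p in $(kubectl get pods -l jobset.sigs.k8s.io/jobset-name -o name); do
                ts=$(kubectl get "$p" -o jsonpath='{.metadata.deletionTimestamp}')
                [ -z "$ts" ] && continue
                age=$(( now - $(date -d "$ts" +%s) ))
                [ "$age" -gt 600 ] && kubectl delete "$p" --force --grace-period=0
              done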

It would be great if I could specify something like "podTerminationTimeout", after which JobSet would create a new Job without waiting for the previous pods to be terminated.
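
Purely to illustrate the idea (no such field exists in the JobSet API today), something like:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  failurePolicy:
    maxRestarts: 10
    # Hypothetical field: after a restart is triggered, wait at most this long
    # for the old pods to terminate before creating the replacement Jobs anyway.
    podTerminationTimeout: 5m
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      ...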

qGentry avatar May 22 '24 16:05 qGentry

We created the PodReplacementPolicy in the job api for this reason.

It’s a beta feature in 1.29 and will only recreate a pod once it is fully terminated.

https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated/README.md
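
For reference, that policy is a field on the batch/v1 Job spec; the two values are TerminatingOrFailed and Failed:

apiVersion: batch/v1
kind: Job
metadata:
  name: worker
spec:
  # Only create a replacement pod once the old one is fully terminated
  # (reaches phase Failed or Succeeded), not while it is still terminating.
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sleep", "infinity"]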

Rereading this, I'm not sure this KEP would help here. It sounds like you want the Job to be marked as failed if a pod goes to terminating.

kannon92 avatar May 22 '24 16:05 kannon92

Yeah, I don't see how this KEP would help. Actually, after a quick reading of this KEP, I think it would introduce the same problem I'm experiencing with JobSet in the vanilla k8s Job – it will never restart if a pod is stuck in the Terminating state.

qGentry avatar May 22 '24 17:05 qGentry

Yeah, I have experienced pods staying in the Terminating state for a long time when doing training on TPUs as well. One way we could get around this is to set a timeout on the Job foreground deletion call and then forcibly delete all pods once we hit that timeout.

However, this is not great since forcibly deleting the pod objects from etcd doesn't guarantee the underlying container process has been cleaned up - a problematic container process could still be holding a GPU/TPU resource for example, preventing a newly scheduled pod from using it.

danielvegamyhre avatar May 25 '24 16:05 danielvegamyhre

Totally agree with you. I'm currently using a hand-crafted Argo workflow for launching multi-node training, which also requires force deleting pods stuck in the Terminating state; this just deletes them from etcd and often leaves nodes silently misbehaving. I ended up tainting nodes before force deleting pods, which kind of works but is a really dirty hack.

That was actually the main reason I wanted to find an alternative (like JobSet) for synchronous jobs, hoping that this problem would already be solved :)

One possible implementation that comes to mind (without the need to forcefully delete workers) is to name the Job created by JobSet with an attempt number, like pytorch-workers-0-attempt-0/pytorch-workers-0-attempt-1/pytorch-workers-0-attempt-2/... (instead of pytorch-workers-0 for each attempt), and to provide a way to set a timeout in the JobSet spec for a Job's workers to terminate (defaulting to infinity for backward compatibility). If the time runs out, we just create a new Job with the attempt count increased by one and leave the previous Job hanging for further investigation, while the new workers can schedule onto free nodes and continue training progress.

But at least one important problem I see here is the headless service: since the pods for each attempt will be named differently, we would be forcing users to handle this in their code.

One possible approach would be to add env vars, similar to the rank, looking like this:

              - name: MASTER_NAME
                value: "pytorch-workers-0-0"
              - name: MASTER_CONTAINER
                value: "pytorch"
              - name: ATTEMPT
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['jobset.sigs.k8s.io/restart-attempt']

and setting torchrun --master_addr=$MASTER_NAME-attempt-$ATTEMPT.$MASTER_CONTAINER as the cmd.
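
In the container spec that would look roughly like this (the port and script name are just placeholders):

              command:
                - /bin/sh
                - -c
                # Folded scalar: the shell expands the env vars defined above at runtime.
                - >
                  torchrun
                  --master_addr="$MASTER_NAME-attempt-$ATTEMPT.$MASTER_CONTAINER"
                  --master_port=29500
                  train.py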

qGentry avatar May 26 '24 09:05 qGentry

If I understand this correctly, it sounds like you want the Job to be failed as soon as a pod goes into terminating. I see that we could either implement recreation in JobSet or allow a way to mark a Job as failed as soon as a pod goes to terminating.

@mimowo @alculquicondor any ideas here? Jobset only recreates jobs once they are failed.

kannon92 avatar May 29 '24 14:05 kannon92

I think a Pod stuck in terminating is something we should eliminate in the first place. Or, at least, we need to understand what the scenario is in order to propose the best approach.

Underneath JobSet the Pod is managed by the batch/Job controller, and there have been some fixes in recent k8s versions. For example, when the node is gone, the pod phase should be transitioned from Running to Failed by PodGC in k8s 1.26+.

What is your k8s version? Also, can you share your JobSet yaml, and the yaml for the stuck pod?

mimowo avatar May 29 '24 14:05 mimowo

Also, what does "terminating" actually mean in this case? Is it in phase Running and cannot transition to Failed, or is it already Failed, but there is a finalizer which blocks the final deletion from the API server?

mimowo avatar May 29 '24 14:05 mimowo

It would be better to understand the exact cases where the Pods are getting stuck and bring this up to kubernetes/kubernetes, instead of adding hacks in Jobset.

alculquicondor avatar May 29 '24 17:05 alculquicondor

cc @SergeyKanzhelev

alculquicondor avatar May 29 '24 17:05 alculquicondor

It would be better to understand the exact cases where the Pods are getting stuck and bring this up to kubernetes/kubernetes, instead of adding hacks in Jobset.

Agreed. I want to prioritize this because it is actually particularly problematic for large scale distributed ML training workloads, as it can substantially increase e2e failure recovery latency. We use foreground deletion when deleting failed Jobs, to prevent exponential backoff of pod creation attempts when the pods from the previous Job iteration still exist. So when pods stay in terminating state, this prevents the JobSet controller from creating a new replacement Job until all pods are finally cleaned up, and only then can the rescheduling of all the new pods begin.

For the cases I've seen, I think it may be due to SIGTERM signal handlers in the training code which trigger auto-checkpointing logic on graceful shutdown, so at least terminationGracePeriodSeconds seconds pass before the pod objects are actually deleted from etcd.
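
For context, that grace window is terminationGracePeriodSeconds on the pod spec (30s by default); training pods that checkpoint on SIGTERM typically raise it, e.g.:

    spec:
      # Give the SIGTERM handler up to 10 minutes to finish checkpointing
      # before the kubelet sends SIGKILL.
      terminationGracePeriodSeconds: 600
      containers:
      - name: trainer
        ...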

I also wonder if the container process is not releasing the accelerator chip cleanly/quickly for some reason.

I will talk with some folks in SIG Node to get their take on this and try to drive a long-term solution for it.

danielvegamyhre avatar Jul 01 '24 00:07 danielvegamyhre

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 29 '24 00:09 k8s-triage-robot

/remove-lifecycle stale

This is an important problem to solve. I did some benchmarking and found that for a 6k-pod JobSet being restarted, the majority of the e2e restart latency was due to waiting for pods in the Terminating state to be completely deleted so the JobSet controller could recreate the Jobs (foreground cascading deletion policy).

danielvegamyhre avatar Oct 05 '24 17:10 danielvegamyhre

Please share repro cases. It’s really hard to follow this without them.

kannon92 avatar Oct 05 '24 18:10 kannon92

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 03 '25 18:01 k8s-triage-robot

/remove-lifecycle stale

ahg-g avatar Jan 31 '25 04:01 ahg-g

I am doing some experiments using the spec below:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: sleep
              image: busybox
              command:
                - sleep
              args:
                - infinity

Then I performed the following actions:

  1. SSH into the node that hosts one of the pods, e.g. example-workers-0-0-XXXX, and stop the kubelet with systemctl stop kubelet (the node becomes NotReady after about 40 seconds).
  2. Delete the other pod, example-workers-0-1-XXXX.
  3. The -0-0 pod will be stuck in the Terminating state, and since JobSet uses the foreground deletion policy, a new Job will not be created.

SidneyShen avatar Feb 28 '25 05:02 SidneyShen

Just to close the loop on the repro above: ideally there should be a control plane component that taints the Node object as unavailable, and that taint should trigger deleting the pods.
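
For what it's worth, recent Kubernetes versions cover part of this with non-graceful node shutdown: if the node.kubernetes.io/out-of-service taint is applied to the shut-down Node (today by an operator or some external automation), Pod GC force deletes the pods stuck on it. Roughly (the node name is a placeholder):

apiVersion: v1
kind: Node
metadata:
  name: node-hosting-example-workers-0-0   # placeholder for the NotReady node
spec:
  taints:
  # With this taint on a NotReady node, the pods stuck on it are force deleted
  # so replacements can be scheduled elsewhere.
  - key: node.kubernetes.io/out-of-service
    value: nodeshutdown
    effect: NoExecute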

ahg-g avatar Apr 29 '25 19:04 ahg-g

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 28 '25 20:07 k8s-triage-robot

If there are more specific problem descriptions that we should handle in SIG Node, please let me know.

SergeyKanzhelev avatar Jul 28 '25 20:07 SergeyKanzhelev

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 27 '25 21:08 k8s-triage-robot

/remove-lifecycle rotten

mimowo avatar Sep 08 '25 14:09 mimowo

I reported an issue in Kueue, as our users are hitting it: https://github.com/kubernetes-sigs/kueue/issues/6757

mimowo avatar Sep 08 '25 15:09 mimowo