
ttlSecondsAfterFinished for MPIJob, not only launcher

Open hy00nc opened this issue 1 year ago • 7 comments

Is there a plan to extend ttlSecondsAfterFinished to the MPIJob level, not just the launcher?

hy00nc avatar May 27 '24 10:05 hy00nc

Do you mean that you want to keep the pod objects until the TTL expires?

Or do you want to keep them running?

alculquicondor avatar May 27 '24 12:05 alculquicondor

@alculquicondor, thanks for the reply. I want the MPIJob resource itself to be deleted after the TTL, just like how ttlSecondsAfterFinished works in MPIJob V1. In the current implementation, the MPIJob remains until it is deleted explicitly, right?

hy00nc avatar May 27 '24 12:05 hy00nc

oh, gotcha. I don't know if that's how other Kubeflow APIs work. If they do, we can bring MPIJob back to parity.

alculquicondor avatar May 27 '24 14:05 alculquicondor

> oh, gotcha. I don't know if that's how other Kubeflow APIs work. If they do, we can bring MPIJob back to parity.

Indeed, the other Jobs will be removed after ttlSecondsAfterFinished like this:

https://github.com/kubeflow/training-operator/blob/be5df91eb43e2fdfa1b0a7005f7aeb8cc3a52fb1/pkg/controller.v1/common/job.go#L428-L435
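
For readers who do not want to follow the link, the general TTL-after-finished pattern can be sketched roughly as below. This is only an illustration under stated assumptions, not the linked training-operator code: the helper name `secondsUntilCleanup` and its integer-based signature are hypothetical, and a real controller would read the completion time and TTL from the job's status and run policy.

```go
package main

import (
	"fmt"
	"time"
)

// secondsUntilCleanup is a hypothetical helper (not an mpi-operator or
// training-operator API). It returns how many seconds remain before a
// finished job becomes eligible for deletion; a result <= 0 means the TTL
// has expired. ok is false when no TTL is set, in which case the job is
// kept until someone deletes it explicitly.
func secondsUntilCleanup(finishedUnix, nowUnix int64, ttlSeconds *int32) (remaining int64, ok bool) {
	if ttlSeconds == nil {
		return 0, false
	}
	return finishedUnix + int64(*ttlSeconds) - nowUnix, true
}

func main() {
	// Assumed example values: a job that finished two minutes ago with a
	// five-minute TTL.
	ttl := int32(300)
	finished := time.Now().Add(-2 * time.Minute)

	remaining, ok := secondsUntilCleanup(finished.Unix(), time.Now().Unix(), &ttl)
	switch {
	case !ok:
		fmt.Println("no TTL set: keep the job until it is deleted explicitly")
	case remaining <= 0:
		fmt.Println("TTL expired: delete the job object")
	default:
		fmt.Printf("TTL not reached: check again in about %ds\n", remaining)
	}
}
```

In a controller, the "check again" branch would typically translate into requeueing the job for a later reconcile, and the "expired" branch into deleting the job object so that its owned resources are garbage-collected.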

tenzen-y avatar May 27 '24 14:05 tenzen-y

Would it make sense to extend activeDeadlineSeconds and backoffLimit as well? I guess these are also currently limited to the launcher, but other Kubeflow jobs apply them at the job level.

hy00nc avatar May 28 '24 00:05 hy00nc

Those should be fine on the launcher Job alone, because the launcher Job is what controls the execution. If it finishes as Failed, the rest of the pods would terminate too, IIRC.

alculquicondor avatar May 28 '24 12:05 alculquicondor

@hy00nc could we do this now?

GautamSinghania avatar Jun 05 '25 13:06 GautamSinghania