
[Feature] `RayJob` with `Waiting` status needs a ttl mechanism

Open danielgafni opened this issue 3 months ago • 5 comments

Search before asking

  • [x] I had searched in the issues and found no similar feature requirement.

Description

I'm using RayJob with InteractiveMode. Sometimes a RayJob is left in an incomplete state: it has been created, but no Ray job has been submitted (the `.spec.jobId` field is missing).

Currently, these RayJobs are left to hang forever in Waiting status.

It would be great if they could be cleaned up instead. ttlSecondsAfterFinished/ActiveDeadlineSeconds aren't good options to control this behavior since they track active run time, not waiting time, so perhaps a new field is needed for this purpose.

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

danielgafni avatar Sep 02 '25 22:09 danielgafni

Hi @Future-Outlier , mind if I take this one? I have an idea for how to implement it.

EagleLo avatar Sep 03 '25 03:09 EagleLo

Does ActiveDeadlineSeconds already work for this?

rueian avatar Sep 03 '25 03:09 rueian

In my opinion, ActiveDeadlineSeconds is a "total lifetime" timer that includes running time, so it kills healthy running jobs if they exceed the deadline. A better option is a field that specifically targets jobs stuck in Waiting status and doesn't affect jobs that are legitimately running or waiting for resources.
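To make the distinction concrete, here is a minimal Go sketch of the two timer semantics described above. It is an illustration only: `jobState`, `exceedsLifetimeDeadline`, and `exceedsWaitingTTL` are hypothetical names, not KubeRay types or functions, and the "total lifetime" behavior is modeled as this comment describes it rather than taken from the KubeRay source.

```go
package main

import "fmt"

// jobState is a simplified, hypothetical view of a RayJob used only for
// this illustration; it is not the KubeRay API type.
type jobState struct {
	Waiting        bool  // true while the job sits in Waiting status
	HasJobID       bool  // true once .spec.jobId has been set
	AgeSeconds     int64 // seconds since the job started being tracked
	WaitingSeconds int64 // seconds spent in Waiting status so far
}

// exceedsLifetimeDeadline models a "total lifetime" timer: it fires on
// age alone, regardless of whether the job is healthy and running.
func exceedsLifetimeDeadline(j jobState, deadlineSeconds int64) bool {
	return j.AgeSeconds >= deadlineSeconds
}

// exceedsWaitingTTL models the proposed waiting-only timer: it fires only
// for jobs that are still in Waiting status with no job ID submitted.
func exceedsWaitingTTL(j jobState, ttlSeconds int64) bool {
	return j.Waiting && !j.HasJobID && j.WaitingSeconds >= ttlSeconds
}

func main() {
	running := jobState{Waiting: false, HasJobID: true, AgeSeconds: 7200, WaitingSeconds: 60}
	stuck := jobState{Waiting: true, HasJobID: false, AgeSeconds: 7200, WaitingSeconds: 7200}

	// A long-running healthy job trips the lifetime timer but not the waiting timer.
	fmt.Println(exceedsLifetimeDeadline(running, 3600), exceedsWaitingTTL(running, 3600))
	// A job stuck in Waiting trips both.
	fmt.Println(exceedsLifetimeDeadline(stuck, 3600), exceedsWaitingTTL(stuck, 3600))
}
```

With a one-hour limit, the healthy two-hour job is caught only by the lifetime check, while the stuck job is caught by both, which is why a waiting-only check could be set aggressively without touching running jobs.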

EagleLo avatar Sep 03 '25 04:09 EagleLo

Hi @seanlaii, would this be possible to be a part of the https://github.com/ray-project/kuberay/issues/4018?

rueian avatar Sep 03 '25 23:09 rueian

> Hi @seanlaii, would this be possible to be a part of the #4018?

In my opinion, this functionality might be better served as its own distinct feature, separate from the DeletionStrategy. From what I can see, DeletionStrategy is primarily focused on handling the cleanup after a job has reached a terminal state (SUCCEEDED or FAILED). The logic is more about "post-completion" actions.

On the other hand, a timeout for the Waiting status feels more like a "pre-run" guardrail. It's semantically closer to the existing activeDeadlineSeconds field, which also manages timeouts during a job's active lifecycle.

Keeping these three concepts separate (DeletionStrategy for post-completion cleanup, activeDeadlineSeconds for running timeouts, and a new field such as waitingTTLSeconds for the pre-run phase) keeps the API clear: each field has a single responsibility.
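As a rough sketch of that split, the Go snippet below maps each lifecycle phase to the one setting that would govern it. The `phase` type, its three values, and `governingField` are hypothetical simplifications for illustration, and `waitingTTLSeconds` is only the field name proposed in this comment; it does not exist in KubeRay today.

```go
package main

import "fmt"

// phase is a simplified, hypothetical lifecycle stage; the real RayJob
// status values are more fine-grained.
type phase string

const (
	phaseWaiting  phase = "Waiting"
	phaseRunning  phase = "Running"
	phaseFinished phase = "Finished" // stands in for terminal states such as SUCCEEDED or FAILED
)

// governingField returns the name of the single setting that would apply
// in each phase under the proposed separation of concerns.
func governingField(p phase) string {
	switch p {
	case phaseWaiting:
		return "waitingTTLSeconds" // proposed name, not an existing field
	case phaseRunning:
		return "activeDeadlineSeconds"
	case phaseFinished:
		return "DeletionStrategy"
	default:
		return "" // unknown phase: no setting applies
	}
}

func main() {
	for _, p := range []phase{phaseWaiting, phaseRunning, phaseFinished} {
		fmt.Printf("%s -> %s\n", p, governingField(p))
	}
}
```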

WDYT? Thanks!

seanlaii avatar Sep 04 '25 02:09 seanlaii