[Feature] `RayJob` with `Waiting` status needs a ttl mechanism
Search before asking
- [x] I had searched in the issues and found no similar feature requirement.
Description
I'm using RayJob with InteractiveMode. Sometimes a RayJob is left in an incomplete state: it has been created, but no Ray job has been submitted (the .spec.jobId field is missing).
Currently, these RayJobs are left to hang forever in Waiting status.
It would be great if they could be cleaned up instead. ttlSecondsAfterFinished and ActiveDeadlineSeconds aren't good options to control this behavior: the former only takes effect once the job has finished, and the latter tracks active run time, not waiting time. Perhaps a new field is needed for this purpose.
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Hi @Future-Outlier , mind if I take this one? I have an idea for how to implement it.
Does ActiveDeadlineSeconds already work for this?
In my opinion, ActiveDeadlineSeconds is a "total lifetime" timer that includes running time, so it kills healthy running jobs if they exceed the deadline. A better option is a field that specifically targets jobs stuck in Waiting status and doesn't affect jobs that are legitimately running or waiting for resources.
Hi @seanlaii, could this be handled as part of https://github.com/ray-project/kuberay/issues/4018?
In my opinion, this functionality might be better served as its own distinct feature, separate from the DeletionStrategy.
From what I can see, DeletionStrategy is primarily focused on handling the cleanup after a job has reached a terminal state (SUCCEEDED or FAILED). The logic is more about "post-completion" actions.
On the other hand, a timeout for the Waiting status feels more like a "pre-run" guardrail. It's semantically closer to the existing activeDeadlineSeconds field, which also manages timeouts during a job's active lifecycle.
By keeping these three concepts separate (DeletionStrategy for post-completion cleanup, activeDeadlineSeconds for running timeouts, and a new field such as waitingTTLSeconds for the pre-run wait), we can keep the API clear. Each field would have a single responsibility.
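As a rough sketch of that split, the snippet below maps each lifecycle phase to the one field that would govern it. Everything here is a simplification for discussion: `Phase`, `lifecycleSpec`, and `governingField` are hypothetical names, `WaitingTTLSeconds` is the proposed field, and `DeletionStrategy` is reduced to a placeholder string. It does not claim to match how the controller actually evaluates these fields today.

```go
package main

import "fmt"

// Phase is a simplified, hypothetical view of a RayJob's lifecycle.
type Phase string

const (
	PhaseWaiting  Phase = "Waiting"  // created, no Ray job submitted yet
	PhaseRunning  Phase = "Running"  // Ray job is active
	PhaseFinished Phase = "Finished" // terminal state (SUCCEEDED or FAILED)
)

// lifecycleSpec is a stand-in for the relevant RayJob spec fields.
type lifecycleSpec struct {
	WaitingTTLSeconds     *int32  // proposed field, hypothetical
	ActiveDeadlineSeconds *int32  // running timeout
	DeletionStrategy      *string // placeholder for the real strategy type
}

// governingField returns the name of the single field responsible for the
// given phase, or "" if that field is not set on the spec.
func governingField(spec lifecycleSpec, phase Phase) string {
	switch phase {
	case PhaseWaiting:
		if spec.WaitingTTLSeconds != nil {
			return "waitingTTLSeconds"
		}
	case PhaseRunning:
		if spec.ActiveDeadlineSeconds != nil {
			return "activeDeadlineSeconds"
		}
	case PhaseFinished:
		if spec.DeletionStrategy != nil {
			return "deletionStrategy"
		}
	}
	return ""
}

func main() {
	waitTTL := int32(600)
	spec := lifecycleSpec{WaitingTTLSeconds: &waitTTL}

	fmt.Printf("%q\n", governingField(spec, PhaseWaiting)) // "waitingTTLSeconds"
	fmt.Printf("%q\n", governingField(spec, PhaseRunning)) // ""
}
```

Because each phase consults exactly one field, setting the waiting timeout cannot change what happens to running or finished jobs.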
WDYT? Thanks!