kuberay
[Feature] Reduce race condition between sequential job submission
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
When we submit jobs sequentially to an existing cluster, job1 might not be fully deleted from the cluster by the time job2 is created.
- T1 - job1 CR submitted
- T2 - job1 CR is deleted
- T3 - job2 CR is created
As a result, two jobs may be running in the cluster at the same time. As a user, I want job2 to be submitted only after job1 has fully terminated.
/cc @Basasuya
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
I think this PR could help with queuing and gang dispatching: https://github.com/ray-project/kuberay/pull/598
This is more a Ray issue than a KubeRay issue, but we can definitely discuss here.
If I understand right, the concern is Ray-internal: It's hard to tell if the first Ray job is completely done before sending the second one to the same cluster. @architkulkarni are you the main owner for the Ray job API? Do you have thoughts on how to guarantee clean job termination?
@DmitriGekhtman Yup, that's me. Ray Jobs supports concurrently running jobs, and internally there's no notion of waiting for one job to finish before scheduling the next. To do this with the Ray Jobs SDK, you'd need to check the status in a loop until the first job returns a terminal status, as in the code sample here: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/sdk.html#submitting-a-ray-job. If you're using the Ray Jobs CLI, `ray job submit` by default blocks the terminal and prints logs until the job reaches a terminal state.
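For reference, the polling approach could look roughly like the sketch below. It assumes the Ray Jobs SDK's `JobSubmissionClient` with its `submit_job` and `get_job_status` methods, and treats `SUCCEEDED`, `STOPPED`, and `FAILED` as the terminal statuses. The helper names `wait_until_terminal` and `submit_sequentially`, the timeout, and the poll interval are all made up for illustration. Note that this only waits on the Ray job status; it does not by itself confirm that a KubeRay job CR has been fully deleted.

```python
import time

# Statuses assumed to be terminal for a Ray job.
TERMINAL_STATUSES = {"SUCCEEDED", "STOPPED", "FAILED"}


def wait_until_terminal(get_status, job_id, poll_interval_s=1.0,
                        timeout_s=600.0, sleep=time.sleep):
    """Poll get_status(job_id) until it reports a terminal status.

    get_status is any callable returning a status string or a str-valued
    enum, e.g. JobSubmissionClient.get_job_status. Returns the terminal
    status name, or raises TimeoutError once timeout_s has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(job_id)
        # Accept both plain strings and enum members with a .value attribute.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATUSES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {name} after {timeout_s}s")
        sleep(poll_interval_s)


def submit_sequentially(address, entrypoints):
    """Hypothetical helper: submit each entrypoint only after the previous
    job has reached a terminal status."""
    # Imported here so the polling helper above works without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    results = []
    for entrypoint in entrypoints:
        job_id = client.submit_job(entrypoint=entrypoint)
        results.append(wait_until_terminal(client.get_job_status, job_id))
    return results
```

Something like `submit_sequentially("http://127.0.0.1:8265", ["python job1.py", "python job2.py"])` would then run the two jobs back to back; the address and script names here are placeholders.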
@Jeffwan do you think polling for job completion is enough to enable sequential job submission?
This seems to be a Ray issue rather than a KubeRay issue. Closing this issue.