
[Feature] Reduce race condition between sequential job submission

Open Jeffwan opened this issue 3 years ago • 4 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

When we submit jobs to an existing cluster, job1 might not be fully deleted from the cluster by the time job2 is created.

  1. T1 - job1 CR submitted
  2. T2 - job1 CR is deleted
  3. T3 - job2 CR is created

As a result, two jobs may be running in the cluster at the same time. As a user, I want to submit job2 only after job1 has fully terminated.

/cc @Basasuya

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Jeffwan, Sep 27 '22 18:09

I think this PR could help with queuing and gang dispatching: https://github.com/ray-project/kuberay/pull/598

asm582, Sep 28 '22 01:09

This is more a Ray issue than a KubeRay issue, but we can definitely discuss here.

If I understand right, the concern is Ray-internal: It's hard to tell if the first Ray job is completely done before sending the second one to the same cluster. @architkulkarni are you the main owner for the Ray job API? Do you have thoughts on how to guarantee clean job termination?

DmitriGekhtman, Sep 28 '22 02:09

@DmitriGekhtman Yup, that's me. Ray jobs support running concurrently, and internally there's no notion of waiting for one job to finish before scheduling the next. To do this with the Ray jobs SDK, you'd need to check the status in a loop until the first job returns a terminal status, like in the code sample here: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/sdk.html#submitting-a-ray-job. If you're using the Ray jobs CLI, ray job submit by default blocks the terminal and prints logs until the job reaches a terminal state.
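
A minimal sketch of that polling loop, not taken from the linked docs verbatim. The helper name wait_until_terminal, the timeout and poll interval defaults, and the address and entrypoints in the commented usage are all assumptions made for illustration. It relies only on the client exposing get_job_status(job_id), as the Ray jobs SDK's JobSubmissionClient does, and treats SUCCEEDED, FAILED, and STOPPED as terminal.

```python
import time

# Job states after which the Ray job will not run any further.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_until_terminal(client, job_id, timeout_s=600.0, poll_interval_s=1.0):
    """Poll client.get_job_status(job_id) until the job reaches a terminal state.

    Returns the terminal status as a string. Raises TimeoutError if the job is
    still non-terminal once timeout_s seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_job_status(job_id)
        # The SDK returns a string-valued enum; normalize so plain strings work too.
        status = str(getattr(status, "value", status))
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout_s}s")
        time.sleep(poll_interval_s)


# Hypothetical usage with the Ray jobs SDK (address and entrypoints are placeholders):
#   from ray.job_submission import JobSubmissionClient
#   client = JobSubmissionClient("http://127.0.0.1:8265")
#   job1 = client.submit_job(entrypoint="python job1.py")
#   wait_until_terminal(client, job1)
#   job2 = client.submit_job(entrypoint="python job2.py")
```

Note that this only waits on the Ray job status. Whether that is enough to cover the case where the job1 CR has been deleted but not fully cleaned up is the open question in this thread.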

architkulkarni, Sep 28 '22 16:09

@Jeffwan do you think polling for job completion is enough to enable sequential job submission?

DmitriGekhtman, Sep 28 '22 16:09

This seems to be a Ray issue rather than a KubeRay issue. Closing this issue.

kevin85421, Aug 31 '23 08:08