
[Feature] Can a RayJob run in a local cluster?

Open shuaiyy opened this issue 1 year ago • 7 comments

Search before asking

  • [X] I have searched the issues and found no similar feature request.

Description

Run a job in a local cluster, without building a remote RayCluster.

Use case

In my case, I have many small jobs that can run on a single node with few resources and finish within 60 seconds. With a RayCluster, it takes roughly an additional 60 seconds to build the cluster before the job runs its code.

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

shuaiyy avatar Jun 26 '24 03:06 shuaiyy

Run a job in a local cluster, without building a remote RayCluster.

A RayJob can run on an existing cluster using the clusterSelector field. This way you can create a single RayCluster and then run multiple RayJobs against that RayCluster. Would that work for you?
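As a rough illustration of the clusterSelector approach, the sketch below builds a RayJob manifest in Python that points at an already-running RayCluster instead of defining its own. The job name, cluster name, and entrypoint are hypothetical, and the assumption that the selector matches on the `ray.io/cluster` label should be checked against the KubeRay version in use.

```python
import json


def rayjob_on_existing_cluster(job_name: str, cluster_name: str, entrypoint: str) -> dict:
    """Build a RayJob manifest that targets an already-running RayCluster."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": job_name},
        "spec": {
            "entrypoint": entrypoint,
            # Assumption: the selector matches on the ray.io/cluster label,
            # whose value is the name of the existing RayCluster.
            "clusterSelector": {"ray.io/cluster": cluster_name},
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    manifest = rayjob_on_existing_cluster(
        "small-job-1", "shared-cluster", "python my_script.py"
    )
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped into `kubectl apply -f -`.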

andrewsykim avatar Jun 26 '24 14:06 andrewsykim

Run a job in a local cluster, without building a remote RayCluster.

you can create a single RayCluster and then run multiple RayJobs against that RayCluster.

Thanks, but that doesn't work in my case. First, we want a quick run that returns a result, so we use several kinds of images with different pre-installed dependencies. Second, even with an existing cluster with HPA, KubeRay still needs to create a K8s Job to submit the job. If there are not enough resources, the RayCluster HPA adds more seconds.

Could we run the RayJob inside the submitter Job's Pod?

shuaiyy avatar Jun 27 '24 03:06 shuaiyy

KubeRay still needs to create a K8s Job to submit the job

There's an HTTP submission mode that doesn't use a submitter Job: https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/ray/v1/rayjob_types.go#L92
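A minimal sketch of opting into that mode, assuming the field is `spec.submissionMode` and that the value `HTTPMode` selects HTTP submission (the linked Go types are the authority here). The helper name is hypothetical; it only patches a manifest held as a Python dict.

```python
import copy


def use_http_submission(rayjob: dict) -> dict:
    """Return a copy of a RayJob manifest with spec.submissionMode set to HTTPMode."""
    patched = copy.deepcopy(rayjob)
    # Assumption: "HTTPMode" makes the operator submit the job over HTTP
    # rather than creating a submitter K8s Job.
    patched.setdefault("spec", {})["submissionMode"] = "HTTPMode"
    return patched


if __name__ == "__main__":
    base = {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": "small-job-1"},  # hypothetical name
        "spec": {"entrypoint": "python my_script.py"},
    }
    print(use_http_submission(base)["spec"])
```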

andrewsykim avatar Jun 27 '24 03:06 andrewsykim

You can use the runtime environment to install the dependencies: https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments. Simply add them here: https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/ray/v1/rayjob_types.go#L87.
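For illustration, here is one way the runtime environment could be attached to a RayJob spec from Python. This assumes the field at the linked line is `runtimeEnvYAML` and that it takes a YAML string. The helper name and package pins are hypothetical. The string is produced with `json.dumps`, relying on JSON being accepted by YAML parsers.

```python
import json


def with_runtime_env(spec: dict, pip_packages: list, env_vars: dict = None) -> dict:
    """Return a copy of a RayJob spec with a runtime environment attached."""
    runtime_env = {"pip": list(pip_packages)}
    if env_vars:
        runtime_env["env_vars"] = dict(env_vars)
    new_spec = dict(spec)
    # Assumption: runtimeEnvYAML holds the runtime environment as a YAML string;
    # a JSON document is used here because it is also valid YAML.
    new_spec["runtimeEnvYAML"] = json.dumps(runtime_env)
    return new_spec


if __name__ == "__main__":
    spec = {"entrypoint": "python my_script.py"}
    # Hypothetical dependency pins, for illustration only.
    print(with_runtime_env(spec, ["requests==2.26.0"], {"MODE": "fast"}))
```

Installing dependencies this way happens at job start, so it trades image count for some start-up time on each job.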

Yicheng-Lu-llll avatar Jun 27 '24 13:06 Yicheng-Lu-llll

You can use the runtime environment to install the dependencies: https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments. Simply add them here: https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/ray/v1/rayjob_types.go#L87.

Thanks. We already use these features when running distributed training jobs.

shuaiyy avatar Jun 28 '24 02:06 shuaiyy

In the end, I decided to use the KubeRay RayJob for distributed training, and a VolcanoJob or native K8s Job for single-Pod jobs.

When running in a single Pod, I'm not sure whether it's okay to run ray start --head && ray job submit

shuaiyy avatar Jun 28 '24 02:06 shuaiyy

I'm not sure whether it's okay to run ray start --head && ray job submit

ray job submit sends requests to the Ray dashboard. However, it still takes a while for the Ray dashboard to be ready for job requests after the ray start command returns.
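One way to handle that gap in a single Pod is to poll the dashboard before submitting. The sketch below assumes the dashboard listens on the default port 8265 and that a successful GET on `/api/version` is an adequate readiness signal. The function names and timeouts are hypothetical.

```python
import subprocess
import time
import urllib.error
import urllib.request

DASHBOARD = "http://127.0.0.1:8265"  # assumption: default dashboard address


def dashboard_ready(url: str = DASHBOARD + "/api/version") -> bool:
    """Return True once the dashboard answers an HTTP GET with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe, timeout_s: float = 60.0, interval_s: float = 0.5) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def run_single_pod_job(entrypoint: list) -> int:
    """Start a local head node, wait for the dashboard, then submit the job."""
    subprocess.run(["ray", "start", "--head"], check=True)
    if not wait_until(dashboard_ready):
        raise TimeoutError("Ray dashboard did not become ready in time")
    cmd = ["ray", "job", "submit", "--address", DASHBOARD, "--", *entrypoint]
    return subprocess.run(cmd).returncode
```

For example, `run_single_pod_job(["python", "my_script.py"])` would return the exit code of `ray job submit`, which the container entrypoint can pass on as its own exit status.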

With a RayCluster, it takes roughly an additional 60 seconds to build the cluster before the job runs its code.

If you only create a single Pod RayCluster, there should be no overhead compared to a single K8s Pod. The RayJob CRD works as follows:

  1. Create a RayCluster.
  2. Wait for the RayCluster to be "ready".
  3. Create a submitter K8s Job.
  4. Use ray job submit to submit the job to the Ray head node.

The overhead likely comes from step 3. We are currently working on a design doc, https://docs.google.com/document/d/1hCJsrCFYPJLS3Zusdr8N_4Y5leWUMy4bQEbsqSQp2mw/edit, that describes a way to avoid the overhead of step 3. It is still a work in progress, but feel free to comment and give us feedback.

kevin85421 avatar Jul 07 '24 22:07 kevin85421

Closed this issue because https://docs.google.com/document/d/1hCJsrCFYPJLS3Zusdr8N_4Y5leWUMy4bQEbsqSQp2mw/edit?tab=t.0 has already been implemented.

kevin85421 avatar Dec 14 '24 22:12 kevin85421