
[Feature] Users should be able to define custom resources in worker groups

Open ebr opened this issue 2 years ago • 7 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

When using the original ray-operator, the cluster.ray.io/v1 API included a spec for rayResource, which could be used for tagging worker groups as providers of custom, user-defined resources. This seems to be missing from the ray.io/v1alpha1 API, and it would be useful to have it back.

Use case

A use case for this might be to deploy a heterogeneous cluster with multiple worker groups, where each worker group uses a different image packaged with different third-party utilities. Tasks that require specific utilities could then be marked as requiring the corresponding resource, and would only execute on the workers that provide it.

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

ebr avatar Mar 02 '22 20:03 ebr

@ebr Are you using an old ray-operator? The CRD changed last year, and rayResource doesn't exist any more.

chenk008 avatar Mar 03 '22 09:03 chenk008

That is possible - we've been using ray-operator for a few months. I'm mainly asking whether there are any plans to bring back some kind of mechanism for defining custom resources. Also, we were using rayResources: {"CPU":0} on the head node to prevent the head from doing computational workloads. Wondering how we can achieve this now without rayResources.

ebr avatar Mar 03 '22 17:03 ebr

@ebr https://github.com/ray-project/kuberay/blob/master/docs/best-practice/worker-head-reconnection.md shows how to set 0 CPUs by setting the startup parameter num-cpus.

We have also been trying to configure custom resources for our worker groups, but we could not get the pod to launch.

We configured the worker group with:

rayStartParams:
  redis-password: 'foobared'
  node-ip-address: $MY_POD_IP
  block: 'true'
  resources: '{ "only_cpu" : 9001 }'

The pod creation fails with:

Error: Got unexpected extra arguments (only_cpu : 9001 })

The weird thing is that when we change it a little bit, a different error appears:

rayStartParams:
  redis-password: 'foobared'
  node-ip-address: $MY_POD_IP
  block: 'true'
  resources: '{"only_cpu":9001}'

The pod fails with:

2022-03-07 23:36:03,162 PANIC scripts.py:503 -- Valid values look like this: `--resources='{"CustomResource3": 1, "CustomResource2": 2}'`

Error:

Valid values look like this: `{}`
2022-03-07 23:36:03,162 ERR scripts.py:500 -- `--resources` is not a valid JSON string.
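Both failures are consistent with ordinary shell parsing. This is an assumption, since the thread does not confirm how the command line is built: suppose the operator pastes each rayStartParams value, unquoted, after the flag name, and a shell then parses the result. The sketch below uses Python's shlex module to mimic POSIX word splitting and quote removal; `build_command` is a hypothetical stand-in, not KubeRay code.

```python
import json
import shlex


def build_command(resources_value: str) -> str:
    # Hypothetical stand-in for how the command might be assembled:
    # the YAML value is pasted after `--resources=` with no extra quoting.
    return f"ray start --resources={resources_value} --block"


def parse_like_shell(command: str) -> list:
    # shlex.split applies POSIX-style whitespace splitting and quote removal.
    return shlex.split(command)


# First attempt: the spaces inside the value split it into several arguments.
first = parse_like_shell(build_command('{ "only_cpu" : 9001 }'))
print(first)

# Second attempt: no spaces, but the double quotes are removed,
# so what reaches the flag is no longer JSON.
second = parse_like_shell(build_command('{"only_cpu":9001}'))
print(second)

resources_arg = second[2].split("=", 1)[1]
try:
    json.loads(resources_arg)
    second_is_json = True
except json.JSONDecodeError:
    second_is_json = False
print(second_is_json)
```

Under that assumption, the first value leaves `only_cpu`, `:`, `9001` and `}` as stray arguments, which lines up with the "unexpected extra arguments" message, and the second value arrives as `{only_cpu:9001}`, which is not valid JSON.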

In the commit https://github.com/ray-project/kuberay/commit/d54ea709b12da4c0537b5289fba3e9cd24ef9fc9 there is the following comment:

      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the unfortunate format demonstrated below.

But there is no "demonstration below".

juangtato-ds avatar Mar 08 '22 11:03 juangtato-ds

@juangtato-ds Thank you for pointing me to this! I figured it out: the "unfortunate format" is that you must wrap the JSON in single quotes and escape the double quotes. So when deploying Ray clusters using the Helm chart, this worked:

resources: "'{\"customRes\": 1, \"anotherOne\": 2}'"

resulting in the following command in the pod spec:

ray start --resources='{"customRes": 1, "anotherOne": 2}' --block ....
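Writing these escapes by hand is error-prone, so one option is to generate the value. This is a minimal sketch, assuming the value ends up unquoted in a shell command as in the `ray start` line above; `ray_resources_param` is a hypothetical helper name, not part of Ray or KubeRay.

```python
import json
import shlex


def ray_resources_param(resources: dict) -> str:
    # Hypothetical helper: serialize the mapping to JSON, then wrap it in
    # single quotes so a POSIX shell passes the JSON through untouched.
    payload = json.dumps(resources)
    # A single quote inside the JSON would end the shell quoting early.
    if "'" in payload:
        raise ValueError("resource names must not contain single quotes")
    return f"'{payload}'"


param = ray_resources_param({"customRes": 1, "anotherOne": 2})
print(param)

# For plain ASCII values like these, a JSON string literal is also a valid
# YAML double-quoted scalar, so it can be used as the YAML value directly.
yaml_line = f"resources: {json.dumps(param)}"
print(yaml_line)

# Check that shell-style parsing hands back the original mapping.
argv = shlex.split(f"ray start --resources={param} --block")
recovered = json.loads(argv[2].split("=", 1)[1])
print(recovered)
```

The generated `yaml_line` has the same shape as the hand-written Helm value: single quotes for the shell on the inside, escaped double quotes for YAML on the outside.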

ebr avatar Mar 08 '22 22:03 ebr

@ebr thanks! Didn't try out that one. It also worked for us.

For this scenario, maybe the resources attribute specification should accept a map, something like:

rayStartParams:
  # ...
  resources: 
    customRes: 1
    anotherOne: 2

juangtato-ds avatar Mar 09 '22 13:03 juangtato-ds

The current way of specifying resources in Ray start params is pretty painful; this should definitely be fixed.

DmitriGekhtman avatar May 12 '22 02:05 DmitriGekhtman

Starting to work on this now.

DmitriGekhtman avatar May 31 '22 23:05 DmitriGekhtman

Make head more stable: when creating the cluster, allocate sufficient amount of resources on head pod such that it tends to be stable and not easy to crash. You can also set {"num-cpus": "0"} in "rayStartParams" of "headGroupSpec" such that Ray scheduler will skip the head node when scheduling workloads. This also helps to maintain the stability of the head.

https://github.com/ray-project/kuberay/blob/master/docs/best-practice/worker-head-reconnection.md#best-practice

^ Ray scheduler will skip the head node when scheduling workloads.

kevin85421 avatar Feb 21 '23 18:02 kevin85421

We decided to stick with rayStartParams["resources"] as the way to do this: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#id1 It may still be possible to simplify the required format for the resource string, though.

DmitriGekhtman avatar Feb 21 '23 20:02 DmitriGekhtman