[Roadmap] KubeRay (or anything for Ray on K8s) Wishlist

Open kevin85421 opened this issue 10 months ago • 29 comments

What feature do you want to have in KubeRay or Ray? Please add an emoji to the following comments if you find them useful. Please briefly explain the feature you want in a single comment. This issue is not for discussion, only for voting and proposing. For discussion, send a message to #kuberay-discuss.

kevin85421 avatar Feb 10 '25 21:02 kevin85421

[Done: Ray 2.48.0]

Make Autoscaler V2 the default autoscaler option:

  • https://github.com/ray-project/kuberay/issues/2600
  • Once the known issues are fixed, it should be more stable and observable than v1.
  • Run Autoscaler V2 in a separate Pod instead of a container in the head Pod.

kevin85421 avatar Feb 10 '25 21:02 kevin85421

[WIP: v1.5.0]

RayService incremental upgrade:

  • https://github.com/ray-project/enhancements/pull/58
  • This avoids needing 2x the compute resources during the zero-downtime upgrade process.
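
As a rough illustration of where the 2x figure comes from, the sketch below compares the peak number of replicas for a full swap with a stepwise upgrade. The function name `peak_replicas`, the `step_fraction` parameter, and the stepping model itself are assumptions made for this example; they are not taken from the enhancement proposal.

```python
import math


def peak_replicas(replicas: int, step_fraction: float) -> int:
    """Peak number of replicas alive at once during an upgrade.

    Assumed model: each step starts ceil(replicas * step_fraction) new
    replicas (at most `replicas`), then retires the same number of old ones
    before the next step begins. step_fraction=1.0 is a full swap.
    """
    if replicas < 0 or not 0 < step_fraction <= 1:
        raise ValueError("need replicas >= 0 and 0 < step_fraction <= 1")
    step = min(replicas, math.ceil(replicas * step_fraction))
    return replicas + step


full_swap = peak_replicas(8, 1.0)     # old and new clusters fully overlap
incremental = peak_replicas(8, 0.25)  # a quarter of the replicas per step
print(full_swap, incremental)
```

Under this model, 8 replicas peak at 16 with a full swap and at 10 when a quarter of them move per step.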

kevin85421 avatar Feb 10 '25 21:02 kevin85421

[Done: v1.4.0] Standardize KubeRay API server

  • We found that some users built their own KubeRay API server. Standardizing the KubeRay API server speeds up future user adoption because they won't need to build their own KubeRay API server again.
  • Make KubeRay API server flexible: Currently, the KubeRay API server interface is not flexible. Users need to open a PR if a field is not exposed.

kevin85421 avatar Feb 10 '25 21:02 kevin85421

Idle cluster termination: https://github.com/ray-project/kuberay/issues/2998

  • Terminate a RayCluster if there is no running Ray job to save $$$$.
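
A minimal sketch of the decision this implies, assuming the controller can observe the number of running jobs and the time the cluster last became idle. The function `should_terminate` and all of its parameters are hypothetical and do not correspond to an existing KubeRay field or API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_terminate(
    running_jobs: int,
    idle_since: Optional[datetime],
    now: datetime,
    idle_timeout: timedelta,
) -> bool:
    """Return True when the cluster has been idle for at least idle_timeout.

    idle_since is None when the cluster has never been observed idle.
    """
    if running_jobs > 0 or idle_since is None:
        return False
    return now - idle_since >= idle_timeout


now = datetime(2025, 2, 10, 12, 0, tzinfo=timezone.utc)
print(should_terminate(0, now - timedelta(minutes=45), now, timedelta(minutes=30)))
```

A grace period like `idle_timeout` keeps a cluster alive between back-to-back jobs instead of tearing it down the moment the last job finishes.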

kevin85421 avatar Feb 10 '25 21:02 kevin85421

Documentation and Terraform for the reference architecture

  • For example: GPU scheduling, reducing image-pull overhead, logging, notifications, etc.

kevin85421 avatar Feb 10 '25 22:02 kevin85421

Light-weight job submitter:

  • https://github.com/ray-project/kuberay/issues/2537
  • This lets the K8s job submitter avoid pulling the Ray image, which is typically over 1 GB even in its thinnest version without ML libraries. I expect the light-weight job submitter to be less than 20 MB. This will improve the startup time of RayJob.

kevin85421 avatar Feb 10 '25 22:02 kevin85421

[WIP: v1.5.0]

Integrate Volcano with RayJob: currently, Volcano only integrates with RayCluster.

kevin85421 avatar Feb 10 '25 22:02 kevin85421

[WIP: v1.5.0]

Integrate YuniKorn with RayJob: currently, YuniKorn only integrates with RayCluster.

kevin85421 avatar Feb 10 '25 22:02 kevin85421

[Done: v1.5.0]

Support cron scheduling in RayJob

  • https://github.com/ray-project/kuberay/issues/2426

kevin85421 avatar Feb 10 '25 22:02 kevin85421

[Done: v1.4.0] KubeRay dashboard:

  • A frontend to visualize and manage (e.g. create / delete) KubeRay custom resources.
  • Something similar to the frontend from Roblox's Ray Summit talk

Image

kevin85421 avatar Feb 10 '25 22:02 kevin85421

[Done: v1.4.0] KubeRay operator emits metrics about cluster startup time and others

  • https://github.com/ray-project/kuberay/issues/2681
  • Example: https://github.com/ray-project/kuberay/issues/2681#issuecomment-2593807747

kevin85421 avatar Feb 10 '25 22:02 kevin85421

Multi-k8s support:

  • Better integration with https://kueue.sigs.k8s.io/docs/concepts/multikueue/

kevin85421 avatar Feb 11 '25 01:02 kevin85421

Multi-k8s / Multi-cloud support:

  • Better integration with SkyPilot

kevin85421 avatar Feb 11 '25 01:02 kevin85421

[Done: v1.4.0] Better support post-training libraries such as veRL and OpenRLHF

kevin85421 avatar Feb 11 '25 07:02 kevin85421

Ray IPv6 support: currently it is not possible to use Ray on an IPv6-only Kubernetes cluster.

  • https://github.com/ray-project/ray/pull/44252
  • https://github.com/ray-project/ray/pull/40332
  • https://github.com/ray-project/ray/issues/6967

aqemia-aymeric-alixe avatar Feb 11 '25 09:02 aqemia-aymeric-alixe

Update (Kai-Hsun): [Done: v1.4.0]

@kevin85421, @andrewsykim and I wrote some ideas in this Google doc: Ray Kubectl Plugin 1.4.0 Wishlist. Let us know if you'd like the ideas as individual comments here.

davidxia avatar Feb 11 '25 15:02 davidxia

Ability to limit the total size of a Ray cluster (across all worker groups, or ideally for selected subsets of groups) in terms of resources (CPUs, GPUs) rather than number of workers. That's what Kubernetes node pools support, for example, but it is not usable in KubeRay because the autoscaler only thinks in terms of Ray worker groups, not the underlying node pools, and will happily provision Pods beyond the CPU limits of the available node pools.
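
To make the request concrete, here is a small sketch of a resource-based cap, assuming each worker group declares per-worker CPU and GPU counts. `WorkerGroup`, `total_resources`, and `allowed_scale_up` are hypothetical names for illustration and are not part of KubeRay or the Ray autoscaler.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WorkerGroup:
    name: str
    replicas: int
    cpus_per_worker: int
    gpus_per_worker: int = 0


def total_resources(groups: List[WorkerGroup]) -> Tuple[int, int]:
    """Sum the CPUs and GPUs currently requested across all groups."""
    cpus = sum(g.replicas * g.cpus_per_worker for g in groups)
    gpus = sum(g.replicas * g.gpus_per_worker for g in groups)
    return cpus, gpus


def allowed_scale_up(
    groups: List[WorkerGroup], name: str, requested: int, max_cpus: int, max_gpus: int
) -> int:
    """How many of the requested workers fit under the cluster-wide caps."""
    used_cpus, used_gpus = total_resources(groups)
    group = next(g for g in groups if g.name == name)
    allowed = max(0, requested)
    if group.cpus_per_worker > 0:
        allowed = min(allowed, max(0, max_cpus - used_cpus) // group.cpus_per_worker)
    if group.gpus_per_worker > 0:
        allowed = min(allowed, max(0, max_gpus - used_gpus) // group.gpus_per_worker)
    return allowed


groups = [WorkerGroup("cpu", 4, 8), WorkerGroup("gpu", 2, 16, 4)]
print(total_resources(groups))
print(allowed_scale_up(groups, "gpu", 5, max_cpus=96, max_gpus=16))
```

The cap is applied to the sum over all groups, so a scale-up in one group is limited by what the other groups already consume.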

jleben avatar Feb 22 '25 10:02 jleben

Documentation and Terraform for the reference architecture

  • For example: GPU scheduling, reducing image-pull overhead, logging, notifications, etc.

Interested in how notifications should work. We are currently using a very jank solution: Kyverno injects a command into the job submitter Pod to deposit a notification event on our Kafka queue. So we turn the job submitter Pod command into something like bash -c "ray submit ... && send notification". But this solution has all sorts of Kyverno bugs, and we are working on migrating the logic to the controller. What is a good way to open source notification sending in the controller?
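
One detail worth noting about the `&&` approach is that it only sends a notification when the submission succeeds. A minimal sketch of emitting an event for either outcome is below; `run_and_notify` and the event fields are hypothetical, and the `submit` and `notify` callables stand in for whatever the job submitter and the queue client actually are.

```python
from typing import Callable, Dict, List


def run_and_notify(
    submit: Callable[[], int], notify: Callable[[Dict[str, object]], None]
) -> int:
    """Run the submission step, then emit exactly one event with its outcome."""
    exit_code = submit()
    status = "succeeded" if exit_code == 0 else "failed"
    notify({"status": status, "exit_code": exit_code})
    return exit_code


events: List[Dict[str, object]] = []
run_and_notify(lambda: 0, events.append)  # stand-in for a successful submission
run_and_notify(lambda: 1, events.append)  # stand-in for a failed submission
print(events)
```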

han-steve avatar Feb 24 '25 19:02 han-steve

Multi-k8s / Multi-cloud support:

  • Better integration with SkyPilot

This architecture seems like a great reference for considering KubeRay’s multi-cloud support and its integration with SkyPilot.

SkyRay: Seamlessly Extending KubeRay to Multi-Cluster Multi-Cloud Operation

nadongjun avatar Feb 25 '25 01:02 nadongjun

Very small, low-priority request: make raycluster_webhook.go fail hard here https://github.com/ray-project/kuberay/blob/35bbd62c7c7ef9d47c6b1cd4200164c985028221/ray-operator/pkg/webhooks/v1/raycluster_webhook.go#L86 with an actionable error message during RayCluster creation for the worker group replica user errors that are currently silently handled by the controller here: https://github.com/ray-project/kuberay/blob/35bbd62c7c7ef9d47c6b1cd4200164c985028221/ray-operator/controllers/ray/utils/util.go#L336-L345

The idea is to leave the controller behavior as is, but add more validation to the webhook for users who have chosen to enable the webhook.
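
A sketch of the kind of check this would add, written in Python for brevity even though the webhook itself is Go. It assumes the user errors in question are inconsistent replicas, minReplicas, and maxReplicas values; the function `validate_worker_group` and its messages are hypothetical and are not the webhook's actual code.

```python
from typing import List, Optional


def validate_worker_group(
    name: str, replicas: Optional[int], min_replicas: int, max_replicas: int
) -> List[str]:
    """Return actionable error messages; an empty list means the group is valid."""
    errors: List[str] = []
    if min_replicas < 0 or max_replicas < 0:
        errors.append(
            f"worker group {name!r}: minReplicas and maxReplicas must be non-negative"
        )
    if min_replicas > max_replicas:
        errors.append(
            f"worker group {name!r}: minReplicas ({min_replicas}) must not exceed "
            f"maxReplicas ({max_replicas})"
        )
    elif replicas is not None and not min_replicas <= replicas <= max_replicas:
        errors.append(
            f"worker group {name!r}: replicas ({replicas}) must be between "
            f"minReplicas ({min_replicas}) and maxReplicas ({max_replicas})"
        )
    return errors


print(validate_worker_group("gpu-group", replicas=5, min_replicas=1, max_replicas=3))
```

Rejecting the resource at admission time surfaces the mistake to the user immediately, whereas adjusting the value in the controller hides it.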

davidxia avatar Feb 26 '25 16:02 davidxia

Can you help support inference for huge LLMs that span multiple nodes? https://github.com/ray-project/kuberay/issues/2323

ganisback avatar Mar 04 '25 11:03 ganisback

Add https://github.com/ray-project/kuberay/issues/3271

rueian avatar Apr 03 '25 02:04 rueian

RayService incremental upgrade:

It would be nice to have ANY cluster upgrade. Right now the only approach that works is: delete the cluster, re-create the cluster. I don't understand why the Ray operator cannot do it for me.

pkit avatar Jul 29 '25 23:07 pkit

It would be nice to have ANY cluster upgrade.

@pkit do you want this for RayService only or also for RayCluster?

andrewsykim avatar Jul 29 '25 23:07 andrewsykim

It would be nice to have ANY cluster upgrade.

@pkit do you want this for RayService only or also for RayCluster?

I use only Ray Core. I don't use Ray Serve and don't care about it. What I would like is that when I upgrade the Docker image that runs on the worker Pods, the Pods shut down and restart with the new image. That's it. I manage the lifecycle of the workers myself. But I cannot manage the lifecycle of the cluster right now.

pkit avatar Jul 30 '25 00:07 pkit

@pkit thanks for clarifying. So basically you want the ability to modify the image of a worker group in RayCluster and have KubeRay automatically do a rolling upgrade of the workers. This feature request comes up a lot and is similar to https://github.com/ray-project/kuberay/issues/2534. I think we should consider it for v1.5, especially if we are already looking at incremental upgrade support in RayService. Much of the same code should be reusable.
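
A minimal sketch of the selection step such an upgrade could use, assuming the controller can list each worker Pod's current image. The function `pods_to_restart` and the `max_unavailable` parameter are hypothetical and do not describe existing KubeRay behavior.

```python
from typing import Dict, List


def pods_to_restart(
    pod_images: Dict[str, str], desired_image: str, max_unavailable: int
) -> List[str]:
    """Pick the next batch of worker Pods whose image differs from the desired one.

    Names are sorted so that repeated calls choose Pods in a stable order.
    """
    outdated = sorted(
        name for name, image in pod_images.items() if image != desired_image
    )
    return outdated[: max(0, max_unavailable)]


pods = {"worker-a": "ray:old", "worker-b": "ray:new", "worker-c": "ray:old"}
print(pods_to_restart(pods, "ray:new", max_unavailable=1))          # one Pod per round
print(pods_to_restart(pods, "ray:new", max_unavailable=len(pods)))  # all outdated Pods at once
```

Setting `max_unavailable` to the number of Pods gives the simpler "shut everything down and restart" behavior, while a smaller value gives a rolling upgrade.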

andrewsykim avatar Jul 30 '25 00:07 andrewsykim

@pkit thanks for clarifying. So basically you want the ability to modify the image of a worker group in RayCluster and have KubeRay automatically do a rolling upgrade of the workers. This feature request comes up a lot and is similar to #2534. I think we should consider it for v1.5, especially if we are already looking at incremental upgrade support in RayService. Much of the same code should be reusable.

Yup, I read the plan for the Ray Serve upgrade a while ago and it was nice. But I don't plan on using Ray Serve any time soon. I don't even care if upgrades are rolling; just shutting down all workers and restarting them is good enough. We can probably PR that feature too, as we have some spare capacity to spend on implementing it.

pkit avatar Jul 30 '25 00:07 pkit

@pkit Would you mind opening a new issue and linking it in your comment instead? I’d like to keep this issue focused on the proposal, and we can continue the discussion there. Thanks!

kevin85421 avatar Jul 30 '25 00:07 kevin85421

@kevin85421 #3905

pkit avatar Jul 30 '25 00:07 pkit