kuberay
[Feature] Add/remove instances from an active Ray Cluster
Search before asking
- [X] I have searched the issues and found no similar feature request.
Description
With a Ray cluster up and running, it would be nice to be able to add more instances to the cluster (or remove inactive instances).
An enhancement to this feature would be to provide a min/max in the spec, so that the Ray cluster automatically allocates/deallocates instances based on the active workload.
@Jeffwan
Use case
From time to time, we may find that we didn't allocate enough instances for a Ray cluster and would like to add more without starting everything over.
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Technically, this is already supported: worker groups can be added or removed by modifying the RayCluster custom resource. However, removing workers manually is somewhat dangerous because the operator is not aware of actors running on the nodes nominated for deletion.
This feature is reasonable, and I think people use it to add a GPU machine group, groups with different labels, and so on. Let's check whether we have enough documentation for this feature.
/cc @chenk008 @akanso I think you have similar usage?
Yes, today we can add/remove workers from the worker groups. We can remove random workers by changing the replicas, or remove specific ones by specifying the pod name in the scaleStrategy and decrementing the replicas attribute.
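To make the two options above concrete, here is a minimal sketch in Python that edits a RayCluster manifest held as a plain dict. It assumes the worker groups live under `spec.workerGroupSpecs`, that each has a `groupName` and `replicas`, and that specific pods are listed under `scaleStrategy.workersToDelete`. Please check these field names against the CRD version you run. The helper `scale_worker_group`, the group name, and the pod name are hypothetical and only for illustration. Applying the result to the cluster (e.g. with `kubectl apply`) is left out.

```python
import copy


def scale_worker_group(cluster, group_name, delta=0, pods_to_delete=()):
    """Return a modified copy of a RayCluster manifest (hypothetical helper).

    delta          -- workers to add (positive) or remove at random (negative)
    pods_to_delete -- names of specific worker pods to remove; replicas is
                      decremented once per listed pod
    """
    updated = copy.deepcopy(cluster)  # leave the caller's manifest untouched

    # Find the worker group by name (field names are assumptions, see above).
    for group in updated["spec"]["workerGroupSpecs"]:
        if group["groupName"] == group_name:
            break
    else:
        raise KeyError(f"worker group {group_name!r} not found")

    pods = list(pods_to_delete)
    new_replicas = group.get("replicas", 0) + delta - len(pods)
    if new_replicas < 0:
        raise ValueError("replicas would become negative")
    group["replicas"] = new_replicas

    if pods:
        # Name the exact pods to remove instead of letting random ones go.
        strategy = group.setdefault("scaleStrategy", {})
        strategy.setdefault("workersToDelete", []).extend(pods)

    return updated


# Illustrative manifest fragment; only the fields used here are shown.
cluster = {
    "spec": {
        "workerGroupSpecs": [
            {"groupName": "small-group", "replicas": 3},
        ]
    }
}

scaled_up = scale_worker_group(cluster, "small-group", delta=2)
scaled_down = scale_worker_group(
    cluster, "small-group", pods_to_delete=["worker-pod-abc"]
)
```

Note that this only rewrites the manifest. As mentioned earlier in the thread, the operator does not know which actors run on the removed pods, so choosing which pods are safe to delete is still up to the user.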