[Serve] Allow customizing the Ray Serve autoscaler's scale-down logic
Description
Currently, the Ray Serve autoscaler logic is fixed. Allow users to customize the scale-down logic based on worker node, deployment, or replica, to fit different kinds of business logic.
Use case
For our use case, we want to use both on-demand and spot instance nodes, and we want a minimum number of replicas to always stay on on-demand workers. By customizing the scale-down logic, we can ensure this by scaling down replicas on spot instance nodes first.
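To make the request concrete, here is a minimal sketch of the selection rule we have in mind. It is plain Python and does not use any Ray Serve API; `Replica`, `node_type`, and `pick_replicas_to_stop` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    # Hypothetical record; not a Ray Serve type.
    replica_id: str
    node_type: str  # "on-demand" or "spot"


def pick_replicas_to_stop(
    replicas: List[Replica], num_to_stop: int, min_on_demand: int
) -> List[Replica]:
    """Choose replicas to terminate, taking spot replicas first.

    On-demand replicas are only chosen while more than `min_on_demand`
    of them would remain, so fewer than `num_to_stop` may be returned.
    """
    spot = [r for r in replicas if r.node_type == "spot"]
    on_demand = [r for r in replicas if r.node_type == "on-demand"]

    # Spot replicas are always the first candidates.
    victims = spot[:num_to_stop]
    remaining = num_to_stop - len(victims)

    # Only on-demand replicas above the floor may be removed.
    removable = max(0, len(on_demand) - min_on_demand)
    victims += on_demand[: min(remaining, removable)]
    return victims


if __name__ == "__main__":
    fleet = [
        Replica("od-1", "on-demand"),
        Replica("od-2", "on-demand"),
        Replica("spot-1", "spot"),
        Replica("spot-2", "spot"),
        Replica("od-3", "on-demand"),
    ]
    chosen = pick_replicas_to_stop(fleet, num_to_stop=3, min_on_demand=2)
    print([r.replica_id for r in chosen])
```

With a hook like this, the autoscaler would still decide how many replicas to remove, and the user-supplied function would only decide which ones.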
Have you taken a stab at https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html#custom-autoscaling-policies yet?
@ok-scale Yes. But it only supports returning the desired number of replicas, not which replicas to scale down/terminate.
There is currently an effort to support label selectors in Serve: https://github.com/ray-project/ray/pull/57694. Once that lands, maybe this feature can be achieved in the following way:
- on your deployment, set label_selector = on-demand and fallback = spot
- in your k8s cluster, set a maximum limit on the number of on-demand instances
- which means beyond your acceptable threshold, new replicas will be scheduled on spot instances.
- downscaling would also evict replicas from spot first.
Not a clean API, but wdyt?
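The steps above can be modeled with a small simulation. This is only a sketch of the intended outcome, not Ray Serve or Kubernetes code: `place_replicas`, `downscale`, and `on_demand_capacity` are hypothetical names, and the spot-first eviction order is taken from the description above as an assumption rather than confirmed Serve behavior.

```python
from typing import List


def place_replicas(num_replicas: int, on_demand_capacity: int) -> List[str]:
    """Place replicas on on-demand nodes until the cap is hit, then on spot."""
    placements: List[str] = []
    for _ in range(num_replicas):
        if placements.count("on-demand") < on_demand_capacity:
            placements.append("on-demand")  # preferred label
        else:
            placements.append("spot")  # fallback label
    return placements


def downscale(placements: List[str], target: int) -> List[str]:
    """Shrink to `target` replicas, evicting spot replicas first (assumed order)."""
    kept = list(placements)
    while len(kept) > target:
        if "spot" in kept:
            kept.remove("spot")
        else:
            kept.remove("on-demand")
    return kept


if __name__ == "__main__":
    fleet = place_replicas(num_replicas=5, on_demand_capacity=3)
    print(fleet)
    print(downscale(fleet, target=3))
```

In this model the on-demand cap doubles as the minimum that survives downscaling, as long as the target replica count never drops below the cap.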
@abrarsheikh That's perfect!