ray [Serve] Allow customizing the Ray Serve autoscaler's scale-down logic

Description

Currently, the Ray Serve autoscaler's scale-down logic is fixed. Allow users to customize it based on worker node, deployment, and replica, so that it can fit different kinds of business logic.

Use case

For our use case, we want to use both on-demand and spot instance nodes, and we want a minimum number of replicas to always stay on on-demand workers. By customizing the scale-down logic, we can ensure this by scaling down replicas on spot instance nodes first.
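To make the request concrete, here is a minimal sketch of the selection rule we have in mind. It is plain Python and does not use any Ray Serve API; the `Replica` class, the `node_type` field, and `pick_replicas_to_stop` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    # Hypothetical record; Ray Serve does not expose replicas in this form.
    replica_id: str
    node_type: str  # assumed to be either "on-demand" or "spot"


def pick_replicas_to_stop(
    replicas: List[Replica], num_to_stop: int, min_on_demand: int
) -> List[Replica]:
    """Choose replicas to stop: spot first, then on-demand above the floor."""
    spot = [r for r in replicas if r.node_type == "spot"]
    on_demand = [r for r in replicas if r.node_type == "on-demand"]

    # Take as many spot replicas as requested (or as many as exist).
    victims = spot[:num_to_stop]
    remaining = num_to_stop - len(victims)

    # Only touch on-demand replicas that are above the required minimum.
    removable_on_demand = max(0, len(on_demand) - min_on_demand)
    victims += on_demand[: min(remaining, removable_on_demand)]
    return victims


replicas = [
    Replica("r1", "on-demand"),
    Replica("r2", "on-demand"),
    Replica("r3", "on-demand"),
    Replica("r4", "spot"),
    Replica("r5", "spot"),
]
stopped = pick_replicas_to_stop(replicas, num_to_stop=3, min_on_demand=2)
print([r.replica_id for r in stopped])
```

With three on-demand and two spot replicas, stopping three replicas removes both spot replicas and one on-demand replica, leaving the two on-demand replicas that the floor requires.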

Nov 25 '25 03:11 manhld0206

Have you taken a stab at https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html#custom-autoscaling-policies yet?

Nov 25 '25 16:11 ok-scale

@ok-scale Yes, but it only supports returning the desired number of replicas, not which replicas to scale down or terminate.

Nov 26 '25 01:11 manhld0206

There is currently an effort to support label selectors in Serve: https://github.com/ray-project/ray/pull/57694. Once that lands, maybe this feature can be achieved in the following way:

On your deployment, set label_selector = on-demand and fallback = spot.
In your k8s cluster, set a maximum limit on the number of on-demand instances.
This means that beyond your acceptable threshold, new replicas will be scheduled on spot instances.
Downscaling would also evict replicas from spot first.
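
A rough model of the behavior described in the steps above, as a sketch only. It is plain Python that simulates the expected placement and eviction order and does not call Ray Serve or Kubernetes; `place_replicas`, `downscale`, and `on_demand_capacity` are hypothetical names, and the spot-first eviction is the assumed outcome, not confirmed behavior.

```python
from typing import List


def place_replicas(num_replicas: int, on_demand_capacity: int) -> List[str]:
    """Fill on-demand capacity first, then fall back to spot."""
    on_demand = min(num_replicas, on_demand_capacity)
    return ["on-demand"] * on_demand + ["spot"] * (num_replicas - on_demand)


def downscale(placement: List[str], target: int) -> List[str]:
    """Keep `target` replicas, evicting spot replicas before on-demand ones."""
    # Stable sort: on-demand replicas (key False) come before spot (key True),
    # so truncating the list drops spot replicas first.
    ordered = sorted(placement, key=lambda node_type: node_type != "on-demand")
    return ordered[:target]


placement = place_replicas(num_replicas=5, on_demand_capacity=3)
print(placement)
print(downscale(placement, target=4))
print(downscale(placement, target=2))
```

In this model the on-demand capacity limit acts as the "minimum on on-demand" threshold: as long as the target replica count stays at or above that limit, only spot replicas are evicted.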

Not a clean API though, but wdyt?

Dec 10 '25 06:12 abrarsheikh

@abrarsheikh That's perfect!

Dec 11 '25 02:12 manhld0206