Add new serve autoscaling parameter `scaling_function`
Why are these changes needed?
Currently, the Serve autoscaler makes scaling decisions based only on the most recent Serve controller computation, even if the controller has made many scaling calculations over the scaling delay period. This results in poor autoscaling behavior when clusters use long upscale/downscale delays.
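As a rough illustration of the intended usage (a sketch only: the string value shown for `scaling_function` is a placeholder, not the final API surface; the other fields are existing `autoscaling_config` options):

```python
from ray import serve

# Rough usage sketch. The accepted values for scaling_function (e.g. "max")
# are placeholders here, not the final API; the other fields are existing
# autoscaling_config options.
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_num_ongoing_requests_per_replica": 5,
        "upscale_delay_s": 30.0,
        "downscale_delay_s": 600.0,
        # New: how to aggregate the per-control-loop decisions made during the
        # delay window before committing a scaling action.
        "scaling_function": "max",
    }
)
class MyModel:
    async def __call__(self, request):
        ...
```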
Related issue number
https://github.com/ray-project/ray/issues/46497
Checks
- [✓] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [✓] I've run `scripts/format.sh` to lint the changes in this PR.
- [✓] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [✓] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [✓] Unit tests
  - [✓] Release tests
  - [ ] This PR is not tested :(
@edoakes @zcin @GeneDer Any ETA on getting some eyes on this?
@Stack-Attack Could you help me understand the PR better? My understanding is this PR does two things:
- Records a history of ongoing request data, and averages the decisions made over the past history (since the last time a scaling decision was made) for the final decision.
- Provides the option to swap out averaging for min/maxing.
However, for (1), there is an existing field named look_back_period_s over which the number of ongoing requests is averaged. So although we are only using the "most recent computation" when deciding how many replicas to scale to, that computation is taking into account the data over a window of look_back_period_s seconds. If you have very long upscale/downscale delays and would prefer to increase the window of time over which the data is aggregated, you can also increase look_back_period_s.
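As a rough sketch of what I mean (illustrative numbers, ignoring smoothing factors and replica bounds):

```python
import math

# Rough illustration only: the single "most recent" decision already averages
# every sample reported over the last look_back_period_s seconds.
ongoing_samples = [42, 55, 61, 58, 49]  # total ongoing requests, sampled over the window
target_per_replica = 5

avg_ongoing = sum(ongoing_samples) / len(ongoing_samples)
desired_replicas = math.ceil(avg_ongoing / target_per_replica)
print(desired_replicas)  # -> 11: one decision, but it reflects the whole window
```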
As for (2), it is true we currently don't support min/max as functions for deciding the number of replicas to scale to. Could you go into more detail on what kind of use case you are looking to support?
Good questions!
(1) look_back_period_s is a moving average on the ongoing requests, which inherently means the longer you make it, the less sensitive the autoscaler becomes for all scaling decisions (up or down). In general, I think this is an important metric, but it's not a solution to the problem. It could be feasible to add additional aggregate function options here, but since this is calculated upstream of all scaling logic, it doesn't seem preferable. Currently all of the important autoscaling configs (i.e. upscale delay/smoothing) are calculated based on replica count decisions, and I think this is a good pattern.
(2) The most basic use case would be to easily ensure you always scale to accommodate the maximum load. If the decision is based on the maximum value, you can more easily ensure availability. Alternatively, avg and min could be used for more conservative scaling strategies.
The intended goal here is to simplify autoscaling configs for deployments where you aren't sure what the traffic pattern is yet, which is a common problem we've run into when going to production with Serve. Also, I think the fact that downscale_delay does not consider interim decisions is rather unintuitive.
If you have any suggestions I'm happy to make some changes!
Ah, one more comment on (1). Increasing look_back_period_s also makes tuning target_num_ongoing_requests very difficult. Usually you can calculate a good value based on the processing time of one request, but as the look_back_period increases this relationship doesn't necessarily hold.
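For example (purely illustrative numbers, not from this PR):

```python
# Back-of-envelope tuning of the target from per-request processing time.
request_processing_time_s = 0.2      # measured time to process one request
acceptable_in_flight_time_s = 1.0    # rough budget of queued work per replica

target_num_ongoing_requests = acceptable_in_flight_time_s / request_processing_time_s
print(target_num_ongoing_requests)   # -> 5.0, a reasonable starting target
```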
@zcin Checking in again. This PR would greatly improve our production deployments.
Bumping this again. Trying to avoid building and managing our own fork, but this might be useful enough to us to do so for the time being. (cc. @zcin @edoakes)
@kyle-v6x Sorry for the delay!
> look_back_period_s is a moving average on the ongoing requests, which inherently means the longer you make it, the less sensitive the autoscaler becomes for all scaling decisions (up or down).
Hmm, if increasing look_back_period_s will make the autoscaler less sensitive to changes in metrics, wouldn't using a second-layer average window also affect the sensitivity to metrics? What is the difference?
> The most basic use case would be to easily ensure you always scale to accommodate the maximum load. If the decision is based on the maximum value, you can more easily ensure availability. Alternatively, avg and min could be used for more conservative scaling strategies.
This makes sense. I also did some research into how Kubernetes does horizontal pod autoscaling, and they implement a stabilization window (default 5 minutes), during which they select the max value calculated and use it for the next scaling decision. I think it is reasonable to implement this for Serve, and it will provide more options for users. What I'm not sure about, though, is whether we should use the min/max of the averaged metrics sent from the replicas to the controller, which could be strange to reason about, or take the min/max of the raw, un-averaged request metrics. I think the latter might make more sense.
> Also, I think the fact that downscale_delay does not consider interim decisions is rather unintuitive.
By interim decisions I assume you mean the calculations that are made every control loop cycle for the desired number of replicas. I think what's confusing here is that although we are currently only using the "most recent decision", that decision is incorporating request metrics that were sampled over the past (e.g.) 30 seconds. So you can actually view it as we made decisions each time we sampled from the replicas, then averaged all those decisions for the final decision at the end of that 30-second window.
I do believe it is useful to separate downscale_delay_s and look_back_period_s, since that gives you more knobs to control. That is also how Kubernetes does it.
@zcin No worries!
Some good points.
> Hmm, if increasing look_back_period_s will make the autoscaler less sensitive to changes in metrics, wouldn't using a second-layer average window also affect the sensitivity to metrics? What is the difference?
I hadn't put much thought into the double-average case, as I tunnel-visioned on the other functions a bit. Fair point. However, the raw values are extremely volatile, which could result in huge upscaling/downscaling decisions when using max/min. It seems like there has to be some averaging before applying such functions.
Actually, I like being able to control the look_back_period as well as the functions implemented here, as an infinitely small look_back_period approaches the single average anyway. look_back_period is a smoothing operation over time, and we can apply our scaling functions on top of that.
I think there is a strong case for averaging before taking the min or max, meaning we would have to do it somewhere anyway.
> Ah, one more comment on (1). Increasing look_back_period_s also makes tuning target_num_ongoing_requests very difficult. Usually you can calculate a good value based on the processing time of one request, but as the look_back_period increases this relationship doesn't necessarily hold.
For this reason, and because it couples smoothing for upscaling and downscaling, I actually find that look_back_period should generally never be changed: it tends to throw off any otherwise balanced scaling parameters. It's important, but generally best left alone.
Bumping this again in hopes we can get it merged this month. I believe the tests failed for unrelated reasons, so I'll push a commit to have them re-run.
Bumping again. Happy to complete any work needed to get this merged ASAP, but I'm helpless without approval from a code owner.
> However, the raw values are extremely volatile, which could result in huge upscaling/downscaling decisions
> I think there is a strong case for averaging before taking the min or max
True, I think request metrics can be pretty volatile. My only concern is that if we ever extend autoscaling to include more metrics apart from request metrics, like CPU/memory resource usage, it's more standard not to do any averaging before applying the chosen aggregation function. And it's not as easy to reason about if we have a double layer of aggregation (min/max over an average).
Maybe we can try to do a simple test for both options, and see if removing the averaging will actually lead to a lot of volatility?
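Something like this toy comparison could be a starting point (purely illustrative, not Serve code):

```python
import math
import random

# Compare the two options: decide replicas from the max over raw sampled totals
# vs. the max over a moving average of those samples.
random.seed(0)
target_per_replica = 5
window = 10  # samples per smoothing window

# Simulate noisy total-ongoing-request samples around a mean of ~50.
samples = [max(0, int(random.gauss(50, 15))) for _ in range(120)]

def moving_average(xs, w):
    return [sum(xs[max(0, i - w + 1): i + 1]) / (i - max(0, i - w + 1) + 1)
            for i in range(len(xs))]

raw_decision = math.ceil(max(samples) / target_per_replica)
smoothed_decision = math.ceil(max(moving_average(samples, window)) / target_per_replica)

print("decision from raw max:     ", raw_decision)
print("decision from smoothed max:", smoothed_decision)
```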
@kyle-v6x Giving this some more thought, I am still leaning towards not having nested aggregation. I was previously thinking that the number of ongoing/queued requests at a single replica can be very volatile if we're just taking the min/max, since routing is random, but we should be getting the total number of requests, at any single point in time, across all replicas. That is simply the total number of requests being processed in the system, which should be much less volatile and simply indicative of the traffic being sent to the system and how well the system is handling it.
To implement this, we will have to change the way autoscaling metrics are collected. Currently, metrics are aggregated at the deployment handle first, and the aggregated results are sent to the controller. This was fine when the only aggregation method was averaging. However, to do min/max, we will have to send the raw metrics from the deployment handle to the controller, and perform the aggregation at the controller. Let me know what you think; if we go down this route I can help support you.
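To make that concrete, a minimal sketch of the shape I have in mind (hypothetical names, not the actual Serve internals; a real implementation would also need to time-align samples across handles to get system-wide totals before applying min/max):

```python
from collections import defaultdict
from typing import Callable, Dict, List

class ControllerMetricsStore:
    """Controller-side store of raw metrics, with a pluggable aggregation."""

    def __init__(self, aggregation_fn: Callable[[List[float]], float]):
        self._per_handle: Dict[str, List[float]] = defaultdict(list)
        self._aggregation_fn = aggregation_fn

    def push_metrics(self, handle_id: str, raw_ongoing_requests: List[float]) -> None:
        # Handles push lists of raw (un-averaged) samples instead of one average.
        self._per_handle[handle_id].extend(raw_ongoing_requests)

    def aggregate_and_clear(self) -> float:
        # The controller applies the chosen aggregation for this decision window.
        all_samples = [s for samples in self._per_handle.values() for s in samples]
        self._per_handle.clear()
        return self._aggregation_fn(all_samples) if all_samples else 0.0

store = ControllerMetricsStore(aggregation_fn=max)  # e.g. max instead of averaging
store.push_metrics("handle-a", [3, 7, 5])
store.push_metrics("handle-b", [2, 9])
print(store.aggregate_and_clear())  # -> 9
```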
@zcin Can you give an example of future hardware metric processing? Since look_back_period can be used to approximate the no-average case (when set close to 0), I think it would still lead to fairly intuitive scaling. However, it's not clear to me how such metrics would be aggregated and sent off to the autoscaler, since I'm not super familiar with the internal code.
Overall, whichever method you think is best and gives a clear path to merging these features, I'll try to make some time to implement! I think the choice now is to either add a simple solution and maintain full compatibility with current autoscaling settings, or do some major refactoring of the aggregation. If we do refactor the aggregation, it will be difficult to maintain current autoscaler settings 1-to-1 (the max of the average over look_back_period).
Whichever way we decide, I'm just happy to get this moving, as we could really use it now more than ever.
> Can you give an example of future hardware metric processing?

For example, upscaling when CPU or memory usage exceeds a certain amount. This is similar to the "resource requests" concept in Kubernetes.
@kyle-v6x Okay, if it works for your use case, then let's do a single layer of aggregation at the controller! We can define an experimental API for specifying min/max instead of averaging, and that can be swapped out in the controller. I've prototyped the part that requires switching to sending raw metrics from the handles to the controller: essentially, instead of keeping an in-memory metrics store at the handles, keep one per-handle at the controller, and push lists of raw metrics collected at the handles, instead of a single aggregated metric, from handle -> controller. See https://github.com/ray-project/ray/pull/49412 and let me know if this helps!
@zcin I have some core work that will take me to the end of the month, so I plan to work on this in early February. Thanks for the draft PR, it will be very helpful.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Still planning to work on this, but haven't had time.
Finally hopping back on this. I'll close this PR as it's no longer the planned implementation.