
Unwanted scale-down during sustained load triggers an eventual explosion in replicas

Open · DavidR91 opened this issue on Apr 15, 2024 · 4 comments

In what area(s)?

/area autoscale

What version of Knative?

1.9.2

Expected Behavior

Under sustained load for 15-20 minutes to the same service, the service should remain operational with minimal fluctuation in pod / replica count.

(The service in question is extremely simple and does very little work and does not incur any errors of its own)

Actual Behavior

Under sustained load as part of a soak load test (~6,000 RPS to a trivial POST for ~30 minutes) we witness the following:

  • The replica count grows to accommodate the incoming requests, as expected

  • The replica count remains mostly stable for 10-15 minutes and all traffic is served correctly

  • Slowly, the autoscaler starts to cut back on 'desired pods' for the service - for seemingly no reason. The traffic is still the same as before

  • This eventually results in a kind of chain reaction, where the number of replicas is now too low to service the volume of traffic. The remaining replicas can't cope (one of the pods' queue proxies actually OOMs and exits), and the autoscaler enters panic and boosts the replica count to a suddenly high value (in our tests this went from 2-3 pods to 54)

[Graphs: autoscaler desired/actual pod counts and excess burst capacity over the test run]

Exact time points for the above graphs:

  • Test starts at 10:00
  • The Desired and Wanted pods are cut from 4 to 3 by 10:16:21.122
  • The Excess Burst Capacity is logged as deficient (-19) in the autoscaler at 10:16:25.078
  • One of the replicas' queue proxies OOMs and exits at 10:16:26 (having had a steep increase in memory from 10:16:21 onward when the pod count was cut)
  • The Actual Pods drops to 2 (observed at 10:16:28)
  • Panic mode is entered at 10:16:29.078 with an observed excess burst capacity of -230, and the wanted pod count is set to 7
  • The malfunction then compounds, with the wanted pod count continuing to increase over time because the observed EBC does not improve (see the note after this list for how EBC appears to be computed)
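For readers trying to interpret the EBC numbers above: as I understand it from the docs and a read of the autoscaler code (so treat this as a sketch, not the authoritative formula), excess burst capacity is roughly the total capacity of the ready pods minus the configured target-burst-capacity and the currently observed concurrency, and the activator is put back on the data path once it goes negative. The numbers below are hypothetical and are not taken from our test:

// Sketch of the excess-burst-capacity calculation as I understand it; the real
// implementation lives in the Knative autoscaler and may differ in detail.
package main

import (
	"fmt"
	"math"
)

func excessBurstCapacity(readyPods, perPodCapacity, targetBurstCapacity, observedConcurrency float64) float64 {
	totalCapacity := readyPods * perPodCapacity
	return math.Floor(totalCapacity - targetBurstCapacity - observedConcurrency)
}

func main() {
	// Hypothetical numbers: 3 ready pods with capacity for 100 concurrent requests
	// each, a target-burst-capacity of 200, and ~119 observed concurrent requests.
	fmt.Println(excessBurstCapacity(3, 100, 200, 119)) // prints -19: capacity is deficient
}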

Steps to Reproduce the Problem

We are running Knative 1.9.2 with net-istio. Apart from boosting the resources for the activator and autoscaler, we are using the default configuration for all components, except for the following settings in the autoscaler ConfigMap:

max-scale-down-rate: "1.05"
scale-down-delay: "5m"
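For completeness, this is how those two settings sit in the autoscaler ConfigMap (a minimal sketch; every other key is left at its default):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  max-scale-down-rate: "1.05"
  scale-down-delay: "5m"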

The service is also scaled from zero.

We have seen this type of behaviour (replica/pod explosions) several times previously, although this is the first controlled test where we have reportable metrics and timestamps we can share.

DavidR91 avatar Apr 15 '24 12:04 DavidR91

@DavidR91 hi, I will try to reproduce. Could you also paste/attach your logs from the autoscaler side with debug enabled? I am looking for statements like: "Delaying scale to 0, staying at X".
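For reference, autoscaler debug logging can be turned on via the config-logging ConfigMap; a minimal sketch, assuming the usual per-component loglevel keys (double-check the key name against your release):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: knative-serving
data:
  # Per-component override; other components keep the level from zap-logger-config.
  loglevel.autoscaler: "debug"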

skonto avatar Apr 26 '24 11:04 skonto

Attached an autoscaler log in debug. This is a less dramatic scale-down than mentioned above, but I think still a valid repro: export.csv

The log starts at the point just after a load spike switches into a load 'soak' for 30 minutes at ~6k RPS (although the full 30 minutes are not included).

Notably, ~7:51:40 is the point just after a pod is removed, where request durations spike upward as a result (this coincides with a scale from 8 to 7 in the log):

(Charts are in UTC so 8:51 is the relevant time below):

[Chart: autoscaler metrics]

[Chart: load test]

DavidR91 avatar May 01 '24 08:05 DavidR91

So I've been attempting to debug this, and not really finding much of use.

Here I added my own scraper to get the stat.proto values from individual pods via the queue-proxy's :9090 port, and plotted the reported concurrency (among other stats) for each pod over time:

[Chart: per-pod reported concurrency over time]
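For anyone who wants to reproduce this kind of per-pod scrape, a minimal sketch. Assumptions on my part (not verified against the Knative source): the queue-proxy serves its autoscaler stats at /metrics on port 9090 and returns a serialized Stat message when asked for application/protobuf; decoding the payload would use the Stat type from knative.dev/serving/pkg/autoscaler/metrics, which is omitted here:

// Fetch the raw autoscaler stats payload from a single pod's queue-proxy.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	podIP := "10.0.0.12" // hypothetical pod IP; list the revision's pods to get real ones

	req, err := http.NewRequest(http.MethodGet, fmt.Sprintf("http://%s:9090/metrics", podIP), nil)
	if err != nil {
		log.Fatal(err)
	}
	// The autoscaler's own scraper asks for protobuf; without this header the
	// endpoint may respond differently depending on the Knative version.
	req.Header.Set("Accept", "application/protobuf")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%s content-type=%s bytes=%d\n",
		resp.Status, resp.Header.Get("Content-Type"), len(body))
	// Next step (not shown): unmarshal body into the Stat proto and plot
	// AverageConcurrentRequests / RequestCount per pod over time.
}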

Request volume does go up slightly at 16:42, and the request concurrency does too, and the pod count is decreased at this time.

The effect can be manipulated with scale-down delay, stable window time, etc., but it doesn't completely go away: after an initial panic there will eventually be a scale-down, even in the middle of consistent load, and the time configs only delay it.
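For reference, these are the per-revision knobs I mean (a sketch with placeholder values; the annotation names are what I believe Knative uses, so double-check them against the docs for your version):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: soak-test-service              # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "120s"          # stable window
        autoscaling.knative.dev/scale-down-delay: "5m"  # per-revision scale-down delay
        autoscaling.knative.dev/target: "100"           # per-pod target (concurrency by default)
        # autoscaling.knative.dev/metric: "rps"         # switch the scaling metric to RPS
    spec:
      containers:
        - image: example.com/soak-test:latest           # hypothetical image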

So I have a few questions about concurrency, since maybe we're just misusing it:

  • What does it actually mean? Is it totally unitless, or is it analogous to RPS? Our target is always the default 100 no matter the service; is this incorrect/unworkable?
  • If we switch to RPS but increase the RPS target to e.g. 500 (so 70% utilization brings it to 350), the pods stick around for the entire run and do not scale down at all - should we be setting concurrency to match these proportions (500.0)? (A rough worked example of the sizing math follows below.)
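To make the concurrency/RPS relationship concrete, here is my rough mental model (a sketch based on the scaling docs plus Little's law, with entirely hypothetical latency numbers; treat the exact formula as an assumption rather than the autoscaler's actual code):

// Rough sizing model: concurrency ≈ RPS × average request duration (Little's law),
// and the KPA wants roughly observedValue / (target × utilization) pods.
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		rps           = 6000.0 // sustained request rate from the soak test
		avgLatencySec = 0.050  // hypothetical 50 ms average request duration
		target        = 100.0  // container-concurrency-target-default
		utilization   = 0.70   // container-concurrency-target-percentage
	)

	observedConcurrency := rps * avgLatencySec // ≈ 300 in-flight requests
	desiredPods := math.Ceil(observedConcurrency / (target * utilization))

	fmt.Printf("observed concurrency ≈ %.0f, desired pods ≈ %.0f\n",
		observedConcurrency, desiredPods)
	// With these numbers: 300 / 70 → 5 pods. Small shifts in measured latency or
	// concurrency move this across an integer boundary, which is one way a
	// "steady" load can still end up with a scale-down.
}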

DavidR91 avatar May 20 '24 17:05 DavidR91

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Aug 19 '24 01:08 github-actions[bot]