Unwanted scale-down during sustained load triggers an eventual explosion in replicas
In what area(s)?
/area autoscale
What version of Knative?
1.9.2
Expected Behavior
Under sustained load to the same service for 15-20 minutes, the service should remain operational with minimal fluctuation in pod/replica count.
(The service in question is extremely simple: it does very little work and does not incur any errors of its own.)
Actual Behavior
Under sustained load as part of a soak load test (~6,000 RPS to a trivial POST endpoint for ~30 minutes) we witness the following:
- The replica count grows to accommodate the incoming requests, as expected.
- The replica count remains mostly stable for 10-15 minutes and all traffic is served correctly.
- The autoscaler then slowly starts to cut back on 'desired pods' for the service, for no apparent reason; the traffic is the same as before.
- This eventually results in a chain reaction: the number of replicas is now too low to service the volume of traffic, the remaining replicas can't cope (one of the pods' queue proxies actually OOMs and exits), and the autoscaler enters panic mode and boosts the replica count to a suddenly high value (in our tests this went from 2-3 pods to 54).
Exact time points for the above graphs:
- Test starts at 10:00.
- The Desired and Wanted pods are cut from 4 to 3 by 10:16:21.122.
- The Excess Burst Capacity is logged as deficient (-19) in the autoscaler at 10:16:25.078.
- One of the replicas' queue proxies OOMs and exits at 10:16:26 (having had a steep increase in memory from 10:16:21 onward, when the pod count was cut).
- The Actual Pods count drops to 2 (observed at 10:16:28).
- Panic mode is entered at 10:16:29.078 with an observed excess burst capacity of -230, and the wanted pods is set to 7.
- Malfunction then occurs, with the wanted pods continuing to increase over time as the observed EBC has not improved.
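For context on the EBC numbers above, here is a rough sketch of the KPA arithmetic as I understand it. The formulas and the defaults (target-burst-capacity, utilization percentage, panic threshold) are assumptions from my reading of the autoscaler, not figures taken from this test:

```python
import math

def kpa_sketch(stable_concurrency, panic_concurrency, ready_pods,
               container_concurrency_target=100.0,  # default concurrency target of 100
               utilization=0.70,                    # assumed default container-concurrency-target-percentage
               target_burst_capacity=211.0,         # assumed default target-burst-capacity
               panic_threshold=2.0):                # assumed default panic-threshold-percentage / 100
    # Per-pod target actually used for the pod-count math.
    effective_target = container_concurrency_target * utilization

    desired_stable = math.ceil(stable_concurrency / effective_target)
    desired_panic = math.ceil(panic_concurrency / effective_target)

    # Excess burst capacity (my reading): headroom left after reserving the
    # burst buffer. When this goes negative, traffic is routed via the activator.
    total_capacity = ready_pods * container_concurrency_target
    ebc = math.floor(total_capacity - target_burst_capacity - panic_concurrency)

    # Panic mode: the short (panic) window wants at least 2x the ready pods.
    panicking = desired_panic >= panic_threshold * ready_pods
    return desired_stable, desired_panic, ebc, panicking
```

If that reading is right, EBC is driven by ready pods times per-pod capacity, so removing a single pod while the burst buffer is already thin, and then piling the displaced concurrency onto the survivors, would be consistent with the swing from -19 to -230 seen above.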
Steps to Reproduce the Problem
We are running Knative 1.9.2 with net-istio. Apart from boosting the resources for the activator/autoscaler, we are using the default configuration for all components, except for the following settings in the autoscaler ConfigMap:
max-scale-down-rate: "1.05"
scale-down-delay: 5m
The service is also scaled from zero.
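As a side note on what those two settings permit (this is my reading of the autoscaler's rate limiting, so treat the formula as an assumption): max-scale-down-rate bounds how far a single scaling decision can drop the pod count relative to the currently ready pods, and scale-down-delay only postpones the drop.

```python
import math

def scale_down_floor(ready_pods: int, max_scale_down_rate: float) -> int:
    # Assumed form of the autoscaler's per-decision lower bound:
    # desired pods are clamped to floor(ready / max-scale-down-rate).
    return math.floor(ready_pods / max_scale_down_rate)

# With max-scale-down-rate: "1.05" and 4 ready pods the floor is 3,
# so the 4 -> 3 cut seen in the timeline is allowed in a single step.
print(scale_down_floor(4, 1.05))  # 3
```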
We have seen this type of behaviour (replica/pod explosions) several times previously, although this is the first controlled test where we have reportable metrics and timestamps we can share.
@DavidR91 hi, I will try to reproduce. Could you also paste/attach your logs from the autoscaler side with debug enabled? I am looking for statements like: "Delaying scale to 0, staying at X".
Attached an autoscaler log in debug. This is a less dramatic scale-down than the one described above, but I think it is still a valid repro: export.csv
The log starts at the point just after a load spike switches into a load 'soak' for 30 minutes at ~6k RPS (although the full 30 minutes are not included).
Notably, ~7:51:40 is the point just after a pod is removed, where request durations spike upward as a result (which coincides with a scale from 8 to 7 in the log). Charts are in UTC, so 8:51 is the relevant time in the load test chart below.
So I've been attempting to debug this, and not really finding much of use.
Here I added my own scraper to get the stat.proto values from individual pods off :9090 of the queue-proxy, and plotted the reported concurrency for each pod over time:
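In case it helps anyone reproducing this, a minimal version of that kind of scraper looks roughly like the sketch below. Port 9090 is the queue-proxy metrics port mentioned above; the /metrics path, the Prometheus text format, and the queue_average_concurrent_requests metric name are assumptions (it reads the Prometheus-formatted output rather than decoding stat.proto directly), and the pod IPs are placeholders:

```python
import time
import urllib.request

# Placeholder pod IPs for the revision's pods; in practice these come from
# the Kubernetes API or `kubectl get pods -o wide`.
POD_IPS = ["10.0.0.11", "10.0.0.12"]

# Assumed metric name in the queue-proxy's Prometheus output; adjust to
# whatever your :9090/metrics endpoint actually reports.
METRIC = "queue_average_concurrent_requests"

def scrape(ip):
    """Fetch the queue-proxy metrics port (9090) and pull out one gauge value."""
    with urllib.request.urlopen(f"http://{ip}:9090/metrics", timeout=2) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith(METRIC):
                return float(line.rsplit(" ", 1)[-1])
    return None

while True:
    stamp = time.strftime("%H:%M:%S")
    for ip in POD_IPS:
        try:
            print(stamp, ip, scrape(ip))
        except OSError as err:
            print(stamp, ip, "scrape failed:", err)
    time.sleep(1)
```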
Request volume does go up slightly at 16:42, and the request concurrency does too, and yet the pod count is decreased at this time.
The effect can be manipulated with scale-down delay, stable window time, etc., but it doesn't completely go away: after an initial panic there will eventually be a scale-down, even in the middle of consistent load, and the time configs only delay it.
So I have a few questions about concurrency, since maybe we're just misusing it?
- What does it actually mean? Is it totally unitless, or is it analogous to RPS? Our target is always the default 100 no matter the service; is this incorrect/unworkable?
- If we switch to RPS but increase the RPS target to e.g. 500 (so 70% utilization brings it to 350), the pods stick around for the entire run and do not scale down at all. Should we be setting concurrency to match these proportions (500.0)?
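For comparison, here is a back-of-the-envelope sketch of how the two targets relate, assuming Little's law (concurrency ≈ RPS × average latency) applies, and using a purely illustrative latency figure rather than anything measured in these tests:

```python
import math

rps = 6000.0          # sustained load from the soak test
avg_latency_s = 0.05  # ASSUMED average request latency (50 ms), for illustration only

# Little's law: in-flight requests ~= arrival rate * average time in system.
concurrency = rps * avg_latency_s  # ~300 concurrent requests

# Concurrency target: default 100 per pod at 70% utilization -> 70 effective.
pods_concurrency = math.ceil(concurrency / (100 * 0.7))  # ceil(300 / 70) = 5

# RPS target: 500 per pod at 70% utilization -> 350 effective.
pods_rps = math.ceil(rps / (500 * 0.7))  # ceil(6000 / 350) = 18

print(pods_concurrency, pods_rps)
```

Under those assumed numbers the RPS target of 500 asks for far more pods per unit of load than the default concurrency target, which might explain why the RPS-based run never scaled down.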
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.