
Settings for a "conservative" panic mode with slow starting workloads.

Open gm2552 opened this issue 3 years ago • 10 comments

/area autoscale

Describe the feature

I have a reference application made up of multiple workloads/components, where one workload periodically sends out a batch of events (possibly thousands) over a RabbitMQ exchange, with a RabbitMQ Source listening on the other side. The RabbitMQ Source then routes the events to a “processor” workload.

The “processor” workload is a Java application and is configured to scale to zero. The Java app currently cannot be compiled into a native image, so there is a decent amount of overhead before the application can start accepting HTTP requests (roughly 20-30 seconds including container creation time and all other necessary initialization). The problem is that Knative's panic mode is not aware of this startup overhead; with the backlog of events growing much faster than capacity can be created, Knative quickly schedules a large number of pods, degrading the entire cluster.

I’ve been able to “control” the issue via “maxScale”, but this feels like a dirty solution, and it seems there is (or should be) a better way to control capacity creation from zero (or perhaps even to flow-control the source).

This request is to provide some type of tuning or configuration for the autoscaler's "panic mode" when starting from 0 instances, so that the autoscaler does not get overly aggressive when scheduling pods.

gm2552 avatar Sep 06 '22 19:09 gm2552
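
For context, the workaround described above, plus the closest existing "panic" knobs, can be expressed as per-revision autoscaling annotations. This is only a sketch, assuming a hypothetical service named processor and purely illustrative values; none of these settings is aware of startup time, which is the gap this issue asks to close:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: processor                              # hypothetical name for the slow-starting workload
spec:
  template:
    metadata:
      annotations:
        # Cap how far the autoscaler may scale out (the "maxScale" workaround).
        autoscaling.knative.dev/max-scale: "10"
        # Existing panic-mode knobs: widen the panic window and raise the
        # threshold so short bursts are less likely to trigger panic scaling.
        autoscaling.knative.dev/panic-window-percentage: "20.0"
        autoscaling.knative.dev/panic-threshold-percentage: "300.0"
    spec:
      containers:
        - image: example.com/processor:latest  # placeholder image
```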

Investigating this issue

nader-ziada avatar Sep 06 '22 19:09 nader-ziada

We should be able to reproduce this with just Serving:

  1. A client which sends, e.g., 200 requests to an app at once.
  2. A Go server which does a time.Sleep(delayFlag * time.Second) before starting the HTTP server (a sketch of such a Service follows below).

evankanderson avatar Sep 06 '22 19:09 evankanderson
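
A Serving-only reproduction along these lines could be deployed as a plain Knative Service. This is only a sketch: the image and the DELAY environment variable are placeholders for a small Go server that sleeps before opening its listen socket, not a published image:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: delayed
spec:
  template:
    spec:
      containers:
        - image: example.com/delayed:latest   # placeholder: server that sleeps before serving
          env:
            - name: DELAY                     # hypothetical flag for the artificial startup delay
              value: "30s"
```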

A few thoughts:

  1. If we're currently scaled to zero, we don't know how long it takes to actually respond to an in-flight request. I don't know whether we could start, in this case, with some assumed response time (like 0.1s or 1s) to help escape panic mode.
  2. In the scale-from-zero case, there's no throughput until at least one pod is live, and we might not have an estimate on the time-from-zero-to-serving. We may need to either special-case here, or treat pods being started as "live" with some capacity to keep from scaling wildly out of control.

Separately, from talking with Greg, it sounds like his pods didn't have CPU/memory limits, so a reasonable maxScale would have been the number of nodes (more specifically, node capacity / pod limits, but without limits this devolves to the number of nodes).

evankanderson avatar Sep 06 '22 19:09 evankanderson

I'm thinking of something similar to scale-down-delay, but for scaling up: a user with a slow-starting service like this could set a scale-up-delay config to allow for some startup time before any scaling decisions are made. What do you all think?

adding @psschwei for feedback as well

nader-ziada avatar Sep 08 '22 14:09 nader-ziada
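
For reference, the existing knob this proposal mirrors is the scale-down-delay key in the config-autoscaler ConfigMap (also settable per revision via the autoscaling.knative.dev/scale-down-delay annotation). The scale-up-delay key below does not exist today; it is sketched here only to illustrate the proposed shape:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Existing: wait this long before acting on a scale-down decision.
  scale-down-delay: "5m"
  # Proposed (not implemented): wait before acting on scale-up decisions,
  # giving slow-starting pods time to become ready before more are scheduled.
  scale-up-delay: "30s"
```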

@nader-ziada Would this be a special case for scaling up from 0, a case for panic mode (even for non-zero instances), or global to all cases?

gm2552 avatar Sep 08 '22 15:09 gm2552

I was thinking for all cases

nader-ziada avatar Sep 08 '22 15:09 nader-ziada

Thinking out loud, but would adding an initial delay on the readiness probe help here?

psschwei avatar Sep 08 '22 16:09 psschwei

Thinking out loud, but would adding an initial delay on the readiness probe help here?

I think this could help with scaling up, but it would surface more errors while the pod is not yet reported ready

nader-ziada avatar Sep 08 '22 19:09 nader-ziada
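
For concreteness, the readiness-probe idea would look roughly like the following on the Service spec. Whether it actually helps is exactly what is being questioned above; the health path is hypothetical and the 30-second delay simply matches the reported JVM startup time:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: processor                              # hypothetical name
spec:
  template:
    spec:
      containers:
        - image: example.com/processor:latest  # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz                   # hypothetical health endpoint
            # Delay the first probe to roughly match the known startup time.
            initialDelaySeconds: 30
```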

Played around with this a bit this afternoon using Serving v1.7.1 on minikube and https://github.com/psschwei/delayed (which is what @evankanderson recommended for reproducing), with hey for load testing, i.e. hey -n 1000 -c 100 -t 0 $(kn service describe delayed -ourl), scaling from zero each time:

parameters            number of pods
-n 1000 -c 100        2
-n 1000 -c 1000       15
-n 10000 -c 1000      15
-n 100000 -c 1000     15
-n 1000000 -c 1000    15-19

Note: there's little rhyme or reason to the values I selected; it was mostly just "let's try some big numbers".

Here's what I noticed:

  • all the pods are created within a second or two of when the hey command is run
    • with the exception of -n 1000000 -c 1000, where there were 2-3 scale up -> scale down -> scale up cycles
  • in the -n 10000 -c 1000 case, there was only 1 panic window that lasted for about 1m30s
  • pods don't get ready until we are listening and serving (the user-container is ready, but the queue-proxy is not until the sleep is over)

Tweaking a few other things

  • still panicked when using an initialDelaySeconds on the readiness probe, so not sure this would help
  • even with min-scale=1, still panicked with 10K requests (though interestingly the min-scale pod handled them all before any other pods were ready)

I didn't have time today to look into load balancing, but part of me wonders if there are some knobs we could turn here (for example, activator capacity?) to ameliorate this situation...

psschwei avatar Sep 09 '22 21:09 psschwei
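
Two of the knobs touched on in this experiment can be set per revision. The values below are illustrative only; target-burst-capacity is mentioned here because setting it to -1 keeps the activator in the request path at all times, where it can buffer requests while new pods start:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: delayed                                # hypothetical name
spec:
  template:
    metadata:
      annotations:
        # Keep one pod warm so a burst initially lands on a ready instance.
        autoscaling.knative.dev/min-scale: "1"
        # -1 keeps the activator in the request path at all times.
        autoscaling.knative.dev/target-burst-capacity: "-1"
    spec:
      containers:
        - image: example.com/delayed:latest    # placeholder image
```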

You can use maxScale to avoid overshoot, but my $0.02 is that it shouldn't be necessary.

(We should also improve event-sending components to try to do congestion control, but that's not necessarily feasible for e.g. container or cron sources)

evankanderson avatar Sep 09 '22 22:09 evankanderson

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Dec 09 '22 01:12 github-actions[bot]

This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

knative-prow-robot avatar Jan 08 '23 02:01 knative-prow-robot

Hi, is there any conclusion on this? I have exactly the same scenario (with the same 30s pod startup time). I also wondered whether something like scale-up-delay would help. I see the tick-interval setting, but setting it to e.g. 35s wouldn't help; it would slow the reaction time down too much. Would either a scale-up-delay or a scaling-event-frequency parameter make sense for a general audience?

alaneckhardt avatar Feb 03 '23 15:02 alaneckhardt

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar May 05 '23 01:05 github-actions[bot]