Is there a way to specify a minimum, non-zero scaling value while keeping scale-to-zero behavior?
In what area(s)?
/area autoscale
Other classifications:
/kind good-first-issue /kind process /kind spec
Ask your question here:
We have a few very "bursty" services that get zero traffic for a while and then get 100s of concurrent requests.
What I would like is to be able to do the following:
- Enable scale-to-zero for that service
- When the service has a non-zero scale, it is at least some minimum value (say, 10).
As far as I understand, setting minScale to 10 prevents scale to zero, and initialScale only applies on first deployment.
Is there some other way (directly via config) or pattern (i.e., a workaround) for doing this?
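For reference, here's a rough sketch of what I mean with those two annotations (the Service name and image are placeholders):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: bursty-service                      # placeholder name
spec:
  template:
    metadata:
      annotations:
        # minScale keeps at least 10 pods at all times, i.e. it disables scale to zero:
        autoscaling.knative.dev/minScale: "10"
        # initialScale only affects the scale right after the Revision is created,
        # not later scale-from-zero events:
        autoscaling.knative.dev/initialScale: "10"
    spec:
      containers:
        - image: example.com/bursty-app:latest   # placeholder image
```

Neither annotation expresses "either 0 or at least 10".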
Thanks!
Interesting! We don't support this currently, as you've found out, but it is an interesting thought.
Do you see too long of a delay in scaling up to the required scale in your workloads? If all requests come in at once, I'd expect us to scale to, say, 10 almost immediately. If you have a ramping workload, that'd be different of course.
I would need to set aside time to confirm the behavior in the scale-from-zero case (though you might know just from your understanding of the architecture), but in the "warm" case after a burst (where we scale back down to 1 for 15 minutes), we effectively get a cold start that's potentially even worse than the true cold case.
The situation in that case is that we have 1 pod sitting around available to immediately serve requests.
However, we have containerConcurrency: 1 and requests take ~3 seconds to complete. Therefore, since the panic scaling window only kicks in after 10 seconds, we have most requests (~40 at a time) blocking for the full 10 seconds + pod spin up time.
Since the panic window is globally configured, I don't want to change its value. Therefore it seems like a reasonable solution to this problem is to keep the floor # of pods at some value > 1.
Hmm, that sounds kind of odd though. The panic window is 6s by default, but that doesn't mean scaling only happens after 6s; it means the historic data the decision is based on is at most 6s old. As such, especially with an aggressive containerConcurrency setting, I would expect the workload to scale almost instantaneously.
It'd be awesome to get a reproducer for this behavior if possible, in case we do indeed have a bug here. Could you share the Knative Service YAML that you're using? Any other settings tweaked?
The panic window is 6s by default, but that doesn't mean scaling only happens after 6s; it means the historic data the decision is based on is at most 6s old.
Ah yep. 10% != 10s. And also duh, re: how the windowing works.
I doubt there's a "bug" here so much as a workload-specific thing (traffic pattern, large Docker image that doesn't seem to cache well). I definitely owe a repro if this is going to turn into a full-fledged feature request, but I did want to see if folks had encountered this or could reason through the behavior offhand (or if there was a configuration variable I was unaware of).
(Also we're on 0.18, but afaict nothing between 0.18 and tip changes this behavior).
Yeah, if there's any way to repro with a script where you simulate your traffic bursts, that'd be helpful.
+1 for the feature. We are using Knative for our production workload at @gojek. However, we don't use the scale-to-zero feature, since we want to guarantee that more than one replica of an application is available when it is serving traffic.
I think @danieltahara's case was about avoiding situations between 0 and (for example) 10. If you want to ensure there are a minimum of 10 replicas, I think the autoscaling.knative.dev/minScale annotation should work for you.
With respect to the original bug report, it would be good to get a repro case and see how much scaling delay we're getting in that scenario. Maybe a simple repro case would be:
- Container of any size that takes 10s to process a request (e.g. see the autoscale test image with the `?sleep=10s` parameter)
- ContainerConcurrency=1
- Issue 100 simultaneous requests; count the time from the first and last request issued (to rule out delay on the client) until all requests are served(?) (see the sketch below)
Our expectation would be that the duration of the test would be <20s with 100 Pods spawned, correct?
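A rough sketch of that repro as a Knative Service (the image reference is a placeholder standing in for the autoscale test image, which is assumed to honor a ?sleep query parameter):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-repro                     # placeholder name
spec:
  template:
    spec:
      containerConcurrency: 1               # one in-flight request per pod
      containers:
        - image: example.com/autoscale-test:latest   # placeholder for the autoscale test image
```

Then issue 100 simultaneous requests against the Service URL with ?sleep=10s appended (e.g. with a load generator such as `hey -c 100 -n 100`) and measure from the first request sent until the last response is received.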
Yes I'd expect us to hit 100 pods requested (deployed is another story :sweat_smile: ) almost immediately.
avoiding situations between 0 and (for example) 10
@evankanderson That's exactly what I think the ideal state is, too. So if the service is not serving traffic, the number of replicas should be 0. But if there is traffic, it has to be at least x, where x can be configurable, similar to autoscaling.knative.dev/minScale. This avoids the scenario where my service has only one replica, which could cause disruption if that replica becomes nonoperational, for example due to being rescheduled by Kubernetes.
I do use autoscaling.knative.dev/minScale currently, but then scale-to-zero is not triggered even when there is no traffic.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Adding an extra use case for this that I've seen a couple of times, and recently in this Slack thread: there are cases - especially bursty eventing workloads - where at a low number of instances we struggle to handle the load at all, causing the autoscaler to react by dramatically over-shooting the needed number of pods. Once we've scaled up way too far, we notice concurrency is now much too low, and we *under-shoot* back down to a number of pods that can't handle the load at all. This causes concurrency to shoot up (because requests are backing up) and we react by dramatically over-scaling again. Rinse, repeat. (Another example)
We currently don't have a great way of dealing with these use cases: scale-down-delay helps a bit, but eventually still wears off, and similarly tweaking the max-scale-down rate helps a bit but still ends the same way. Having an explicit "if there's any load, give me at least N instances to handle it" is maybe not ideal (it'd be nicer if the autoscaler were more magic, obviously, but short of something more stateful it's difficult to see how it can be), but it would address these use cases that we currently struggle with in a relatively low-hanging-fruit way.
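For concreteness, the two partial mitigations mentioned above are knobs in the config-autoscaler ConfigMap; a sketch with purely illustrative values:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Keep capacity around for a while after load drops; softens the first dip
  # but eventually wears off:
  scale-down-delay: "15m"          # illustrative value
  # Cap how quickly the autoscaler may shrink the deployment per decision:
  max-scale-down-rate: "1.1"       # illustrative value
```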
I think this is worth thinking about, and I might even take a crack at it if no-one hates the idea.
/reopen /remove-lifecycle stale
@julz: Reopened this issue.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/reopen
@psschwei: Reopened this issue.
/remove-lifecycle stale
I'm going to poke around with this some.
/assign
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/lifecycle frozen
@psschwei any news on this one?
I got pulled into some other things and haven't had a chance to look into this yet... still on my list, but if you (or anyone else) wants to take it I don't mind letting it go...
It is fine, it just came up in internal discussions. If I come up with something I will ping you :)
Leaving this open since we want to add an e2e test
/reopen
Going to close this out - e2e PR is open