Is there a way to specify a minimum, non-zero scaling value while keeping scale-to-zero behavior?
In what area(s)?
/area autoscale
Other classifications:
/kind good-first-issue /kind process /kind spec
Ask your question here:
We have a few very "bursty" services that get zero traffic for a while and then get 100s of concurrent requests.
What I would like is to be able to do the following:
- Enable scale-to-zero for that service
- When the service has a non-zero scale, it is at least some minimum value (say, 10).
As far as I understand, setting minScale to 10 prevents scale to zero, and initialScale only applies on first deployment.
Is there some other way (directly via config) or pattern (i.e., a workaround) for doing this?
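For reference, here's a rough sketch of what I mean with those two annotations (the Service name and image are placeholders):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: bursty-service                      # placeholder name
spec:
  template:
    metadata:
      annotations:
        # minScale keeps at least 10 pods at all times, i.e. it disables scale to zero:
        autoscaling.knative.dev/minScale: "10"
        # initialScale only affects the scale right after the Revision is created,
        # not later scale-from-zero events:
        autoscaling.knative.dev/initialScale: "10"
    spec:
      containers:
        - image: example.com/bursty-app:latest   # placeholder image
```

Neither annotation expresses "either 0 or at least 10".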
Thanks!
Interesting! We don't support this currently, as you've found out, but it is an interesting thought.
Do you see too long of a delay in scaling up to the required scale in your workloads? If all requests come in at once, I'd expect us to scale to, say, 10 almost immediately. If you have a ramping workload, that'd be different of course.
I would need to set aside time to confirm the behavior in the scale-from-zero case (though you might know just from your understanding of the architecture), but in the "warm" case after a burst (where we scale back down to 1 for 15 minutes), we effectively get a cold start that's potentially even worse than the true cold case.
The situation in that case is that we have 1 pod sitting around available to immediately serve requests.
However, we have containerConcurrency: 1 and requests take ~3 seconds to complete. Therefore, since the panic scaling window only kicks in after 10 seconds, we have most requests (~40 at a time) blocking for the full 10 seconds + pod spin up time.
Since the panic window is globally configured, I don't want to change its value. Therefore it seems like a reasonable solution to this problem is to keep the floor # of pods at some value > 1.
Hmm, that sounds kind of odd though. The panic window is 6s by default, but that doesn't mean scaling only happens after 6s; it means the historic data the decision is based on is at most 6s old. As such, especially with an aggressive containerConcurrency setting, I would expect the workload to scale almost instantaneously.
It'd be awesome to get a reproducer for this behavior if possible, in case we do indeed have a bug here. Could you share the Knative Service YAML that you're using? Any other settings tweaked?
The panic window is 6s by default, but that doesn't mean scaling only happens after 6s; it means the historic data the decision is based on is at most 6s old.
Ah yep. 10% != 10s. And also duh, re: how the windowing works.
I doubt there's a "bug" here so much as a workload-specific thing (traffic pattern, large Docker image that doesn't seem to cache well). I definitely owe a repro if this is going to turn into a full-fledged feature request, but I did want to see if folks had encountered this or could reason through the behavior offhand (or if there was a configuration variable I was unaware of).
(Also we're on 0.18, but afaict nothing between 0.18 and tip changes this behavior).
Yeah, if there's any way to repro with a script where you simulate your traffic bursts, that'd be helpful.
+1 for the feature. We are using Knative for our production workload at @gojek. However, we don't use the scale-to-zero feature, since we want to guarantee that more than one replica of an application is available when it is serving traffic.
I think @danieltahara's case was about avoiding situations between 0 and (for example) 10. If you want to ensure there are a minimum of 10 replicas, I think the autoscaling.knative.dev/minScale annotation should work for you.
With respect to the original bug report, it would be good to get a repro case and see how much scaling delay we're getting in that scenario. Maybe a simple repro case would be:
- Container of any size that takes 10s to process a request (e.g. see the autoscale test image with the `?sleep=10s` parameter)
- ContainerConcurrency=1
- Issue 100 simultaneous requests; count the time from the first and last request issued (to rule out delay on the client) until all requests are served(?) (see the sketch below)
Our expectation would be that the duration of the test would be <20s with 100 Pods spawned, correct?
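A rough sketch of that repro as a Knative Service (the image reference is a placeholder standing in for the autoscale test image, which is assumed to honor a ?sleep query parameter):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-repro                     # placeholder name
spec:
  template:
    spec:
      containerConcurrency: 1               # one in-flight request per pod
      containers:
        - image: example.com/autoscale-test:latest   # placeholder for the autoscale test image
```

Then issue 100 simultaneous requests against the Service URL with ?sleep=10s appended (e.g. with a load generator such as `hey -c 100 -n 100`) and measure from the first request sent until the last response is received.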
Yes I'd expect us to hit 100 pods requested (deployed is another story :sweat_smile: ) almost immediately.
avoiding situations between 0 and (for example) 10
@evankanderson That's exactly what I think the ideal state is, too. So if the service is not serving traffic, the number of replicas should be 0. But if there is traffic, it has to be at least x, where x can be configurable, similar to autoscaling.knative.dev/minScale. This avoids the scenario where my service has only one replica, which could cause disruption if that replica becomes nonoperational, for example due to being rescheduled by Kubernetes.
I do use autoscaling.knative.dev/minScale currently, but then scale-to-zero is not triggered even when there is no traffic.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Adding an extra use case for this that I've seen a couple of times, and recently in this Slack thread: there are cases - especially bursty eventing workloads - where at a low number of instances we struggle to handle the load at all, causing the autoscaler to react by dramatically over-shooting the needed number of pods. Once we've scaled up way too far, we notice concurrency is now much too low, and we *under-shoot* back down to a number of pods that can't handle the load at all. This causes concurrency to shoot up (because requests are backing up) and we react by dramatically over-scaling again. Rinse, repeat. (Another example)
We currently don't have a great way of dealing with these use cases: scale-down-delay helps a bit, but eventually still wears off, and similarly tweaking the max-scale-down rate helps a bit but still ends the same way. Having an explicit "if there's any load, give me at least N instances to handle it" is maybe not ideal (it'd be nicer if the autoscaler were more magic, obviously, but short of something more stateful it's difficult to see how it can be), but it would address these use cases that we currently struggle with in a relatively low-hanging-fruit way.
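For concreteness, the two partial mitigations mentioned above are knobs in the config-autoscaler ConfigMap; a sketch with purely illustrative values:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Keep capacity around for a while after load drops; softens the first dip
  # but eventually wears off:
  scale-down-delay: "15m"          # illustrative value
  # Cap how quickly the autoscaler may shrink the deployment per decision:
  max-scale-down-rate: "1.1"       # illustrative value
```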
I think this is worth thinking about, and I might even take a crack at it if no-one hates the idea.
/reopen /remove-lifecycle stale
@julz: Reopened this issue.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/reopen
@psschwei: Reopened this issue.
/remove-lifecycle stale
I'm going to poke around with this some.
/assign
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/lifecycle frozen
@psschwei any news on this one?
I got pulled into some other things and haven't had a chance to look into this yet... still on my list, but if you (or anyone else) wants to take it I don't mind letting it go...
It is fine, it just came up in internal discussions. If I come up with something I will ping you :)
Leaving this open since we want to add an e2e test
/reopen
Going to close this out - e2e PR is open