Autoscaling in Knative and in the Cluster
Right now, when the cluster is full, Knative will try to deploy the revision for ProgressDeadline seconds (120s by default in Knative). If there are not enough resources and the cluster autoscaler (CAS) does not do its job fast enough, the PD will expire, the Autoscaler will wait some 10s more, and then it will finally give up and scale to 0, basically marking the revision as failed.
@jonjohnsonjr added this safeguard long ago to deal with revisions that fail to ever progress, so that we don't end up with zombie revisions that can never succeed.
But in the case above it is possible that the deployment will succeed.
Now, a simple suggestion might be to just crank the PD setting up to 11. But this has the unintended consequence that every deployment now waits that long to fail, even the ones that were going to fail anyway.
Another suggestion is to have the Autoscaler check, after PD+10s has passed, whether the reason for pod unavailability is insufficient resources, and if so, wait an additional Xs rather than mark the revision failed right away. Otherwise, behave as we do now.
Deployments already watch for quota and will finish rolling out the pods once resources become available (https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment).
This seems like a mostly simple change that would save random Knative users from having to tweak PD for this reason.
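To make the idea concrete, here's a rough sketch (not actual Knative code; `isWaitingOnResources` is a hypothetical helper, and matching on the condition message is a heuristic) of how such a check could look, using the core/v1 pod conditions the scheduler already sets:

```go
package sketch

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isWaitingOnResources reports whether a pod is still Pending because the
// scheduler could not place it due to insufficient cluster resources, i.e.
// the case the cluster autoscaler can eventually fix by adding nodes.
func isWaitingOnResources(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		// The scheduler sets PodScheduled=False with reason Unschedulable and a
		// message such as "0/3 nodes are available: 3 Insufficient cpu."
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == corev1.PodReasonUnschedulable {
			return strings.Contains(cond.Message, "Insufficient")
		}
	}
	return false
}
```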
Or I am missing something :-)
/cc @julz @markusthoemmes @mattmoor
Sounds fair enough. This should fix #9432
I'm a little torn. Part of me thinks we should maintain the upstream "k8s-native" (ie very declarative but user hostile ;) behaviour and have ProgressDeadline not be terminal at all. Otherwise we act subtly differently from other k8s things which eventually converge when resources become available, in particular Deployment itself.
For example, why special case insufficient resources with more time but not eg a registry image that isn't available yet? And isn't it odd that insufficient resources gives you a bit more time, but still fails if your cluster is a bit slower today? How do I ask for more time (or less) since we've already overloaded ProgressDeadline?
OTOH getting a clear "this didn't work and we gave up" is a much nicer UX, and being broken with cluster autoscaler is pretty incontrovertibly not a good thing, so 🤷♂️. I guess, tl;dr, I'd be ok if we special-cased this, but I wish there was a nicer / more generic option.
To be clear, I don't like a special-cased "let's wait a bit longer" either. If we can get a signal from the deployment that stuff eventually moved forward, though, I don't see why we wouldn't pick that up and make the revision succeed too :thinking:.
Retrying on a timer is obviously brittle but if the deployment eventually succeeds, we should get a reconcile event anyway, I think?
Thing is, we need to decide when to keep waiting for the Deployment to progress (i.e. we still need to read the pod statuses and find out whether the reason is insufficient resources). If we do find that out, then we can keep trying to achieve initial scale (forever? or with some other timeout?)
Ah, my reading of the above was that we wait the extra X seconds once if we see insufficient resources. We could indeed wait indefinitely for resources (just don't treat ProgressDeadlineExceeded as terminal if we see ResourcesUnavailable), though it still seems kind of odd that we'd wait indefinitely for resources to become available but not, say, for a registry to come back, for an image to become available, or for a dependent service to come up (or any of a variety of other things). Again, subtly different from upstream, which is a bit sad (but maybe it's the only way to make cluster autoscaler reliably work 🤷).
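For the sake of discussion, a minimal sketch (hypothetical, building on the `isWaitingOnResources` helper above) of "don't treat ProgressDeadlineExceeded as terminal while pods are only starved for resources" might look like:

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// isTerminalFailure treats ProgressDeadlineExceeded as terminal only when the
// revision's pods are not simply waiting on the cluster to gain capacity.
func isTerminalFailure(d *appsv1.Deployment, pods []*corev1.Pod) bool {
	for _, cond := range d.Status.Conditions {
		// The deployment controller flips Progressing to False with reason
		// "ProgressDeadlineExceeded" once the deadline passes.
		if cond.Type == appsv1.DeploymentProgressing &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == "ProgressDeadlineExceeded" {
			for _, p := range pods {
				if isWaitingOnResources(p) {
					// Keep the revision pending: the cluster autoscaler may
					// still add the capacity we need.
					return false
				}
			}
			return true
		}
	}
	return false
}
```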
Just to ask the obvious question out loud... "how awful would it be if..." we just generically dropped ProgressDeadlineExceeded being a terminal state? i.e. we don't scale to zero but maintain our desired scale until the user/client tells us to give up. Bonus: that'd also solve the race here which was causing quite a few flakes till we bumped the timeout in https://github.com/knative/serving/pull/8964, iirc. Opposite-of-bonus: backwards-incompatible behaviour if you expect a failed revision to stay failed!
That's sort of what I was getting at: if the deployment is able to recover from any of those failure cases, why wouldn't we? I guess we're saying the same thing, @julz. And I ack that it should apply to all error scenarios the deployment can recover from itself.
The registry stuff is a little tougher I guess, given that we want to resolve the digest and it's unclear when we can retry that :thinking:. I guess an exponential backoff in the background resolve code could solve that though.
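Something like the following, perhaps (a sketch only; `resolveDigest` is a stand-in for the real tag-to-digest resolution code, and the numbers are made up), using apimachinery's wait helpers:

```go
package sketch

import (
	"errors"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// resolveDigest is a placeholder for the real tag-to-digest resolution call.
var resolveDigest = func(image string) (string, error) {
	return "", errors.New("registry unavailable") // stub
}

// resolveWithBackoff retries digest resolution with exponential backoff
// instead of giving up on the first registry hiccup.
func resolveWithBackoff(image string) (string, error) {
	var digest string
	backoff := wait.Backoff{
		Duration: 2 * time.Second, // first retry delay
		Factor:   2.0,             // double the delay each attempt
		Jitter:   0.1,
		Steps:    6, // give up after roughly a minute of retries
	}
	err := wait.ExponentialBackoff(backoff, func() (done bool, err error) {
		d, rerr := resolveDigest(image)
		if rerr != nil {
			return false, nil // treat registry errors as transient; retry
		}
		digest = d
		return true, nil
	})
	return digest, err
}
```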
Ugh yeah I forgot about background resolve for image digests. I vote we pretend I never said images above and treat that as a different issue.
So in general, I'd rather keep the set of cases we consider extendable (possibly infinitely) in the pending state small and fixed, rather than the other way around, since it requires intimate knowledge of which errors the deployment will retry and which it won't (crashing container, cannot pull image, etc.).
For posterity, https://github.com/knative/serving/pull/5077 and https://github.com/knative/serving/pull/4614 are where the scale to zero and propagate deployment status changes were introduced.
There was some discussion about further changes to the condition lifecycle in comments on the PR. I haven't kept up with this project, so I'm not sure if they're still interesting to the conversation.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/reopen /lifecycle-frozen /assign
Preliminarily assigned to me
@vagababov: Reopened this issue.
In response to this:
/reopen /lifecycle-frozen /assign
Preliminarily assigned to me
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/triage accepted
/unassign /assign @dprotaso
Dave's looking at the lifecycle of the Revision, including retryable deployment errors.
Now, a simple suggestion might be to just crank the PD setting up to 11. But this has the unintended consequence that every deployment now waits that long to fail, even the ones that were going to fail anyway.
This was actually done in #10495 (PD was set to 10 minutes to mirror k8s), and with #12751 PD can be customized on a per-revision basis, so I think we may have addressed the core issue already.
Interestingly, this may be related to #12691: the ProgressDeadlineSeconds value seems to correspond to how long a Revision takes to shut down and fail, but not to how long it takes for the Configuration / Service to give up on the Revision.