
Failed revisions can cause unlimited pods to be rapidly created

Open xtaje opened this issue 3 years ago • 5 comments

What version of Knative?

1.5

Expected Behavior

On initial deployment, a Knative revision should fail gracefully after n pod failures instead of flailing, and n should be user-customizable. This would help avoid operator errors that trigger the creation of an unbounded number of pods.

This could be a companion setting to progress-deadline in the config-deployment ConfigMap.
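For illustration, here is a rough sketch of what that could look like; progress-deadline already exists in config-deployment, while max-revision-failures is just a hypothetical name for the proposed setting:

```yaml
# Sketch only: progress-deadline is an existing config-deployment key;
# "max-revision-failures" is a hypothetical name for the proposed setting.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # Existing: how long a deployment may take to make progress before the
  # revision is marked as failed.
  progress-deadline: "600s"
  # Hypothetical: fail the revision after this many pod failures during
  # initial deployment, instead of waiting for the progress deadline.
  max-revision-failures: "5"
```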

An alternative would be to detect when a revision is flailing (e.g., more than n failures per second) and fail it immediately.

See also #4557.

Actual Behavior

Our Knative default setting for ephemeral storage was 750M. The YAML for the kservice did not set resources.requests.ephemeral-storage or resources.limits.ephemeral-storage. Both resources.requests and resources.limits for CPU and memory were 1 and 8144Mi, respectively.

The pod for this kservice would be scheduled and start, then fail ~1-2 minutes later, before becoming ready. During that time the pod would attempt to load a large data model, exhaust its ephemeral storage, and get evicted.

The revision would then try to reschedule the pod, but this time the pod would fail due to OutOfCPU. At that point pods were created rapidly (about a dozen every few seconds), all failing immediately due to OutOfCPU.

Steps to Reproduce the Problem

  • Create a kservice YAML with low ephemeral storage (2Mi), CPU (1), and memory (8144Mi), applied to both the resource requests and limits (see the example manifest after this list).
  • The application in the kservice should consume more than the ephemeral-storage limit during startup, before it becomes ready, so that it is evicted while starting.
  • Apply the YAML. After the first failure, the revision should rapidly churn out pods until the "progress-deadline" is reached. How quickly the pods are created depends on node sizing and other configuration.
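For reference, a minimal manifest along these lines (the image and sizes are illustrative; the container just needs to write more than 2Mi to disk before becoming ready):

```yaml
# Illustrative reproduction manifest; the image is a placeholder and must
# write more than the 2Mi ephemeral-storage limit during startup.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ephemeral-storage-repro
spec:
  template:
    spec:
      containers:
        - image: example.com/model-loader:latest   # placeholder image
          resources:
            requests:
              cpu: "1"
              memory: 8144Mi
              ephemeral-storage: 2Mi
            limits:
              cpu: "1"
              memory: 8144Mi
              ephemeral-storage: 2Mi
```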

xtaje avatar Jul 26 '22 19:07 xtaje

Out of curiosity, does the same behavior happen if this were a Kubernetes service, or is it something specific to Knative?

psschwei avatar Jul 27 '22 12:07 psschwei

Theoretically I think it could but this would require confirmation. A Deployment could have a low ephemeral storage limit, and the Pods would probably end up flailing.

I did find these two related Kubernetes issues, so it may be that the PodSpec will be updated to handle this edge case. I'm not sure if changing the restartPolicy will help with the behavior described above.

  • https://github.com/kubernetes/kubernetes/issues/65797
  • https://github.com/kubernetes/enhancements/issues/3322

The two differences with Knative are that a global ephemeral storage limit is easily configurable via config-defaults, and that the progress-deadline config globally sets a deadline for the revision.
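For reference, the cluster-wide default looks roughly like this in config-defaults (key names taken from the shipped example ConfigMap; worth verifying against your Knative version):

```yaml
# Sketch of the cluster-wide default discussed above; verify the key names
# against the config-defaults example shipped with your Knative version.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: knative-serving
data:
  revision-ephemeral-storage-request: "750M"
  revision-ephemeral-storage-limit: "750M"
```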

xtaje avatar Jul 27 '22 17:07 xtaje

re: restart policy, we're locked into using Always, as discussed in #13041

I could see adding something like a fail-fast option, where the revision would be marked as failed after the first crash. Doing it after N crashes would be a bit more complicated, though perhaps we could just pull it from the pod's restart count (?).... On the other hand, then we're starting to modify the underlying Kubernetes behavior (not that there's anything wrong with that)...
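For context, the per-container restart count is already surfaced in the pod status, e.g. (abridged output of kubectl get pod -o yaml):

```yaml
# Abridged pod status as reported by the kubelet; restartCount is the field
# a fail-after-N-crashes check could read.
status:
  containerStatuses:
    - name: user-container
      restartCount: 4
      lastState:
        terminated:
          reason: Error
          exitCode: 1
```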

@dprotaso what do you think? I could go either way here.

psschwei avatar Jul 27 '22 21:07 psschwei

There is a comment in #13401 describing similar behavior, but for reasons other than ephemeral storage.

There is an outstanding KEP-3322 that adds a maxRestartTimes to the PodSpec restartPolicy. If that is also added to the DeploymentSpec, it may enable a clean solution for the kservice.
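As a rough sketch of how that might eventually look on a pod (hypothetical shape; the final field name and placement are up to the KEP):

```yaml
# Hypothetical shape only, based on the KEP discussion; the actual field
# name and location may differ once/if KEP-3322 lands.
spec:
  restartPolicy: Always
  maxRestartTimes: 5        # give up after 5 restarts instead of retrying forever
  containers:
    - name: user-container
      image: example.com/model-loader:latest   # placeholder image
```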

xtaje avatar Jul 28 '22 17:07 xtaje

The general problem we have is knowing whether a failure is transient or permanent, e.g. image pulls failing due to blips in the network. Even in your example, an exodus of other Pods could free up resources and allow your container to become ready.

For some obvious failures we might want to signal earlier, e.g. failing to schedule due to CPU/memory quota.
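For example, that kind of scheduling failure already surfaces unambiguously in the pod's conditions (abridged; the message text varies by cluster):

```yaml
# Abridged example of how a scheduling failure shows up on the pod today.
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable
      message: "0/3 nodes are available: 3 Insufficient cpu."
```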

I'll take a look at the KEP and circle back - it's good the K8s community is trying to tackle this. Hopefully they're thinking about the semantics of how this bubbles up to the deployment status etc.

dprotaso avatar Jul 29 '22 19:07 dprotaso

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Oct 28 '22 01:10 github-actions[bot]

This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

knative-prow-robot avatar Nov 27 '22 02:11 knative-prow-robot
