Pod failures

Open · jonjohnsonjr opened this issue on Jun 27, 2019 · 15 comments

This is a tracking issue for detecting and surfacing problems with a user's pods. There are a variety of failure modes, and so far we've been dealing with them in a very ad-hoc manner. Let's enumerate them here and start a discussion towards a more deliberate solution so we don't have to continue playing whack-a-mole.

Detection

We currently try to detect pod failures in the revision reconciler when reconciling a deployment. This logic will probably move to the autoscaler, but remains largely the same.

We look at a single pod to determine if:

  1. It could not be scheduled.
  2. The user container terminated.
  3. The user container is waiting for too long.

Since we only look at a single pod, we can only surface issues that always affect every pod in a deployment, e.g. the image cannot be pulled, the container crashes on start, or the cluster has no resources. We should fix this, likely by looking at every pod's status.

It's unclear to me if there's a way to generically detect all of these issues.
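To make "looking at every pod's status" concrete, here is a rough sketch; the bucket and helper names (classifyPods, failureUnschedulable, etc.) are illustrative, not existing serving code:

package podfailures

import (
    corev1 "k8s.io/api/core/v1"
)

// failure is an illustrative bucket; these are not real serving conditions.
type failure string

const (
    failureUnschedulable failure = "Unschedulable"           // case 1 above
    failureTerminated    failure = "UserContainerTerminated" // case 2 above
    failureWaiting       failure = "UserContainerWaiting"    // case 3 above
)

// classifyPods walks every pod's status, so issues that affect only some pods
// (e.g. a partial scale-up hitting a ResourceQuota) are still surfaced.
func classifyPods(pods []*corev1.Pod, userContainer string) map[failure]int {
    seen := map[failure]int{}
    for _, p := range pods {
        for _, cond := range p.Status.Conditions {
            if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
                seen[failureUnschedulable]++
            }
        }
        for _, cs := range p.Status.ContainerStatuses {
            if cs.Name != userContainer {
                continue
            }
            switch {
            case cs.State.Terminated != nil:
                seen[failureTerminated]++
            case cs.State.Waiting != nil && cs.State.Waiting.Reason != "ContainerCreating":
                // A real check for "waiting too long" would also compare the
                // waiting duration against some deadline.
                seen[failureWaiting]++
            }
        }
    }
    return seen
}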

Categorization

Ideally we could distill these issues down to a small set of buckets so we can deal with them in a generic way. I don't have a good answer here, but here's a non-exhaustive list of things we've encountered thus far:

  1. We can't schedule pods because the cluster has insufficient resources: https://github.com/knative/serving/issues/4153 https://github.com/knative/serving/issues/3593
  2. We can't create the deployment because we are out of ResourceQuota: https://github.com/knative/serving/issues/496
  3. We can't scale up the deployment because we are out of ResourceQuota: https://github.com/knative/serving/issues/4416
  4. We can't start the container because we can't pull the image: https://github.com/knative/serving/issues/4192
  5. The container crashes upon starting: https://github.com/knative/serving/issues/499 https://github.com/knative/serving/issues/2145
  6. The container starts, but is eventually killed with OOMKilled: https://github.com/knative/serving/issues/4534

A: For 1, 2, 4, and 5, the revision may never be able to serve traffic, but the failure may also be caused by a temporary issue.

B: For 1 and 3, the revision may be serving traffic, but we are unable to continue scaling.

C: For 6, the revision can serve traffic, but will experience intermittent failures. This could be caused by a memory leak, a query of death, a bug in the code, or insufficient resource limits.

I invite suggestions for names/conditions for these categories. I suspect we'd want to surface these different kinds of failures in different ways...

Reporting

For category A, we definitely want to surface a fatal condition in the Revision status, which should get propagated up to the Configuration and Service status, because the user needs to take some action in order to fix their Revision.

For category B, I suspect we want to do something similar, but with a non-fatal condition -- just informational. The user should take action to unblock the autoscaler, perhaps by notifying the cluster operator. In the case where we can't scale up to minScale, this should probably be fatal.

For category C, the problem will be intermittent, and Kubernetes is designed to handle these failures. The best we could do here is to somehow help the user diagnose these issues by surfacing what happened -- possibly by injecting some information into their logs?
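To make the A/B/C split concrete, here is a purely illustrative sketch of the reporting policy described above; the condition types, severities, and helper names are made up, not the real Revision API:

package reporting

// Illustrative only: these names are hypothetical.
type Category string

const (
    CategoryA Category = "A" // may never serve traffic; user action required
    CategoryB Category = "B" // serving, but unable to continue scaling
    CategoryC Category = "C" // intermittent failures; surface for debugging
)

type Condition struct {
    Type     string
    Status   string // "True", "False", "Unknown"
    Severity string // "Error" blocks readiness; "Info" is advisory
    Reason   string
    Message  string
}

// conditionFor sketches the policy: A is fatal, B is informational unless
// minScale can't be met, and C is left to logs/events.
func conditionFor(cat Category, minScaleUnmet bool, reason, msg string) *Condition {
    switch cat {
    case CategoryA:
        // Fatal: propagate a failed condition up from the Revision.
        return &Condition{Type: "ResourcesAvailable", Status: "False", Severity: "Error", Reason: reason, Message: msg}
    case CategoryB:
        if minScaleUnmet {
            // Unable to reach minScale: treat like category A.
            return &Condition{Type: "ResourcesAvailable", Status: "False", Severity: "Error", Reason: reason, Message: msg}
        }
        // Informational: the revision still serves, but scaling is blocked.
        return &Condition{Type: "ScalingBlocked", Status: "True", Severity: "Info", Reason: reason, Message: msg}
    default:
        // Category C: no condition; surface via logs/events instead.
        return nil
    }
}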

jonjohnsonjr avatar Jun 27 '19 18:06 jonjohnsonjr

/cc @mattmoor

Took a stab at this just to get my thoughts written down. I'll try to keep this updated to reflect any discussion here or newly discovered failure modes. LMK if I missed anything.

jonjohnsonjr avatar Jun 27 '19 19:06 jonjohnsonjr

@jonjohnsonjr had a chance to read through. Great writeup, and I agree with your categorization. Looking forward to seeing it in action :)

mattmoor avatar Jun 29 '19 02:06 mattmoor

I agree with the proposed approach for Categories A and B.

As for Category C: Are K8s Events not sufficient for this currently? There should already be events sent when a Pod dies. OOM conditions etc. should be detected by K8s already.

markusthoemmes avatar Jul 10 '19 13:07 markusthoemmes

I agree with the proposed approach for Categories A and B.

As for Category C: Are K8s Events not sufficient for this currently? There should already be events sent when a Pod dies. OOM conditions etc. should be detected by K8s already.

The event doesn't include the information to identify the revision/pod/container where the event happened.

yanweiguo avatar Jul 10 '19 16:07 yanweiguo

1 & 3 presume you're not scaled to 0. Otherwise you won't even start to serve the traffic.

5 - might be temporary, since lots of code is written as

conn, err := db.Connect()
if err != nil {
    logger.Fatal(err)
}

Totally doesn't mean it won't be able to restart next time around.

4 - probably depends on why we can't pull the image. Registry unreachable? A retry might solve it. Permission issues? Might be solved, but god knows when, so I guess terminal is fine.

vagababov avatar Jul 10 '19 18:07 vagababov

The event doesn't include the information to identify the revision/pod/container where the event happened.

To expand on this a bit...

For the pod, OOMKilled is determined by the kubelet and eventually gets surfaced via the pod status:

apiVersion: v1
kind: Pod
status:
  containerStatuses:
  - containerID: docker://5c1f763e17c2ed6ae7a3d911ee48624ed544a8be028e3e8019841b8e03a03613
    lastState:
      terminated:
        containerID: docker://ce6263b07bfb2ea2b844737fb79346e424d44965e70a00159108c015f82fd0e1
        exitCode: 137
        finishedAt: "2019-07-10T18:25:00Z"
        message: ""
        reason: OOMKilled
        startedAt: "2019-07-10T18:24:49Z"

The actual event is generated by a different component:

{
  "apiVersion": "v1",
  "count": 1,
  "eventTime": null,
  "firstTimestamp": "2019-07-10T18:13:20Z",
  "involvedObject": {
    "kind": "Node",
    "name": "gke-knative-test-default-pool-27843825-6t10",
    "uid": "gke-knative-test-default-pool-27843825-6t10"
  },
  "kind": "Event",
  "lastTimestamp": "2019-07-10T18:13:20Z",
  "message": "Memory cgroup out of memory: Kill process 2299221 (autoscale) score 2219 or sacrifice child\nKilled process 2299221 (autoscale) total-vm:10751196kB, anon-rss:49156kB, file-rss:15872kB, shmem-rss:0kB",
  "metadata": {
    "creationTimestamp": "2019-07-10T18:13:20Z",
    "name": "gke-knative-test-default-pool-27843825-6t10.15b01e503ea2837f",
    "namespace": "default",
    "resourceVersion": "137114",
    "selfLink": "/api/v1/namespaces/default/events/gke-knative-test-default-pool-27843825-6t10.15b01e503ea2837f",
    "uid": "660d8130-a33e-11e9-bb1b-42010a800221"
  },
  "reason": "OOMKilling",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "kernel-monitor",
    "host": "gke-knative-test-default-pool-27843825-6t10"
  },
  "type": "Warning"
}

There's nothing in here to associate the event with the pod. This part needs to get fixed upstream, but even if that's fixed, it's not a great user experience. Events are garbage collected pretty aggressively, so I might miss that event. Kubernetes will happily restart my pod, which is what I want it to do, but I might miss that pod status.

I'd really like to know that this is happening in case it's a bug. Having something in our logs when this happens would be nice (since logs are usually persisted longer than events) and would allow me to correlate which requests are causing my service to OOM.

If we can get pod info into the OOM events, we could listen for OOM events, extract the pod (this part is currently too expensive, though we could work around it), associate that pod with the revision/configuration/service (via our labels), and log that somewhere.

We might not want to do this for this particular case, but it would be great to have some mechanism for logging ephemeral issues with pods so that they can be associated with a revision to aid in debugging.
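For illustration only, a sketch of that flow, assuming (contrary to today's behavior, as noted above) that the OOM event's involvedObject referenced the Pod, and assuming revision pods carry the serving.knative.dev/revision label:

package oomlog

import (
    "context"
    "log"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// logOOM correlates an OOM event back to the owning revision and logs it.
// Hypothetical: today the event references the Node, not the Pod.
func logOOM(ctx context.Context, kc kubernetes.Interface, ev *corev1.Event) {
    if ev.Reason != "OOMKilling" || ev.InvolvedObject.Kind != "Pod" {
        return
    }
    pod, err := kc.CoreV1().Pods(ev.InvolvedObject.Namespace).Get(ctx, ev.InvolvedObject.Name, metav1.GetOptions{})
    if err != nil {
        return
    }
    // serving.knative.dev/revision is the label Knative puts on revision pods.
    rev := pod.Labels["serving.knative.dev/revision"]
    log.Printf("pod %s/%s (revision %q) was OOM killed: %s", pod.Namespace, pod.Name, rev, ev.Message)
}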

jonjohnsonjr avatar Jul 10 '19 19:07 jonjohnsonjr

@vagababov you make a good point: the cause of the issue is separate from whether or not we are serving any traffic. There's a difference between:

  1. "there is something wrong that probably needs to be fixed by someone" and
  2. "there is something wrong and we should not route traffic to this revision".

These categories might not be so great... maybe we should think in terms of why it's a problem:

A: Unable to scale to minScale.
B: Able to scale to minScale, but unable to scale further.
C: Some ephemeral problem is affecting our deployment.

Then the cause of the issue can be separated into: pulling, scheduling, running... or something?
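A tiny sketch of that split, with placeholder names, just to illustrate separating the impact ("why it's a problem") from the cause:

package categories

// Placeholder names, purely to illustrate the two axes.
type Impact string

const (
    ImpactCannotReachMinScale Impact = "CannotReachMinScale" // A
    ImpactCannotScaleFurther  Impact = "CannotScaleFurther"  // B
    ImpactEphemeral           Impact = "Ephemeral"           // C
)

type Cause string

const (
    CausePulling    Cause = "Pulling"
    CauseScheduling Cause = "Scheduling"
    CauseRunning    Cause = "Running"
)

// A detected problem carries both axes, so reporting can key off the impact
// while the message explains the cause.
type Problem struct {
    Impact Impact
    Cause  Cause
    Detail string
}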

jonjohnsonjr avatar Jul 10 '19 19:07 jonjohnsonjr

Move to 0.11.

dgerd avatar Oct 23 '19 18:10 dgerd

/remove-lifecycle rotten /remove-lifecycle stale /lifecycle frozen

vagababov avatar Jun 24 '20 16:06 vagababov

It looks like we froze this 6 months ago.

It seems like the next step is probably a proposal similar to https://github.com/knative/serving/issues/4557#issuecomment-510059033

/triage accepted

evankanderson avatar Mar 22 '21 07:03 evankanderson

I've been looking at this recently, and it definitely feels like something where there's still significant room for improvement, especially in terms of propagating errors after the first scale-up. I'll write up a proposal (for some MVP part of this to get started, at least).

cc @duglin FYI because I know you're interested in this area and may have thoughts also

cc @dprotaso because I think you've been looking in a similar area too?

Related: https://github.com/knative/serving/issues/11717, https://github.com/knative/serving/issues/6504

/assign

julz avatar Sep 08 '21 07:09 julz

To add more fun... aside from errors that can be correlated back to some user action (e.g. deploying the ksvc, trying to scale up/down, ...), another flavor of issue shows up when everything gets into a happy state but then something goes wrong. For example, permission to the image in the registry is revoked, so only new pods fail. Similar to an OOM issue, but a bit more (potentially) long-term and not something that Kube can fix on its own.

From our experience, the real issue is how the end user gets notified that something is wrong, aside from their app misbehaving. Knative does a good job of abstracting away the infrastructure and complexity of Kube in the happy-path cases, so I'm glad to see this issue is trying to help in the not-so-happy cases. And if we can find a way to let people continue to live at the "ksvc" level and not drop down into deployments/rs/pods to see there was an error, that would be great.

Just to comment on one particular line from the original comment:

The user container is waiting for too long.

I'm putting the error case of "the user specified the wrong port number" into this category. This is probably one of the most common error cases our users run into, and it's not always clear to them why things look like they're hung. In some cases we get lucky and the user logged a message like "listening on port 7070", and then we can ask them "did you specify a port on the ksvc?". But the problem here is that some people don't think to check the logs on their own. They'll just look at the status of the ksvc. Sometimes we don't even get lucky enough to have log output to help.

Cases like this aren't really "errors" in the normal sense that we get an error message and then need to find a way to bubble it up. Rather, it's more that something just isn't happening quickly enough, so it might be an indication that something is wrong... but there's no guarantee. In cases like this, I've wondered if there's something relatively minor we can do - for example, rather than having a message in the ksvc that says RevisionMissing : Configuration "badport" is waiting for a Revision to become ready, a simple tweak to the message to add something like "waiting for the service to be available on port xxx". Just something to put the idea into the user's head that perhaps there's something they should check, even though we don't have a concrete "error condition" yet.
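A minimal sketch of the kind of hint being suggested, assuming we know the declared containerPort and the pod IP; the helper name and message are hypothetical, not existing serving behavior:

package porthint

import (
    "fmt"
    "net"
    "time"
)

// waitingMessage returns a hint naming the declared port while the user
// container isn't yet accepting connections on it, and "" once it is.
func waitingMessage(podIP string, port int) string {
    addr := fmt.Sprintf("%s:%d", podIP, port)
    conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond)
    if err != nil {
        return fmt.Sprintf("waiting for the user container to start listening on port %d", port)
    }
    conn.Close()
    return ""
}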

duglin avatar Sep 08 '21 11:09 duglin

And if we can find a way to let people continue to live at the "ksvc" level and not drop down into deployments/rs/pods to see there was an error, that would be great.

This is exactly what we want. Is there anyone working on this now?

wuyafang avatar Jul 12 '22 02:07 wuyafang