Failing pod never times out
In what area(s)?
/area autoscale
What version of Knative?
HEAD
Expected Behavior
When a pod fails to start properly, it should eventually be terminated.
Actual Behavior
If an instance/pod fails to start and it is the first time the revision is starting a pod, then the pod will eventually be terminated. But if the first instance of the revision starts OK and then scales down to zero, and the next instance/pod that is created fails to start, then that pod will continually crash-loop (which is expected) but it'll never be terminated and never goes away.
It seems like there should be consistency between a "first time pod" and a "2+ time pod" w.r.t. what happens when it crashes.
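A quick way to observe the difference is to watch the service's pods after the scale-from-zero request (assuming the pods carry Knative's serving.knative.dev/service label, and using the service name from the repro script below):
# Assumes the serving.knative.dev/service pod label and the default namespace.
# The crash-looping pod created after scale-from-zero appears to stay in
# CrashLoopBackOff indefinitely instead of eventually being removed.
kubectl get pods -l serving.knative.dev/service=bugsvc -w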
Steps to Reproduce the Problem
You can reproduce this by running this bash script:
#!/bin/bash
set -e
kubectl delete ksvc/bugsvc > /dev/null 2>&1 || true
kubectl delete ksvc/bugsvc2 > /dev/null 2>&1 || true
export CRASH=$(( $(date -u '+%s') + 120))
echo "Time now: ${now:15:5}
echo "Will die: ${CRASH}
kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: bugsvc
spec:
  template:
    spec:
      containers:
      - image: duglin/echo
        env:
        - name: CRASH
          value: "${CRASH}"
EOF
sleep 10
URL=$(kubectl get ksvc/bugsvc -o custom-columns=URL:.status.url --no-headers)
echo "Send curl just to make sure it works"
curl $URL
echo "Wait for it to scale to zero"
while kubectl get pods | grep bugsvc ; do
sleep 10
done
echo "Sleep for 2 minutes just to make sure we're past the crash time"
sleep 120
echo "Create bugsvc2 so it fails immediately"
kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: bugsvc2
spec:
  template:
    spec:
      containers:
      - image: duglin/echo
        env:
        - name: CRASH
          value: "true"
EOF
echo "Now curl bugsvc again to force it to scale up to 1"
curl $URL &
echo "Pods should be failing, but bugsvc2 will eventually vanish"
kubectl get pods -w
The image used will crash if it is started after the time in the CRASH env var. So, in the case of bugsvc, we create the ksvc before the CRASH time, let it scale down to zero, then hit it after the CRASH time so that the new pod fails. KnService bugsvc2 crashes immediately to show how that pod will be removed (for me, after about 2 minutes) while bugsvc's pod seems to live forever (or at least a LOT longer).
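For reference, here is a minimal sketch of an entrypoint with the behavior described above (this is an assumption about how duglin/echo behaves, not its actual source; the real image also serves HTTP):
#!/bin/bash
# Hypothetical entrypoint approximating the crash behavior described above.
# CRASH is assumed to be either "true" (crash immediately) or an epoch
# timestamp after which any newly started container should fail.
if [ "${CRASH}" = "true" ]; then
  echo "CRASH=true, exiting immediately" >&2
  exit 1
fi
if [ -n "${CRASH}" ] && [ "$(date -u '+%s')" -ge "${CRASH}" ]; then
  echo "Started after CRASH time (${CRASH}), exiting" >&2
  exit 1
fi
# Stand-in for the real echo server; just keep the container alive.
exec sleep infinity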
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/lifecycle frozen
I think we should start thinking about this more holistically. This is also related to #4557, which is a broader issue about dealing with and surfacing pod issues past the first pod.
@vagababov is looking at this realm now. I'll assign since it'll likely impact the behavior here too.
/assign @vagababov
Have we made any progress here?
It looks like Doug produced at least a partial template for re-creating this. Does the Pod hang around forever even if there are no more requests going to it and the request timeout has passed?
/triage needs-user-input
(So that next week's oncall reads the answers)
/unassign
/help /remove-triage needs-user-input /triage accepted
@evankanderson: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
In response to this:
/help /remove-triage needs-user-input /triage accepted
Yes, I believe it will live on even without requests.
I started looking into this issue. After going through the reproduction steps in the description, I observed that the pods do get terminated eventually. I also wrote a small application on my developer machine with similar behaviour (it fails/panics after a couple of minutes), and I can still see its pods getting terminated. It seems like this may not be an issue anymore.
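For anyone re-verifying, the pod's restart and back-off history can also be checked via events (service name taken from the repro script above; default namespace assumed):
# Assumes the default namespace and the bugsvc name from the repro script.
# If cleanup now works, the events should end in a deletion rather than an
# indefinite series of BackOff/restart entries.
kubectl get events --sort-by=.lastTimestamp | grep bugsvc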
/assign
Related to https://github.com/knative/serving/issues/13677
@jsanin-vmw Digging a bit deeper, this is related to but not solved by #13677, so let's keep it separate.