Failing pod never times out
In what area(s)?
/area autoscale
What version of Knative?
HEAD
Expected Behavior
When a pod fails to start properly, it should eventually be terminated.
Actual Behavior
If an instance/pod fails to start and it is the first time the revision is starting a pod, then the pod will eventually be terminated. But if the first instance of the revision starts OK and then scales down to zero, and the next instance/pod that is created fails to start, then that pod will continually crash-loop (which is expected) but it'll never be terminated and never goes away.
It seems like there should be consistency between a "first time pod" and a "2+ time pod" w.r.t. what happens when it crashes.
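A quick way to observe the difference is to watch the service's pods after the scale-from-zero request (assuming the pods carry Knative's serving.knative.dev/service label, and using the service name from the repro script below):
# Assumes the serving.knative.dev/service pod label and the default namespace.
# The crash-looping pod created after scale-from-zero appears to stay in
# CrashLoopBackOff indefinitely instead of eventually being removed.
kubectl get pods -l serving.knative.dev/service=bugsvc -w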
Steps to Reproduce the Problem
You can reproduce this by running this bash script:
#!/bin/bash
set -e
kubectl delete ksvc/bugsvc > /dev/null 2>&1 || true
kubectl delete ksvc/bugsvc2 > /dev/null 2>&1 || true
export CRASH=$(( $(date -u '+%s') + 120))
echo "Time now: ${now:15:5}
echo "Will die: ${CRASH}
kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: bugsvc
spec:
  template:
    spec:
      containers:
      - image: duglin/echo
        env:
        - name: CRASH
          value: "${CRASH}"
EOF
sleep 10
URL=$(kubectl get ksvc/bugsvc -o custom-columns=URL:.status.url --no-headers)
echo "Send curl just to make sure it works"
curl $URL
echo "Wait for it to scale to zero"
while kubectl get pods | grep bugsvc ; do
sleep 10
done
echo "Sleep for 2 minutes just to make sure we're past the crash time"
sleep 120
echo "Create bugsvc2 so it fails immediately"
kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: bugsvc2
spec:
  template:
    spec:
      containers:
      - image: duglin/echo
        env:
        - name: CRASH
          value: "true"
EOF
echo "Now curl bugsvc again to force it to scale up to 1"
curl $URL &
echo "Pods should be failing, but bugsvc2 will eventually vanish"
kubectl get pods -w
The image used will crash if it is started after the time in the CRASH env var. So, in the case of bugsvc, we create the ksvc before the CRASH time, let it scale down to zero, then hit it after the CRASH time so that the new pod fails. KnService bugsvc2 crashes immediately to show how that pod will be removed (for me, after about 2 minutes) while bugsvc's pod seems to live forever (or at least a LOT longer).
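For reference, here is a minimal sketch of an entrypoint with the behavior described above (this is an assumption about how duglin/echo behaves, not its actual source; the real image also serves HTTP):
#!/bin/bash
# Hypothetical entrypoint approximating the crash behavior described above.
# CRASH is assumed to be either "true" (crash immediately) or an epoch
# timestamp after which any newly started container should fail.
if [ "${CRASH}" = "true" ]; then
  echo "CRASH=true, exiting immediately" >&2
  exit 1
fi
if [ -n "${CRASH}" ] && [ "$(date -u '+%s')" -ge "${CRASH}" ]; then
  echo "Started after CRASH time (${CRASH}), exiting" >&2
  exit 1
fi
# Stand-in for the real echo server; just keep the container alive.
exec sleep infinity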
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/lifecycle frozen
I think we should start thinking about this more holistically. This is also related to #4557, which is a broader issue about dealing with and surfacing pod issues past the first pod.
@vagababov is looking at this realm now. I'll assign since it'll likely impact the behavior here too.
/assign @vagababov
Have we made any progress here?
It looks like Doug produced at least a partial template for re-creating this. Does the Pod hang around forever even if there are no more requests going to it and the request timeout has passed?
/triage needs-user-input
(So that next week's oncall reads the answers)
/unassign
/help /remove-triage needs-user-input /triage accepted
@evankanderson: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
In response to this:
/help /remove-triage needs-user-input /triage accepted
Yes, I believe it will live on even without requests.
I started looking into this issue. After going through the reproduction steps in the description, I observed that the pods do get terminated eventually. I also wrote a small application on my developer machine with similar behaviour (it fails/panics after a couple of minutes), and I can still see its pods getting terminated. It seems like this may not be an issue anymore.
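For anyone re-verifying, the pod's restart and back-off history can also be checked via events (service name taken from the repro script above; default namespace assumed):
# Assumes the default namespace and the bugsvc name from the repro script.
# If cleanup now works, the events should end in a deletion rather than an
# indefinite series of BackOff/restart entries.
kubectl get events --sort-by=.lastTimestamp | grep bugsvc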
/assign
Related to https://github.com/knative/serving/issues/13677
@jsanin-vmw Digging a bit deeper, this is related to but not solved by #13677, so let's keep it separate.