serving
Initial Revisions (at least) with crash-looping pods take a long time to terminate/clean up Pods
What version of Knative?
v1.0 at least
Expected Behavior
When creating a Revision with a pod which exits immediately, the Revision should (fairly quickly) report that Ready is False and terminate the Pods.
Actual Behavior
The pods stick around in CrashLoopBackOff for many restarts, and the Revision remains in "unknown" status for many minutes and eventually times out.
Steps to Reproduce the Problem
In one shell:
kn service create crasher --image nicolaka/netshoot # or even a "bash" image
Watch this stall out, check on the pods with kubectl get po, etc.
In a second shell:
kn service update crasher --image projects.registry.vmware.com/tanzu_serverless/hello-yeti
The first kn service create will complete, and the service will be ready to serve!
BUT
The first Revision will still be in unknown status, and the Pod will still be present in CrashLoopBackOff, even many minutes after the failure.
After approximately 10 minutes, the Pod will finally be cleaned up, but the reported status.desiredScale for the KPA resource is still -1 at the end of that time.
(It seems to take about 10 minutes to fill in the spec.Reachability field, which I suspect is what is causing the scale back down to zero.)
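To see the stuck state directly, the KPA's desired scale and reachability can be inspected with jsonpath queries. This is a sketch assuming the resource names from the reproduction above and the lowercase field names used by the PodAutoscaler API (verify against your Serving version):

```shell
# Desired scale reported by the first revision's PodAutoscaler;
# in the buggy state this stays at -1 ("unknown") instead of going to 0.
kubectl get kpa crasher-00001 -o jsonpath='{.status.desiredScale}{"\n"}'

# Reachability hint the reconciler writes into the KPA spec; it appears
# to remain unset/unknown until the progress deadline is hit.
kubectl get kpa crasher-00001 -o jsonpath='{.spec.reachability}{"\n"}'
```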
/good-first-issue
@evankanderson: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
In response to this:
/good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Poked around at this a bit, using Serving 1.2. Here's what I saw:
It looks like the initial revision gets set to Ready: False
pretty quickly (almost simultaneous with the second revision getting ready).
The long time to remove the pod comes from the ProgressDeadline, which defaults to 10m. Once that is hit, the queue-proxy gets a TERM signal and goes through its usual shutdown process.
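For anyone reproducing this who doesn't want to wait 10 minutes, the deadline is tunable cluster-wide. A sketch of lowering it via the config-deployment ConfigMap, assuming the progress-deadline key supported by recent Serving releases (check your version's documented keys):

```shell
# Lower the deployment progress deadline from the 600s default to 120s so
# crash-looping revisions are marked failed (and scaled down) sooner.
kubectl -n knative-serving patch configmap config-deployment \
  --type merge -p '{"data":{"progress-deadline":"120s"}}'
```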
I think the KPA stayed at -1 until I curled the working service (I don't remember whether I curled it right when the second revision came up).
(I also tried Serving 1.0 and likewise saw Ready: False pretty quickly.)
Assuming I'm understanding what's going on here correctly (insert usual caveat about assuming :smile: ), would this be a fix to a special case where revision 1 is replaced by revision 2 before rev1 becomes ready?
I haven't tested, but I think this also happens when you deploy a crashing version after a successful version is installed.
I suspect that there is a special case here (since initialization is a bit of a special case already).
hey @evankanderson, I am a beginner in DevOps and have taken beginner courses on Linux, Docker, and Kubernetes, plus Go and Python basics. Am I eligible to contribute to these tools with this much knowledge? If not, could you suggest what I should learn in order to get started?
Hi @muhammedanaskhan,
I think this could be a good first issue for you.
I'd start out by figuring out how to get a dev build of Knative Serving running in your environment. (Just built from main or a branch with no changes.)
After that, I'd try to reproduce the issue on your cluster without making any code changes. Once you can do that, I'd start making the changes suggested earlier in this thread (maybe start by initializing the serverlessservice scale to 1 rather than -1). Once you think you have the problem solved, I'd push the change to a branch on your fork and send a PR which references this issue.
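The dev-build step above is usually done with ko, which builds the controller images from source and applies the manifests. A sketch assuming you have ko installed, a kubeconfig pointing at a test cluster, and a registry you can push to (the registry name here is a placeholder):

```shell
# Point ko at a registry you control (placeholder value; substitute your own).
export KO_DOCKER_REPO=registry.example.com/my-dev-images

# Build Serving from source and deploy it to the current cluster.
git clone https://github.com/knative/serving.git
cd serving
ko apply -Rf config/core/
```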
If you can figure out how, a unit test would also be a good addition to your PR. If you can't figure out the best test scenario, we can certainly help with that -- mention it in your PR.
@muhammedanaskhan -- if you're still interested, I think I saw an interesting hint in https://github.com/knative/serving/issues/9531#issuecomment-1099411896; I suspect that we should be connecting how long a Configuration takes to give up on a Revision with the ProgressDeadlineSeconds of the Deployment underlying the Revision.
A couple of extra details, in case they help.
Here's the KPA status changes over the course of the steps that Evan described:
$ k get kpa -w
NAME DESIREDSCALE ACTUALSCALE READY REASON
crasher-00001 -1 0 Unknown Queued
crasher-00002
crasher-00002 -1 0 Unknown Queued
crasher-00002 -1 0 Unknown Queued
crasher-00002 -1 1 Unknown NotReady
crasher-00002 -1 1 Unknown NotReady
crasher-00002 -1 1 True
crasher-00002 -1 1 True
crasher-00002 -1 1 True
crasher-00001 -1 0 Unknown Queued
crasher-00001 -1 0 Unknown Queued
crasher-00002 1 1 True
Final state:
$ k get kpa
NAME DESIREDSCALE ACTUALSCALE READY REASON
crasher-00001 -1 0 Unknown Queued
crasher-00002 1 1 True
Also, desired scale remains at -1 for the first KPA, even after the progress-deadline period:
$ k get kpa crasher-00001
NAME DESIREDSCALE ACTUALSCALE READY REASON
crasher-00001 -1 0 False NoTraffic
$ k get revision crasher-00001
NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON ACTUAL REPLICAS DESIRED REPLICAS
crasher-00001 crasher 1 False ProgressDeadlineExceeded 0
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/reopen
@evankanderson: Reopened this issue.
In response to this:
/reopen
/remove-lifecycle stale
/assign
FYI: https://github.com/kubernetes/kubernetes/issues/106697
my observation is similar to Paul's comment above: the revision gets set to Ready: False right away, and the pod waits for the default 10-minute ProgressDeadline before getting cleaned up
I suspect that we should be connecting how long a Configuration takes to give up on a Revision with the ProgressDeadlineSeconds of the Deployment underlying the Revision.
and I wonder if the recent PR "Fix LatestReadyRevision semantics - it only advances forward" has fixed this edge case, where the Configuration doesn't fail if the latest Revision has an issue
/unassign @nader-ziada
/assign itsdarshankumar
@dprotaso: GitHub didn't allow me to assign the following users: itsdarshankumar.
Note that only knative members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
In response to this:
/assign itsdarshankumar
hey @dprotaso i'll take this and try my hands on it
/assign @itsdarshankumar
related: I didn't realize maxUnavailable affects the Available status condition
https://github.com/kubernetes/kubernetes/issues/106697#issuecomment-1369672284
hey @itsdarshankumar are you still working on it?
yea I was working on it but due to some other works, wasn't able to wrap this up, you may start on this if you wish:)
ok sure
/assign @keshavcodex
/unassign @itsdarshankumar
Hi @dprotaso, I am new to Knative and have been looking into this issue for many days, but I could not figure out the right direction to solve it. I have reproduced the same error. So far I have observed that:
- When we run kubectl describe pods we get the output below
Here the failed readiness probe is the issue. The application is exiting immediately after completing, and therefore restarting again and again. But I am unable to find the exact files I should look into to solve this issue. Can you please suggest how I should proceed?
@keshavcodex are you still working on this issue?
@SomilJain0112 we expect you to run a container that serves HTTP on $PORT (default value is 8080). We probe that port to see if it is ready.
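In other words, the crash happens because the image exits instead of listening on the injected $PORT. A minimal sketch using the commonly published Knative sample image (substitute any image of your own that serves HTTP on $PORT):

```shell
# Deploy a container that actually listens on $PORT, so the readiness
# probe succeeds and the revision becomes Ready instead of crash-looping.
kn service create hello --image gcr.io/knative-samples/helloworld-go
kn service describe hello
```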
I created https://github.com/knative/serving/pull/14248 to delete the old revisions but that's unlikely to be correct. We probably want them still around, just not using up compute. How can I shut down the pods without deleting the entire revision? Please correct me if my work is heading in the wrong direction as well.