
Initial Revisions (at least) with crash-looping pods take a long time to terminate/clean up Pods

Open evankanderson opened this issue 2 years ago • 17 comments

What version of Knative?

v1.0 at least

Expected Behavior

When creating a Revision with a Pod that exits immediately, the Revision should (fairly quickly) report that Ready is False and terminate the Pods.

Actual Behavior

The pods stick around in CrashLoopBackOff for many restarts, and the Revision remains in "unknown" status for many minutes and eventually times out.

Steps to Reproduce the Problem

In one shell:

kn service create crasher --image nicolaka/netshoot  # or even a "bash" image

Watch this stall out and check on the pods with kubectl get po, etc.

In a second shell:

kn service update crasher --image projects.registry.vmware.com/tanzu_serverless/hello-yeti

The first kn service create will complete, and the service will be ready to serve!

BUT

The first Revision will still be in unknown status, and the Pod will still be present in CrashLoopBackOff, even many minutes after the failure.

After approximately 10 minutes, the Pod will finally be cleaned up, but the reported status.desiredScale for the KPA resource is still -1 at the end of that time.
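For anyone reproducing this, a quick way to watch the lingering Pod and the autoscaler state is below (a sketch assuming the crasher example above; serving.knative.dev/revision is the label Knative puts on a Revision's Pods):

$ kubectl get pods -l serving.knative.dev/revision=crasher-00001 -w
$ kubectl get kpa crasher-00001 -w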

evankanderson avatar Mar 03 '22 20:03 evankanderson

(It seems to take about 10 minutes to fill in the spec.Reachability field, which I suspect is what triggers the scale back down to zero.)
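A quick way to watch that field on the PodAutoscaler (a sketch assuming the crasher example above; an empty result means reachability is still unset):

$ kubectl get kpa crasher-00001 -o jsonpath='{.spec.reachability}'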

evankanderson avatar Mar 03 '22 20:03 evankanderson

/good-first-issue

evankanderson avatar Mar 03 '22 20:03 evankanderson

@evankanderson: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow-robot avatar Mar 03 '22 20:03 knative-prow-robot

Poked around at this a bit, using Serving 1.2. Here's what I saw:

It looks like the initial revision gets set to Ready: False pretty quickly (almost simultaneously with the second revision becoming ready).

The long time to remove the pod comes from the default ProgressDeadline, which is 10 minutes. Once that is hit, the queue-proxy (QP) gets a TERM signal and goes through its usual shutdown process.
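For anyone experimenting with this, that deadline is tunable cluster-wide; a sketch (the progress-deadline key in the config-deployment ConfigMap is my understanding of the knob, so double-check it against your Serving version):

$ kubectl patch configmap/config-deployment -n knative-serving \
    --type merge -p '{"data":{"progress-deadline":"120s"}}'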

I think the KPA desired scale stayed at -1 until I curled the working service (I don't remember whether I curled it right when the second revision came up).

(I also tried Serving 1.0 and likewise saw Ready: False pretty quickly.)

Assuming I'm understanding what's going on here correctly (insert usual caveat about assuming :smile: ), would this be a fix for a special case where revision 1 is replaced by revision 2 before rev1 becomes ready?

psschwei avatar Mar 03 '22 21:03 psschwei

I haven't tested, but I think this also happens when you deploy a crashing version after a successful version is installed.

I suspect that there is a special case here (since initialization is a bit of a special case already).

evankanderson avatar Mar 04 '22 15:03 evankanderson

hey @evankanderson, I am a beginner in DevOps. I have taken beginner courses on Linux, Docker, and Kubernetes, and know Go and Python basics. Am I eligible to contribute to the tools with this much knowledge? If not, could you suggest what else I should learn in order to get started?

muhammedanaskhan avatar Mar 20 '22 16:03 muhammedanaskhan

Hi @muhammedanaskhan,

I think this could be a good first issue for you.

I'd start out by figuring out how to get a dev build of Knative serving running in your environment. (Just built from main or a branch with no changes.)

After that, I'd try to reproduce the issue on your cluster without making any code changes. Once you can do that, I'd start making the changes suggested earlier in this thread (maybe start by initializing the serverlessservice scale to 1 rather than -1). Once you think you have the problem solved, I'd push the change to a branch on your fork and send a PR which references this issue.
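(If it helps, the usual dev loop looks something like the following, per the repo's DEVELOPMENT.md; KO_DOCKER_REPO is a placeholder for whatever registry your cluster can pull from:)

$ export KO_DOCKER_REPO=docker.io/your-user   # hypothetical registry
$ ko apply -Rf config/core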

evankanderson avatar Mar 20 '22 17:03 evankanderson

If you can figure out how, a unit test would also be a good addition to your PR. If you can't figure out the best test scenario, we can certainly help with that; just mention it in your PR.

evankanderson avatar Mar 20 '22 17:03 evankanderson

@muhammedanaskhan -- if you're still interested, I think I saw an interesting hint in https://github.com/knative/serving/issues/9531#issuecomment-1099411896; I suspect that we should be connecting how long a Configuration takes to give up on a Revision with the ProgressDeadlineSeconds of the Deployment underlying the Revision.
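(To check what the Deployment actually carries, something like the following should work; the -deployment suffix is Knative's naming convention for the Deployment backing a Revision:)

$ kubectl get deployment crasher-00001-deployment \
    -o jsonpath='{.spec.progressDeadlineSeconds}'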

evankanderson avatar Apr 14 '22 17:04 evankanderson

A couple of extra details, in case they help.

Here's the KPA status changes over the course of the steps that Evan described:

$ k get kpa -w
NAME            DESIREDSCALE   ACTUALSCALE   READY     REASON
crasher-00001   -1             0             Unknown   Queued
crasher-00002                                          
crasher-00002   -1             0             Unknown   Queued
crasher-00002   -1             0             Unknown   Queued
crasher-00002   -1             1             Unknown   NotReady
crasher-00002   -1             1             Unknown   NotReady
crasher-00002   -1             1             True      
crasher-00002   -1             1             True      
crasher-00002   -1             1             True      
crasher-00001   -1             0             Unknown   Queued
crasher-00001   -1             0             Unknown   Queued
crasher-00002   1              1             True      

Final state:

$ k get kpa 
NAME            DESIREDSCALE   ACTUALSCALE   READY     REASON
crasher-00001   -1             0             Unknown   Queued
crasher-00002   1              1             True      

Also, desired scale remains at -1 for the first KPA, even after the progress-deadline period:

$ k get kpa crasher-00001 
NAME            DESIREDSCALE   ACTUALSCALE   READY   REASON
crasher-00001   -1             0             False   NoTraffic

$ k get revision crasher-00001
NAME            CONFIG NAME   K8S SERVICE NAME   GENERATION   READY   REASON                     ACTUAL REPLICAS   DESIRED REPLICAS
crasher-00001   crasher                          1            False   ProgressDeadlineExceeded   0                 

psschwei avatar Apr 14 '22 18:04 psschwei

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jul 14 '22 01:07 github-actions[bot]

/reopen

evankanderson avatar Sep 07 '22 18:09 evankanderson

@evankanderson: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow[bot] avatar Sep 07 '22 18:09 knative-prow[bot]

/remove-lifecycle stale

evankanderson avatar Sep 07 '22 18:09 evankanderson

/assign

nader-ziada avatar Sep 12 '22 20:09 nader-ziada

FYI: https://github.com/kubernetes/kubernetes/issues/106697

dprotaso avatar Sep 13 '22 14:09 dprotaso

My observation is similar to Paul's comment above: the revision gets set to Ready: False right away, and the pod waits out the default 10-minute ProgressDeadline before it gets cleaned up.

I suspect that we should be connecting how long a Configuration takes to give up on a Revision with the ProgressDeadlineSeconds of the Deployment underlying the Revision.

I also wonder if this recent PR, "Fix LatestReadyRevision semantics - it only advances forward", has fixed this edge case, where the Configuration doesn't fail if the latest revision has an issue.

nader-ziada avatar Sep 19 '22 20:09 nader-ziada

/unassign @nader-ziada

dprotaso avatar Dec 05 '22 22:12 dprotaso

/assign itsdarshankumar

dprotaso avatar Jan 20 '23 17:01 dprotaso

@dprotaso: GitHub didn't allow me to assign the following users: itsdarshankumar.

Note that only knative members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign itsdarshankumar

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow[bot] avatar Jan 20 '23 17:01 knative-prow[bot]

hey @dprotaso, I'll take this and try my hand at it

itsdarshankumar avatar Jan 20 '23 18:01 itsdarshankumar

/assign @itsdarshankumar

dprotaso avatar Jan 20 '23 18:01 dprotaso

related: I didn't realize maxUnavailable affects the Available status condition

https://github.com/kubernetes/kubernetes/issues/106697#issuecomment-1369672284

dprotaso avatar Feb 02 '23 15:02 dprotaso

hey @itsdarshankumar are you still working on it?

keshavcodex avatar Feb 26 '23 09:02 keshavcodex

hey @itsdarshankumar are you still working on it?

Yeah, I was working on it, but due to some other work I wasn't able to wrap this up. You may start on this if you wish :)

itsdarshankumar avatar Feb 27 '23 17:02 itsdarshankumar

ok sure

keshavcodex avatar Feb 27 '23 17:02 keshavcodex

/assign @keshavcodex /unassign @itsdarshankumar

dprotaso avatar Mar 01 '23 14:03 dprotaso

Hi @dprotaso, I am new to Knative and have been looking into this issue for many days, but I could not figure out the right direction to solve it. I have reproduced the same error. So far I have observed that:

  1. When we run kubectl describe pods, we get the output shown in the attached image.

The failed readiness probe is the issue: the application exits immediately after completing, and therefore restarts again and again. But I am unable to find the exact files I should look into to solve this. Can you please suggest how I should proceed?

SomilJain0112 avatar Jun 18 '23 20:06 SomilJain0112

@keshavcodex are you still working on this issue?

@SomilJain0112 we expect you to run a container that serves HTTP on $PORT (the default value is 8080). We probe that port to see if it is ready.
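For example, something like this should go Ready quickly (helloworld-go is a stock Knative sample that listens on $PORT):

$ kn service create hello --image gcr.io/knative-samples/helloworld-go
$ curl $(kn service describe hello -o url)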

dprotaso avatar Jun 19 '23 14:06 dprotaso

I created https://github.com/knative/serving/pull/14248 to delete the old revisions, but that's unlikely to be correct. We probably want them still around, just not using up compute. How can I shut down the pods without deleting the entire revision? Please correct me if my work is heading in the wrong direction.
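One way to experiment with shutting the Pods down while keeping the Revision might be marking its PodAutoscaler unreachable (a sketch only; the revision reconciler owns spec.reachability and may well overwrite a manual patch):

$ kubectl patch kpa crasher-00001 --type merge \
    -p '{"spec":{"reachability":"Unreachable"}}'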

msiyaj avatar Aug 09 '23 17:08 msiyaj