
Defective revision can lead to pods never being removed


In what area(s)?

/area autoscale

What version of Knative?

Reproduced in 1.3, 1.5, and 1.9.

(This reproduces with Istio as networking + ingress, version 1.12.9. Aside from using the operator for the install, the configuration is very vanilla, but I can provide more details if useful.)

Expected Behavior

Deployments that pass their initial progress deadline but contain pods that start to crashloop should eventually be scaled down and removed.

(Note: Scale to zero assumed)

Actual Behavior

If there is buffered traffic for a revision of a Service, and the Service has passed its initial deployment progress deadline, Knative keeps the revision's Deployment alive forever, with no obvious way to scale it down or remove it (leaving the pods in a crashlooping state).

Example encountered in practice: a revision's container is configured with the address of an external resource, such as a database. The Service works with this revision for some time, and then the external resource's address changes (so the pod starts up but the container never serves requests and eventually enters restart loops). A new revision is created to fix this, but if there is any outstanding traffic for the old revision, its defective pods are kept around and never scaled down.

The state of the PodAutoscaler in this instance becomes Ready=Unknown, Reason=Queued, with status messages to the effect of "Requests to the target are being buffered as resources are provisioned".

Removing the service is not a solution because the newest revision is correctly serving traffic.


Steps to Reproduce the Problem

  • Create a container that serves HTTP traffic correctly but ceases to start listening/functioning based on external criteria
    • A simple example is to sleep for 5 seconds and exit before the listener starts if the minute of the current hour is >30 (see the sketch after this list)
  • Create a service + revision for the container
  • Send traffic to the service while the external criteria allows the container to operate
    • Make sure the Service passes its initial deployment progress deadline (~10 minutes)
  • Wait for it to scale back down to zero
  • Send traffic to the service now that the external criteria prevents it starting
  • The deployment will scale up and all the created pods will crashloop
  • Create a new revision that corrects the issue, and drive traffic to the service again
  • The service's new revision will start and serve traffic but the deployment and pods of the old defective revision will stick around with no clear way to remove them
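
For illustration, a reproducer container along these lines could look like the Go program below. This is a sketch under the assumptions described in the steps (not the original reproducer); the port handling and the exact "external criterion" are assumptions:

```go
package main

import (
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	// External criterion: refuse to start during the second half of the hour.
	// Sleep briefly and exit before the listener starts, so the pod crashloops.
	if time.Now().Minute() > 30 {
		log.Println("external criterion not met; exiting before the listener starts")
		time.Sleep(5 * time.Second)
		os.Exit(1)
	}

	// Otherwise, serve HTTP traffic normally.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})

	// Knative injects the PORT environment variable into the user container;
	// fall back to 8080 when running outside Knative.
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
```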

NOTE: You can obviously delete the revision, but this is not a solution for Services that have only a single revision (do we have to delete the entire Service to kill these pods?). This bug is partly a question of whether Knative is actually designed to clean up this scenario, or whether that would rest on a human operator or an additional orchestrator to resolve.

DavidR91 · Feb 06 '23

/triage accepted

dprotaso · Feb 09 '23

/assign

jsanin-vmw · Feb 15 '23

/unassign

jsanin-vmw · May 23 '23

/assign

jsanin-vmw · Oct 30 '23

PR 14573 aims to fix this issue.

The proposed fix is based on the TimeoutSeconds field in the Revision. Once timeoutSeconds has elapsed, there should not be any pending requests in the activator, and the Unreachable revision can scale down with no risk of requests going unprocessed.

The default value for timeoutSeconds is 300, so the pods of the failing revision will only scale down after that much time has passed. TimeoutSeconds can of course be changed.
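
As a rough illustration of that idea (a simplified sketch, not the actual code from PR 14573; the helper name and its inputs are hypothetical), an Unreachable revision is only allowed to scale to zero once timeoutSeconds has elapsed since the activator last buffered a request for it:

```go
package main

import (
	"fmt"
	"time"
)

// canScaleDown is a hypothetical helper sketching the idea behind the fix:
// an Unreachable revision may only scale to zero once timeoutSeconds have
// elapsed since the activator last buffered a request for it, because by then
// no request can still be waiting for the revision to come up.
func canScaleDown(unreachable bool, lastBufferedRequest time.Time, timeoutSeconds int64, now time.Time) bool {
	if !unreachable {
		// Reachable revisions keep following the normal scale-to-zero path.
		return false
	}
	return now.Sub(lastBufferedRequest) > time.Duration(timeoutSeconds)*time.Second
}

func main() {
	lastBuffered := time.Now().Add(-6 * time.Minute)
	// With the default timeoutSeconds of 300 (5 minutes), a defective revision
	// whose last buffered request is 6 minutes old may now be scaled down.
	fmt.Println(canScaleDown(true, lastBuffered, 300, time.Now())) // prints: true
}
```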

jsanin · Feb 01 '24

@DavidR91 I've been trying to reproduce this issue to verify the proposed fix, but in my testing I'm seeing the revision pod scale down once the activator times out the request.

Do you have a consistent way to trigger this issue? Can you confirm requests are being timed out by the activator?

dprotaso · Feb 17 '24

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] · May 21 '24

Closing this out due to lack of user input.

dprotaso · May 21 '24