serving icon indicating copy to clipboard operation
serving copied to clipboard

Fix crash-looping pods take a long time to terminate/clean

Open andrew-delph opened this issue 1 year ago • 10 comments

Pods that instantly crash do no scale to 0 until progress deadline is called for the deployment.

Proposed Changes

If pa is not ready and unreachable, It is marked inactive instead of queued. This will scale the deployment to 0 even if metrics cannot be retrieved.

andrew-delph avatar Nov 09 '23 18:11 andrew-delph

Hi @andrew-delph. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow[bot] avatar Nov 09 '23 18:11 knative-prow[bot]

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andrew-delph Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

knative-prow[bot] avatar Nov 09 '23 18:11 knative-prow[bot]

/test istio-latest-no-mesh

andrew-delph avatar Nov 14 '23 20:11 andrew-delph

8:41:40PM: ^ Pending: ErrImagePull (message: rpc error: code = Unknown desc = failed to pull and unpack image "quay.io/jetstack/cert-manager-webhook:v1.13.0": failed to copy: httpReadSeeker: failed open: unexpected status code https://quay.io/v2/jetstack/cert-manager-webhook/manifests/sha256:9f9bda751112262bbe0c0d55e8d06f0fc558870535e063f0c065d632199467f2: 504 Gateway Time-out)

This is definitely not related to the changes.

/test istio-latest-no-mesh

ReToCode avatar Nov 15 '23 15:11 ReToCode

Codecov Report

Attention: Patch coverage is 87.50000% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 86.06%. Comparing base (72f91e5) to head (1f2944d). Report is 25 commits behind head on main.

:exclamation: Current head 1f2944d differs from pull request most recent head fa32dee. Consider uploading reports for the commit fa32dee to get more accurate results

Files Patch % Lines
pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go 0.00% 4 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14607      +/-   ##
==========================================
+ Coverage   84.20%   86.06%   +1.85%     
==========================================
  Files         213      197      -16     
  Lines       16633    14936    -1697     
==========================================
- Hits        14006    12854    -1152     
+ Misses       2280     1774     -506     
+ Partials      347      308      -39     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Nov 15 '23 15:11 codecov[bot]

/retest

andrew-delph avatar Nov 18 '23 17:11 andrew-delph

/retest

andrew-delph avatar Nov 20 '23 20:11 andrew-delph

/ok-to-test /retest

dprotaso avatar Nov 22 '23 16:11 dprotaso

/hold

I want to fix https://github.com/knative/serving/issues/14660 prior

dprotaso avatar Jan 12 '24 19:01 dprotaso

I think adding a e2e test for this would be good. I plan on doing it soon

andrew-delph avatar Jan 19 '24 16:01 andrew-delph

@andrew-delph: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
contour-latest_serving_main be3ec907cad9fba8eef70cb67241c93698a3b45e link true /test contour-latest
gateway-api-latest_serving_main be3ec907cad9fba8eef70cb67241c93698a3b45e link true /test gateway-api-latest
kourier-stable_serving_main be3ec907cad9fba8eef70cb67241c93698a3b45e link true /test kourier-stable
kourier-stable-tls_serving_main be3ec907cad9fba8eef70cb67241c93698a3b45e link true /test kourier-stable-tls
contour-tls_serving_main be3ec907cad9fba8eef70cb67241c93698a3b45e link true /test contour-tls
https_serving_main be3ec907cad9fba8eef70cb67241c93698a3b45e link false /test https
gateway-api-latest-and-contour_serving_main be3ec907cad9fba8eef70cb67241c93698a3b45e link false /test gateway-api-latest-and-contour

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

knative-prow[bot] avatar Mar 15 '24 13:03 knative-prow[bot]

I'm currently getting an issue when the progress deadline is reached but looking it.

andrew-delph avatar Mar 20 '24 15:03 andrew-delph