serving icon indicating copy to clipboard operation
serving copied to clipboard

Fix deployment status propagation when scaling from zero

Open skonto opened this issue 1 year ago • 5 comments

Fixes #14157

Proposed Changes

  • Introduces a new PA condition (PodAutoscalerConditionScaleTargetScaled) that detects failures during scaling to zero, covering the K8s gaps where deployment status is not updated correctly. The condition is set to false just before we scale down to zero (before the deployment update happens) and if pods are crashing. We set it back to true when we scale from zero and we have enough ready pods.

  • Previously when deployment was scaled down to zero, revision ready status would be true (and stay that way), but with this patch the pod error is detected and propagated:

Ksvc status:

{
    "lastTransitionTime": "2024-10-04T13:57:35Z",
    "message": "Revision \"revision-failure-00001\" failed with message: Back-off pulling image \"index.docker.io/skonto/revisionfailure@sha256:c7dd34a5919877b89617c3a0df7382e7de0f98318f2c12bf4374bb293f104977\".",
    "reason": "RevisionFailed",
    "status": "False",
    "type": "ConfigurationsReady"
},

Revision:

k  get revision
NAME                     CONFIG NAME        GENERATION   READY   REASON             ACTUAL REPLICAS   DESIRED REPLICAS
revision-failure-00001   revision-failure   1            False   ImagePullBackOff   0                 0

PA status:
{
    "lastTransitionTime": "2024-10-04T13:57:35Z",
    "message": "Back-off pulling image \"index.docker.io/skonto/revisionfailure@sha256:c7dd34a5919877b89617c3a0df7382e7de0f98318f2c12bf4374bb293f104977\"",
    "reason": "ImagePullBackOff",
    "status": "False",
    "type": "ScaleTargetScaled"
}
],
  • Updates the pa status propagation logic in the revision reconciler.
  • Extends a bit the resource quota e2e test to show that when deployment is scaled to zero we will still report the error. That is irrelevant to this patch but we want to show that we cover certain scenarios more. Probably it would be good to add more e2e tests anyway.
  • The steps to test is simply start a skvc, let it scale to zero then remove the image from your registry, block any access (kill internet) and then issue a request.

skonto avatar Oct 04 '24 14:10 skonto

Codecov Report

:x: Patch coverage is 38.02817% with 44 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 80.63%. Comparing base (72fdded) to head (5ae57a6). :warning: Report is 200 commits behind head on main.

Files with missing lines Patch % Lines
pkg/reconciler/autoscaling/kpa/scaler.go 26.66% 19 Missing and 3 partials :warning:
pkg/resources/pods.go 0.00% 9 Missing :warning:
pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go 33.33% 4 Missing :warning:
pkg/apis/serving/v1/revision_lifecycle.go 42.85% 3 Missing and 1 partial :warning:
pkg/reconciler/autoscaling/kpa/kpa.go 57.14% 2 Missing and 1 partial :warning:
pkg/testing/functional.go 75.00% 2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15550      +/-   ##
==========================================
- Coverage   80.77%   80.63%   -0.14%     
==========================================
  Files         222      222              
  Lines       18035    18102      +67     
==========================================
+ Hits        14567    14597      +30     
- Misses       3094     3128      +34     
- Partials      374      377       +3     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Oct 04 '24 14:10 codecov[bot]

This Pull Request is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen with /reopen. Mark as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jan 06 '25 01:01 github-actions[bot]

/remove-lifecycle stale

skonto avatar Jan 17 '25 14:01 skonto

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skonto Once this PR has been reviewed and has the lgtm label, please ask for approval from dprotaso. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

knative-prow[bot] avatar Jan 20 '25 13:01 knative-prow[bot]

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow-robot avatar Jan 22 '25 06:01 knative-prow-robot

This Pull Request is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen with /reopen. Mark as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jul 18 '25 01:07 github-actions[bot]