serving icon indicating copy to clipboard operation
serving copied to clipboard

[wip] If deployment is never available propagate the container msg

Open skonto opened this issue 1 year ago • 18 comments

Fixes https://github.com/knative/serving/issues/14157

Proposed Changes

  • This propagates the msg of the container t when deployment never reaches availability and keeps having:
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "Ready"
        }

Then the ksvc when the deployment is scaled back to zero will have:


           "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T09:18:31Z",
                        "message": "Revision \"helloworld-go-00001\" failed with message: Back-off pulling image \"index.docker.io/skonto/helloworld-go@sha256:dd20d7659c16bdc58c09740a543ef3c36b7c04742a2b6b280a30c2a76dcf6c09\".",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T09:14:44Z",
                        "message": "Revision \"helloworld-go-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T09:14:44Z",
                        "message": "Revision \"helloworld-go-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
                ],


  • This changes the initial state from unknown to "failed" even for normal ksvcs, however once we are up this is cleared at the configuration level as well. The idea for this initial state comes from the fact that K8s for readiness probes also considers a probe failed until InitialDelaySeconds pass. Right now we have when a ksvc is deployed, at the beginning of it lifecycle:
                "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "status": "Unknown",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
                        "reason": "RevisionMissing",
                        "status": "Unknown",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
                        "reason": "RevisionMissing",
                        "status": "Unknown",
                        "type": "RoutesReady"
                    }

With this PR we start with a failing status until it is cleared:

                "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Revision \"helloworld-go-00001\" failed with message: .",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Configuration \"helloworld-go\" does not have any ready Revision.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Configuration \"helloworld-go\" does not have any ready Revision.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
  • To reproduce the initial issue with minikube you can use the following:
    • apply a ksvc
    • wait for a pod to come up and then ksvc to scale to zero
    • minikube image list and then minikube image rm so that no image is available within minikube for the user container.
    • block any internet access so image can't be pulled
    • issue a request via curl ...
    • observe the revision and deployment statuses
  • When the issue is resolved the next request will clear the status messages:
               "conditions": [
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "ConfigurationsReady"
                   },
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "Ready"
                   },
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "RoutesReady"
                   }

skonto avatar Jan 24 '24 14:01 skonto

Codecov Report

Attention: Patch coverage is 48.00000% with 13 lines in your changes missing coverage. Please review.

Project coverage is 84.68%. Comparing base (bb51203) to head (fababbe).

Files Patch % Lines
pkg/apis/serving/v1/revision_lifecycle.go 0.00% 11 Missing :warning:
pkg/reconciler/revision/reconcile_resources.go 81.81% 2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14835      +/-   ##
==========================================
- Coverage   84.75%   84.68%   -0.08%     
==========================================
  Files         218      218              
  Lines       13482    13502      +20     
==========================================
+ Hits        11427    11434       +7     
- Misses       1688     1700      +12     
- Partials      367      368       +1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jan 24 '24 14:01 codecov[bot]

The earlier concerns with using Available is that it changes when the revision is scaling. Thus Available=False when scaling up until all the pods are ready.

We don't want to mark the revision as ready=false when scaling is occuring. I haven't dug into the code changes in the PR yet but how do we handle that scenario?

dprotaso avatar Jan 25 '24 14:01 dprotaso

We don't want to mark the revision as ready=false when scaling is occuring.

My goal is to only touch status when progressdeadline does not seems to work (https://github.com/kubernetes/kubernetes/issues/106054). That means only apply when if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 holds and there is some waiting happening which means in that case nothing is ready actually. Also I only set the revision availability to false if it is not false already, otherwise I don't touch it.

skonto avatar Jan 25 '24 14:01 skonto

Yeah ProgressDeadline only seems to apply when doing a rollout from one replicaset to another one

dprotaso avatar Jan 25 '24 14:01 dprotaso

I discovered that here https://github.com/kubernetes/kubernetes/issues/106697

dprotaso avatar Jan 25 '24 15:01 dprotaso

/test istio-latest-no-mesh

skonto avatar Feb 02 '24 14:02 skonto

infra /test istio-latest-no-mesh

skonto avatar Feb 05 '24 09:02 skonto

@dprotaso @ReToCode gentle ping

skonto avatar Feb 05 '24 09:02 skonto

hey @skonto are you still working on this?

dprotaso avatar Mar 26 '24 19:03 dprotaso

hey @skonto are you still working on this?

@dprotaso yes I will take a look to update it based on your comment here: https://github.com/knative/serving/pull/14835#discussion_r1496775999.

skonto avatar Mar 28 '24 13:03 skonto

@dprotaso could you please take a look at the current fix? It is quite unfortunate that progress deadline does not cover this because the fix belongs to K8s not here imho. :shrug: Btw it is hard to follow in general how resources are updated. Reason is controllers often have more than one resource that they create/manage and then we rely on triggering reconciliation on every resource that changes, hopefully reaching the ksvc status in order to update it at some point. The graph is basically more or less like this (omitting SKS):

ksvc <-> config <->{route, revision} revision <-> {pa, certificate, deployment} I am wondering if we could simplify this further.

skonto avatar Apr 02 '24 17:04 skonto

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skonto Once this PR has been reviewed and has the lgtm label, please ask for approval from dprotaso. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

knative-prow[bot] avatar Apr 03 '24 07:04 knative-prow[bot]

closing reopening to trigger new github actions to run

dprotaso avatar Apr 22 '24 14:04 dprotaso

/retest

dprotaso avatar Apr 23 '24 01:04 dprotaso

Looks like the failures are legit in the api package

dprotaso avatar Apr 23 '24 02:04 dprotaso

/hold for release tomorrow (if prow works ;) )

dprotaso avatar Apr 23 '24 02:04 dprotaso

@skonto: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
istio-latest-no-mesh_serving_main 3b7524c03e864afd7e056ddd7b6ec883049b77aa link true /test istio-latest-no-mesh

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

knative-prow[bot] avatar Apr 23 '24 02:04 knative-prow[bot]

It needs more work.

skonto avatar Apr 26 '24 07:04 skonto

@skonto: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
certmanager-integration-tests_serving_main 3b7524c03e864afd7e056ddd7b6ec883049b77aa link true /test certmanager-integration-tests

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

knative-prow[bot] avatar May 27 '24 18:05 knative-prow[bot]

@dprotaso pls review. Probably this needs a release comment.

skonto avatar Jun 04 '24 12:06 skonto

Testing this out it seems like regular cold starts cause revision/config/ksvc to all become failed (ready=false) and then once the pod is running it flips back to ready=true. Which I don't think is ideal.

I think what we really want is to signal when it's failed to scale but only after the progressdeadline. Maybe that happens as a separate status (warning?) condition unsure.

dprotaso avatar Jun 06 '24 21:06 dprotaso

Which I don't think is ideal.

It changes the behaviour by design, as stated in the description it starts with false. Anyway I will check if we can have something less intrusive.

I think what we really want is to signal when it's failed to scale but only after the progressdeadline.

Probably we could do that at the scaler side when activating and before we scale down to zero by using the pod accessor and inspecting pods directly I suspect. It should be less intrusive.

skonto avatar Jun 07 '24 20:06 skonto

closing in favor of #15326

skonto avatar Jun 13 '24 09:06 skonto