[wip] If the deployment is never available, propagate the container message
Fixes https://github.com/knative/serving/issues/14157
Proposed Changes
- This propagates the container's status message when the deployment never reaches availability and keeps reporting:
{
"lastTransitionTime": "2024-01-23T20:05:38Z",
"message": "Deployment does not have minimum availability.",
"reason": "MinimumReplicasUnavailable",
"status": "False",
"type": "Ready"
}
Then, when the deployment is scaled back to zero, the ksvc will have the conditions below (a minimal Go sketch of the propagation idea follows them):
"conditions": [
{
"lastTransitionTime": "2024-04-02T09:18:31Z",
"message": "Revision \"helloworld-go-00001\" failed with message: Back-off pulling image \"index.docker.io/skonto/helloworld-go@sha256:dd20d7659c16bdc58c09740a543ef3c36b7c04742a2b6b280a30c2a76dcf6c09\".",
"reason": "RevisionFailed",
"status": "False",
"type": "ConfigurationsReady"
},
{
"lastTransitionTime": "2024-04-02T09:14:44Z",
"message": "Revision \"helloworld-go-00001\" failed to become ready.",
"reason": "RevisionMissing",
"status": "False",
"type": "Ready"
},
{
"lastTransitionTime": "2024-04-02T09:14:44Z",
"message": "Revision \"helloworld-go-00001\" failed to become ready.",
"reason": "RevisionMissing",
"status": "False",
"type": "RoutesReady"
}
],
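For illustration, a minimal Go sketch of the idea (the helper name and how it is wired into pkg/reconciler/revision/reconcile_resources.go are assumptions, not the exact diff of this PR): when the deployment stays unavailable, read the pods' container statuses and surface the first waiting message, e.g. the image pull back-off above.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
)

// waitingContainerMessage returns the message of the first container that is
// stuck in a Waiting state (e.g. ImagePullBackOff), or "" if none is waiting.
// Hypothetical helper: the PR wires something similar into the revision
// reconciler so the message ends up on the revision/ksvc conditions.
func waitingContainerMessage(pods []corev1.Pod) string {
	for _, p := range pods {
		for _, cs := range p.Status.ContainerStatuses {
			if cs.State.Waiting != nil && cs.State.Waiting.Message != "" {
				return cs.State.Waiting.Message
			}
		}
	}
	return ""
}
```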
- This changes the initial state from Unknown to "failed" (False) even for normal ksvcs; however, once the revision is up, this is cleared at the configuration level as well. The idea for this initial state comes from the fact that K8s readiness probes also consider a probe failed until initialDelaySeconds passes. Right now, when a ksvc is deployed, we have the following at the beginning of its lifecycle:
"conditions": [
{
"lastTransitionTime": "2024-04-02T10:34:55Z",
"status": "Unknown",
"type": "ConfigurationsReady"
},
{
"lastTransitionTime": "2024-04-02T10:34:55Z",
"message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
"reason": "RevisionMissing",
"status": "Unknown",
"type": "Ready"
},
{
"lastTransitionTime": "2024-04-02T10:34:55Z",
"message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
"reason": "RevisionMissing",
"status": "Unknown",
"type": "RoutesReady"
}
]
With this PR we start with a failing status until it is cleared (a minimal sketch of the change follows these conditions):
"conditions": [
{
"lastTransitionTime": "2024-04-02T10:31:01Z",
"message": "Revision \"helloworld-go-00001\" failed with message: .",
"reason": "RevisionFailed",
"status": "False",
"type": "ConfigurationsReady"
},
{
"lastTransitionTime": "2024-04-02T10:31:01Z",
"message": "Configuration \"helloworld-go\" does not have any ready Revision.",
"reason": "RevisionMissing",
"status": "False",
"type": "Ready"
},
{
"lastTransitionTime": "2024-04-02T10:31:01Z",
"message": "Configuration \"helloworld-go\" does not have any ready Revision.",
"reason": "RevisionMissing",
"status": "False",
"type": "RoutesReady"
}
]
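A rough sketch of what the change amounts to at the condition-set level, assuming knative.dev/pkg/apis; the function name and the exact call site in the configuration lifecycle code are illustrative, not the diff of this PR:

```go
package example

import (
	"knative.dev/pkg/apis"
	duckv1 "knative.dev/pkg/apis/duck/v1"
)

var condSet = apis.NewLivingConditionSet()

// markNoReadyRevision illustrates the change: instead of leaving Ready as
// Unknown/RevisionMissing while waiting, the status starts out False until a
// Revision actually becomes ready.
func markNoReadyRevision(status *duckv1.Status, configName string) {
	// Before: condSet.Manage(status).MarkUnknown(apis.ConditionReady,
	//     "RevisionMissing", "Configuration %q is waiting for a Revision to become ready.", configName)
	condSet.Manage(status).MarkFalse(apis.ConditionReady,
		"RevisionMissing", "Configuration %q does not have any ready Revision.", configName)
}
```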
- To reproduce the initial issue with minikube you can use the following:
- apply a ksvc
- wait for a pod to come up and then ksvc to scale to zero
- run minikube image list and then minikube image rm so that the user container's image is no longer available within minikube
- block any internet access so the image can't be pulled
- issue a request via curl ...
- observe the revision and deployment statuses
- When the issue is resolved the next request will clear the status messages:
"conditions": [
{
"lastTransitionTime": "2024-04-02T09:37:03Z",
"status": "True",
"type": "ConfigurationsReady"
},
{
"lastTransitionTime": "2024-04-02T09:37:03Z",
"status": "True",
"type": "Ready"
},
{
"lastTransitionTime": "2024-04-02T09:37:03Z",
"status": "True",
"type": "RoutesReady"
}
]
Codecov Report
Attention: Patch coverage is 48.00000% with 13 lines in your changes missing coverage. Please review.
Project coverage is 84.68%. Comparing base (bb51203) to head (fababbe).
| Files | Patch % | Lines |
|---|---|---|
| pkg/apis/serving/v1/revision_lifecycle.go | 0.00% | 11 Missing :warning: |
| pkg/reconciler/revision/reconcile_resources.go | 81.81% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #14835 +/- ##
==========================================
- Coverage 84.75% 84.68% -0.08%
==========================================
Files 218 218
Lines 13482 13502 +20
==========================================
+ Hits 11427 11434 +7
- Misses 1688 1700 +12
- Partials 367 368 +1
The earlier concern with using Available is that it changes while the revision is scaling, i.e. Available=False when scaling up until all the pods are ready.
We don't want to mark the revision as ready=false while scaling is occurring. I haven't dug into the code changes in the PR yet, but how do we handle that scenario?
> We don't want to mark the revision as ready=false while scaling is occurring.
My goal is to only touch the status when ProgressDeadline does not seem to work (https://github.com/kubernetes/kubernetes/issues/106054). That means the change only applies when *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 holds and a container is stuck waiting, which means nothing is actually ready in that case (sketched below). Also, I only set the revision availability to false if it is not false already; otherwise I don't touch it.
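Roughly, a sketch of that guard (the function name is hypothetical, and MarkResourcesAvailableFalse stands in for whatever status-marking helper the change uses in revision_lifecycle.go):

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"

	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// propagateDeploymentFailure only acts when the Deployment wants replicas but
// none ever became available, and it leaves the condition alone if it is
// already False.
func propagateDeploymentFailure(rev *v1.Revision, d *appsv1.Deployment, containerMsg string) {
	if d.Spec.Replicas == nil || *d.Spec.Replicas == 0 || d.Status.AvailableReplicas > 0 {
		return // not the "wants replicas but none available" case described above
	}
	if cond := rev.Status.GetCondition(v1.RevisionConditionResourcesAvailable); cond != nil && cond.IsFalse() {
		return // already False, do not churn lastTransitionTime
	}
	rev.Status.MarkResourcesAvailableFalse("MinimumReplicasUnavailable", containerMsg)
}
```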
Yeah, ProgressDeadline only seems to apply when doing a rollout from one ReplicaSet to another.
I discovered that here: https://github.com/kubernetes/kubernetes/issues/106697
/test istio-latest-no-mesh
infra /test istio-latest-no-mesh
@dprotaso @ReToCode gentle ping
hey @skonto are you still working on this?
> hey @skonto are you still working on this?
@dprotaso yes I will take a look to update it based on your comment here: https://github.com/knative/serving/pull/14835#discussion_r1496775999.
@dprotaso could you please take a look at the current fix? It is quite unfortunate that ProgressDeadline does not cover this, because the fix belongs in K8s, not here, imho. :shrug: Btw, it is hard to follow in general how resources are updated. The reason is that controllers often have more than one resource that they create/manage, and then we rely on triggering reconciliation on every resource that changes, hopefully reaching the ksvc status in order to update it at some point. The graph is basically more or less like this (omitting SKS):
ksvc <-> config <-> {route, revision}, revision <-> {pa, certificate, deployment}
I am wondering if we could simplify this further.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: skonto. Once this PR has been reviewed and has the lgtm label, please ask for approval from dprotaso. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
- ~~OWNERS~~ [skonto]
- pkg/apis/OWNERS
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Closing and reopening to trigger new GitHub Actions runs.
/retest
Looks like the failures are legit in the api package
/hold for release tomorrow (if prow works ;) )
@skonto: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| istio-latest-no-mesh_serving_main | 3b7524c03e864afd7e056ddd7b6ec883049b77aa | link | true | /test istio-latest-no-mesh |
It needs more work.
@skonto: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| certmanager-integration-tests_serving_main | 3b7524c03e864afd7e056ddd7b6ec883049b77aa | link | true | /test certmanager-integration-tests |
@dprotaso please review. This probably needs a release note as well.
Testing this out, it seems like regular cold starts cause the revision/config/ksvc to all become failed (ready=false), and then once the pod is running it flips back to ready=true. Which I don't think is ideal.
I think what we really want is to signal when it's failed to scale, but only after the ProgressDeadline. Maybe that happens as a separate status (warning?) condition; unsure.
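Purely as an illustration of the "separate warning condition" idea (nothing like this is in the PR, and the condition type name below is made up), knative.dev/pkg/apis supports warning-severity conditions that do not feed into the top-level Ready condition:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"

	"knative.dev/pkg/apis"
)

// scaleStuckWarning sketches a hypothetical warning-severity condition that
// could flag "failed to scale after the progress deadline" without flipping
// the revision's Ready condition to False.
var scaleStuckWarning = apis.Condition{
	Type:     "ScaleTargetNotAchieved", // hypothetical condition type
	Status:   corev1.ConditionFalse,
	Severity: apis.ConditionSeverityWarning,
	Reason:   "ProgressDeadlineExceeded",
	Message:  "The Deployment did not reach its desired replica count within the progress deadline.",
}
```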
> Which I don't think is ideal.
It changes the behaviour by design; as stated in the description, it starts with False. Anyway, I will check if we can have something less intrusive.
> I think what we really want is to signal when it's failed to scale, but only after the ProgressDeadline.
Probably we could do that on the scaler side, when activating and before we scale down to zero, by using the pod accessor and inspecting pods directly, I suspect. It should be less intrusive.
closing in favor of #15326