Ensure ContainerHealthy condition is set back to True
Fixes #15487
Proposed Changes
This changes the Revision reconciler to add a code path that flips the ContainerHealthy condition from False back to True, since the old code path is no longer active (see the linked issue). The criterion chosen is that the deployment has replicas and all of them are ready.
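For illustration only, the criterion boils down to a check along these lines (a minimal sketch, not the actual diff in this PR; the helper name is made up):

```go
package sketch

import appsv1 "k8s.io/api/apps/v1"

// allReplicasReady captures the criterion described above: the deployment has a
// non-zero desired replica count and every one of those replicas reports ready.
// The reconciler would flip ContainerHealthy back to True when this holds.
func allReplicasReady(d *appsv1.Deployment) bool {
	return d.Spec.Replicas != nil &&
		*d.Spec.Replicas > 0 &&
		d.Status.ReadyReplicas == *d.Spec.Replicas
}
```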
Release Note
A revision is now set to ContainerHealthy=True when all replicas of a deployment are ready
Hi @SaschaSchwarze0. Thanks for your PR.
I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
cc @dprotaso for review.
Codecov Report
Attention: Patch coverage is 80.64516% with 6 lines in your changes missing coverage. Please review.
Project coverage is 80.82%. Comparing base (df7f168) to head (8460f72). Report is 15 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/testing/functional.go | 62.50% | 5 Missing and 1 partial :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main   #15503      +/-   ##
==========================================
- Coverage   80.84%   80.82%   -0.03%
==========================================
  Files         222      222
  Lines       18035    18063      +28
==========================================
+ Hits        14581    14600      +19
- Misses       3085     3092       +7
- Partials      369      371       +2
```
@dprotaso gentle ping, any objections on this one? It seems ok to me.
I updated the PR based on the discussion in https://cloud-native.slack.com/archives/C04LMU0AX60/p1733333864902479.
@SaschaSchwarze0 Are you able to address the comments and wrap up the PR? We are looking forward to this fix.
@SaschaSchwarze0 -- would you like some help getting this over the finish line? I can open a PR against or commit to your branch if you won't have time to complete it; this is apparently also causing pain for @houshengbo and @yuzisun.
I really hope this PR can be approved, since this bug has been causing issues in our environment on a weekly basis over the past months. It would be good if the fix could be backported to the 1.15 and 1.16 branches.
/lgtm
@dprotaso gentle ping. @houshengbo the goal is to get this in for 1.17, which is coming soon, but we can backport as well.
I am now OK with the PR, BTW. In terms of how we decide whether the revision or service is in good health, it is OK to check all the pods. We used to run into situations where some pods were up but others were not due to node issues. It is not correct to mark the service or revision as healthy in that situation.
Hey, FYI - I'm getting to this PR; just trying to unblock dependency updates for the release first.
@evankanderson @skonto @dprotaso I had read most of the comments, but did not have the time to draw a conclusion from them so far. Now, having read through all of them, I still think there needs to be a decision on the direction we want to go. I think there are two options (and the current PR is a mixture of both).
(1) Mark ContainerHealthy true when all pods are healthy. This means one would need to loop through all pods as the PR currently does, and one should set ContainerHealthy to true when `spec.replicas > 0 && spec.replicas == status.readyReplicas` (I think I had this at some point, but we changed it). The disadvantage is that one needs to retrieve all pods because there is no caching, though retrieving all pods has always been done, so that would not be a change.
(2) Mark ContainerHealthy true when one pod is healthy. This means the condition can stay at `status.readyReplicas > 0` the way it currently is. We can then revert the logic changes that check all pods and instead only look at one. We can optimize the List call to filter on `status.phase==Running` (because I do not think we care about Pending pods) and set `Limit=1` so that the API only returns one pod even if there are many.
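To make option (2) concrete, the narrowed List call could look roughly like this; this is a sketch of the option as proposed here, not the code that was finally merged, and `client`, `ns`, and `selector` are assumed to come from the reconciler:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// anyRunningPod lists at most one Running pod matching the revision's label
// selector. The field selector skips Pending pods; note that the review below
// argues Pending pods actually matter (image pull failures).
func anyRunningPod(ctx context.Context, client kubernetes.Interface, ns, selector string) (bool, error) {
	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		LabelSelector: selector,
		FieldSelector: "status.phase==Running",
		Limit:         1,
	})
	if err != nil {
		return false, err
	}
	return len(pods.Items) > 0, nil
}
```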
My vote would be (2). ContainerHealthy would then most likely only be set to False for those "permanent" errors and not for things like an OOM that happens once per day assuming the KSvc's scale is >1.
Tentatively I would vote for option (2) given current semantics. Here are my thoughts:
As a side note, independently of this PR, one thing we don't consider is the number of min replicas ready when characterizing a revision as healthy or not. For example, in a scenario where minScale=2 (i.e. greater than 1) and one pod keeps exiting, what I observed is that the revision remains healthy and accepts traffic. However, when we initially create the revision, it will never become ready if we don't achieve the desired minScale. Is that ok? I don't think we have a strictly defined health condition for the revision.
For example, should achieving minScale=N be a strict requirement at all times, or should the requirement be the value of deployment.Spec.Replicas (as set dynamically by the KPA)? If the former were the case, then we would need to wait for deployment.Status.ReadyReplicas == minScale in this PR before resetting ContainerHealthy.
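If the stricter minScale interpretation were chosen, a hypothetical check could look like this; purely a sketch under the assumption that minScale is read from the revision's autoscaling annotation, with names and defaulting made up for illustration:

```go
package sketch

import (
	"strconv"

	appsv1 "k8s.io/api/apps/v1"
)

// readyAtMinScale is the stricter variant sketched above: only consider the
// revision recovered once ready replicas reach its minScale (default 1).
func readyAtMinScale(d *appsv1.Deployment, revAnnotations map[string]string) bool {
	minScale := int32(1)
	if v, ok := revAnnotations["autoscaling.knative.dev/min-scale"]; ok {
		if n, err := strconv.ParseInt(v, 10, 32); err == nil && n > 0 {
			minScale = int32(n)
		}
	}
	return d.Status.ReadyReplicas >= minScale
}
```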
In any case, the scope of this PR is the bug of never setting the condition back to True, and I think we should only consider the fix for that bug here, given current Knative semantics.
We have: `*deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0`
ContainerHealthy was set to false due to some permanent error we detected where all pods are down, and thus the revision becomes not ready. We reset the condition once we know that the permanent error has been fixed. To be precise, we should only care about that class of errors here; we don't detect errors that brought only some pods down (that was the goal in the past, I guess: only permanent errors).
When we decide to reset ContainerHealthy, if the error is not fixed on some nodes, I feel that, although inconvenient for the admin, it should not make Knative unavailable. There is the activator, which queues requests until capacity catches up. The same holds for the semantics we have with minScale when some pod fails. So I would vote for (2) for the scope of this bug.
In the future, though, we might want to allow flexibility in what is considered healthy, for better UX and operations, similarly to what is done in https://github.com/knative/serving/pull/15397.
For example, one definition of a healthy service covering different personas is:
Service is "Healthy" if:
- Healthy Pods >= Minimum Scale
- A majority of the pods or all pods are healthy given current desired scale
- Error rate or latency thresholds are not violated.
The first two points might be useful for the admin offering Knative as a platform. Usually, users of the service only care about the SLOs in the last point. Knative could produce the right signal in the revision status to serve the personas above, especially the admin. External tooling could also detect the status and report this info, but it can't handle/manage the traffic.
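Purely as a thought experiment (not part of this PR), the first two bullets could be evaluated along these lines; all names here are made up, and the SLO bullet is left out because it depends on external metrics:

```go
package sketch

import appsv1 "k8s.io/api/apps/v1"

// revisionHealthy sketches the first two bullets above: ready pods must cover
// minScale and form a majority of the current desired scale.
func revisionHealthy(d *appsv1.Deployment, minScale int32) bool {
	desired := int32(0)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	ready := d.Status.ReadyReplicas
	return ready >= minScale && (desired == 0 || 2*ready > desired)
}
```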
Note here that when we recover from a bad state, e.g. deployment.Status.AvailableReplicas == 0, it does not necessarily mean we can't serve users without violating their SLOs while we catch up; it depends on the traffic. It can be the case, though, that until we fully recover we might not be able to serve them, and thus from their point of view the service is still not healthy.
@evankanderson @dprotaso any additional opinions on https://github.com/knative/serving/pull/15503#issuecomment-2586848454, or should I go ahead and change the implementation to option (2)?
@SaschaSchwarze0 let's go with option 2.
> (because I do not think we care about Pending pods)
I think we do - since that's how we would know about image pull failures
Done.
/lgtm /approve /cherry-pick release-1.16
@dprotaso: once the present PR merges, I will cherry-pick it on top of release-1.16 in a new PR and assign it to you.
/cherry-pick release-1.15
@dprotaso: once the present PR merges, I will cherry-pick it on top of release-1.15 in a new PR and assign it to you.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dprotaso, SaschaSchwarze0
@dprotaso: new pull request created: #15712
@dprotaso: new pull request created: #15713
Thanks @SaschaSchwarze0 and @dprotaso !!