Revision is set to True only when the deployment has the minimum number of replicas
Fixes #15889
Proposed Changes
- The Revision is set to healthy when the deployment has the minimum number of replicas running and ready.
Release Note
The Revision is set to healthy when the deployment has the minimum number of replicas running and ready.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: houshengbo. Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Code Review Process.
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 80.95%. Comparing base (bbf34f6) to head (797b64c).
:warning: Report is 134 commits behind head on main.
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main   #15890      +/-   ##
==========================================
+ Coverage   80.93%   80.95%   +0.01%
==========================================
  Files         210      210
  Lines       16731    16731
==========================================
+ Hits        13542    13545       +3
+ Misses       2839     2836       -3
  Partials      350      350
```
If min-scale is involved then I think the issue might be in a different place - more specifically the PodAutoscaler.
For example PodAutoscalerConditionScaleTargetInitialized should only be 'True' when we've reached min-scale.
https://github.com/knative/serving/blob/8a39d5e7020aa99d885a3333c33e641fcc4e31b0/pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go#L33-L38
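A minimal sketch of that expectation, using placeholder names (PAStatus, maybeMarkInitialized, readyReplicas, minScale) rather than the actual pa_lifecycle.go API: the condition only flips to True once the number of ready replicas has reached min-scale.

```go
package main

import "fmt"

// PAStatus is a stand-in for the PodAutoscaler status; the real type lives in
// pkg/apis/autoscaling/v1alpha1 and uses condition sets rather than a bool.
type PAStatus struct {
	ScaleTargetInitialized bool
}

// maybeMarkInitialized only marks the scale target initialized once the
// ready replica count has reached min-scale.
func maybeMarkInitialized(s *PAStatus, readyReplicas, minScale int) {
	if readyReplicas >= minScale {
		s.ScaleTargetInitialized = true
	}
}

func main() {
	s := &PAStatus{}
	maybeMarkInitialized(s, 1, 3) // only 1 of min-scale 3 ready: stays false
	fmt.Println(s.ScaleTargetInitialized)
	maybeMarkInitialized(s, 3, 3) // min-scale reached: flips to true
	fmt.Println(s.ScaleTargetInitialized)
}
```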
We mark the revision ready here
https://github.com/knative/serving/blob/8a39d5e7020aa99d885a3333c33e641fcc4e31b0/pkg/apis/serving/v1/revision_lifecycle.go#L196-L203
Another thing we should do is update the e2e test to surface the issue you found - eg. fail if Ready=True but ReplicaReadyCount < MinScale
https://github.com/knative/serving/blob/main/test/e2e/minscale_readiness_test.go
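As a rough illustration of that guard, here is a self-contained sketch; the types and helper (revisionSnapshot, assertNotPrematurelyReady) are placeholders, not the actual helpers in minscale_readiness_test.go.

```go
package minscale

import "testing"

// revisionSnapshot is a placeholder for whatever the e2e test observes about
// the revision at a point in time.
type revisionSnapshot struct {
	Ready         bool
	ReadyReplicas int
	MinScale      int
}

// assertNotPrematurelyReady fails the test if the revision reports Ready=True
// before its ready replica count has reached min-scale.
func assertNotPrematurelyReady(t *testing.T, rev revisionSnapshot) {
	t.Helper()
	if rev.Ready && rev.ReadyReplicas < rev.MinScale {
		t.Fatalf("revision marked Ready with only %d/%d ready replicas",
			rev.ReadyReplicas, rev.MinScale)
	}
}
```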
@dprotaso Do we allow this situation to happen? The Revision is set to Ready=True even if there is only one pod and minscale is larger than one.
If the revision is set to Ready=True, I guess traffic is assigned and shifted to the new revision, and the old revision starts to terminate (this is how we ran into the issue). If the deployment is running with fewer pods than minscale, revision availability should NOT be set to true.
I will check on the podautoscaler part and the test cases.
Revision is set to true even if there is only one pod, even if the minscale is larger than one.
No - because I would expect PodAutoscalerConditionScaleTargetInitialized to be False until minReady=minScale
When that condition is False, the PodAutoscaler's Ready condition should be False as well, because it is a child condition of that LivingConditionSet.
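To illustrate the child-condition relationship being described, here is a self-contained mock (not knative.dev/pkg's apis.ConditionSet): Ready is derived from its dependent conditions, so a False ScaleTargetInitialized forces Ready to False.

```go
package main

import "fmt"

type condStatus string

const (
	condTrue    condStatus = "True"
	condFalse   condStatus = "False"
	condUnknown condStatus = "Unknown"
)

// readyFrom mirrors how a living condition set rolls its child conditions up
// into Ready: any False child forces Ready=False; any Unknown child keeps
// Ready at Unknown.
func readyFrom(children map[string]condStatus) condStatus {
	ready := condTrue
	for _, s := range children {
		switch s {
		case condFalse:
			return condFalse
		case condUnknown:
			ready = condUnknown
		}
	}
	return ready
}

func main() {
	fmt.Println(readyFrom(map[string]condStatus{
		"Active":                 condTrue,
		"SKSReady":               condTrue,
		"ScaleTargetInitialized": condFalse, // min-scale not reached
	})) // prints: False
}
```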
rev.Status.MarkContainerHealthyTrue()
This line was added to pkg/reconciler/revision/reconcile_resources.go because we would like changes to the deployment resource itself to be reflected in the revision.
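A rough, self-contained sketch of the gating this PR describes (placeholder types, not the actual reconcile_resources.go code): the revision's container-healthy signal only flips to True once the deployment reports at least min-scale ready replicas.

```go
package main

import "fmt"

// deploymentStatus and revisionStatus are placeholders for the corresponding
// Kubernetes/Knative status types.
type deploymentStatus struct {
	ReadyReplicas int32
}

type revisionStatus struct {
	ContainerHealthy string // "True", "False" or "Unknown"
}

// propagateDeploymentStatus only marks the container healthy once the
// deployment reports at least min-scale ready replicas; until then the
// condition stays Unknown.
func propagateDeploymentStatus(rev *revisionStatus, dep deploymentStatus, minScale int32) {
	if dep.ReadyReplicas >= minScale {
		rev.ContainerHealthy = "True"
		return
	}
	rev.ContainerHealthy = "Unknown"
}

func main() {
	rev := &revisionStatus{}
	propagateDeploymentStatus(rev, deploymentStatus{ReadyReplicas: 1}, 3)
	fmt.Println(rev.ContainerHealthy) // Unknown: min-scale not reached yet
}
```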
As for the PodAutoscaler, the logic is already correct in kpa:
https://github.com/knative/serving/blob/main/pkg/reconciler/autoscaling/kpa/kpa.go#L274
and in hpa:
https://github.com/knative/serving/blob/main/pkg/reconciler/autoscaling/hpa/hpa.go#L102
so the PodAutoscaler is running as we expect for both kpa and hpa: both check whether minscale has been reached before setting the revision status.
/retest
/test istio-latest-no-mesh_serving_main
curling GitHub for a file is hitting an HTTP 429 rate limit
/test istio-latest-no-mesh_serving_main
@dprotaso I tried multiple times to make sure there is no race condition. The test results seem to be good so far.
One other observation I have - I think as part of the test you should introduce a readiness delay on the second revision.
What I'm seeing happen locally is that the new revision spins up instantaneously because the image is pre-cached on the node, so there isn't an observed early termination of the first revision.
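A small sketch of the suggested tweak, assuming a reasonably recent k8s.io/api (where the probe handler is the embedded ProbeHandler field); how the probe gets wired into the test's Service spec is omitted here.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// delayedReadinessProbe builds a readiness probe with an initial delay, so the
// second revision's pods cannot become Ready instantly even when the image is
// already cached on the node.
func delayedReadinessProbe(delaySeconds int32) *corev1.Probe {
	return &corev1.Probe{
		InitialDelaySeconds: delaySeconds,
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/"},
		},
	}
}

func main() {
	fmt.Printf("probe delays readiness by %ds\n", delayedReadinessProbe(30).InitialDelaySeconds)
}
```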
@houshengbo @dprotaso where are we on this?
I dug into this more - I think we should fix PropagateAutoscalerStatus
https://github.com/knative/serving/blob/794c02fca63d7a6addde5eafb46a56d0876514c0/pkg/apis/serving/v1/revision_lifecycle.go#L172
When we have no pods ready we end up with a PA Status of
```yaml
status:
  actualScale: 0
  conditions:
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: Requests to the target are being buffered as resources are provisioned.
    reason: Queued
    status: Unknown
    type: Active
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: Requests to the target are being buffered as resources are provisioned.
    reason: Queued
    status: Unknown
    type: Ready
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: K8s Service is not ready
    reason: NotReady
    status: Unknown
    type: SKSReady
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    status: Unknown
    type: ScaleTargetInitialized
  desiredScale: 3
  metricsServiceName: revision-failure-00002-private
  observedGeneration: 1
  serviceName: revision-failure-00002
```
When 1 replica is ready (out of 3) we have the PA status
```yaml
status:
  actualScale: 1
  conditions:
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: Requests to the target are being buffered as resources are provisioned.
    reason: Queued
    status: Unknown
    type: Active
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: Requests to the target are being buffered as resources are provisioned.
    reason: Queued
    status: Unknown
    type: Ready
  - lastTransitionTime: "2025-06-05T01:44:09Z"
    status: "True"
    type: SKSReady
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    status: Unknown
    type: ScaleTargetInitialized
  desiredScale: 3
  metricsServiceName: revision-failure-00002-private
  observedGeneration: 1
  serviceName: revision-failure-00002
```
So PropagateAutoscalerStatus doesn't handle the case where SKSReady=True and ScaleTargetInitialized=Unknown - it seems like we should set RevisionConditionResourcesAvailable=Unknown given that the scale target hasn't been initialized yet.
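One way that missing case could be handled, sketched with placeholder types rather than the real PropagateAutoscalerStatus: when SKSReady is True but ScaleTargetInitialized is still Unknown, ResourcesAvailable stays at Unknown instead of flipping to True.

```go
package main

import "fmt"

type condStatus string

const (
	statusTrue    condStatus = "True"
	statusFalse   condStatus = "False"
	statusUnknown condStatus = "Unknown"
)

// paChildConditions holds just the two PodAutoscaler child conditions that
// matter for this case.
type paChildConditions struct {
	SKSReady               condStatus
	ScaleTargetInitialized condStatus
}

// resourcesAvailableFrom derives the revision's ResourcesAvailable condition
// from the PodAutoscaler's child conditions.
func resourcesAvailableFrom(pa paChildConditions) condStatus {
	switch {
	case pa.SKSReady == statusFalse || pa.ScaleTargetInitialized == statusFalse:
		return statusFalse
	case pa.SKSReady == statusTrue && pa.ScaleTargetInitialized == statusTrue:
		return statusTrue
	default:
		// SKSReady=True but ScaleTargetInitialized=Unknown (the status dump
		// above): keep ResourcesAvailable at Unknown.
		return statusUnknown
	}
}

func main() {
	fmt.Println(resourcesAvailableFrom(paChildConditions{
		SKSReady:               statusTrue,
		ScaleTargetInitialized: statusUnknown,
	})) // prints: Unknown
}
```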
This Pull Request is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen with /reopen. Mark as fresh by adding the comment /remove-lifecycle stale.
I have the alternate fix in this PR - https://github.com/knative/serving/pull/16094