minScale treatment races reachability check
In what area(s)?
/area autoscale
What version of Knative?
All recent at least.
Expected Behavior
autoscaling.knative.dev/minScale holds the respective scale of a revision after it becomes ready.
Actual Behavior
We sometimes scale below the mark only to then scale up again.
Example failing test: https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_serving/11217/pull-knative-serving-istio-stable-no-mesh/1383957220156248064
Steps to Reproduce the Problem
This is a race with the labeler component, which labels all revisions targeted by a Route to be "Reachable". If a revision is unreachable, we ignore the minScale annotation, since nothing can hit the respective revision anyway. This allows more efficient resource usage for minScaled revisions.
Possible solutions
1. Make the Unreachable state more consistent with routing state
In the Revision reconciler, we translate a certain RoutingState (as given by the labeler) to a given Reachability state on the KPA spec (which then informs the minScale decision). We already have a pending state for Revisions that have not been reconciled by the labeler yet. That means, if we'd base strictly off of that RoutingState, this race should be fixed.
However: As it stands, it's possible that a revision is never touched by the labeler, for example if Configuration and Route are created independently. Such a Revision stays in pending forever and as such, would never scale down. That may or may not be fine.
2. Allow for the race
We can allow for the race and adjust the respective test to actually wait for route state to kick in before starting to assert the number of pods.
Personally I'd love to make 1 fly, to have a reliable and consistent RoutingState that can be used as a useful signal for things. We just need to be aware to not change semantics in an unintended way, as Garbage Collection is relying on this mechanism too. Arguably, the semantics we're after are actually quite close to those of Garbage Collection so that might be a good thing!
However: As it stands, it's possible that a revision is never touched by the labeler, for example if Configuration and Route are created independently. Such a Revision stays in pending forever and as such, would never scale down. That may or may not be fine.
Is this accurate? The labeler walks the route's traffic block so I'd assume it wouldn't matter how the Route/Configuration is related.
I guess I meant to point out, that it's possible for a revision to never be touched by the labeler. For example if you create a Route + Config and specify specific revisions in the Route. The same goes for a Service I guess.
For example if you create a Route + Config and specify specific revisions in the Route.
Am I missing something I feel like that's not the case cause the list of revisions/configs to label is built from the Route's spec & status
https://github.com/knative/serving/blob/ae54012f0e01c5352b3168c3d12ab6da5dfe70e3/pkg/reconciler/labeler/metasync.go#L39-L41
Right, so this list might miss Revisions, which then are in RouteState = Pending (which we would take as a signal to not scale down for minScale purposes) and they'd stay there forever.
If I'm misinterpreting that, fixing this issue is trivial :thinking:
Just reproduced the behavior of the labeler I described: Running the MinScale test and adding a 30s sleep to the labeler caused it to never see the first revision and thus never change it's routing state.
Is this is fixing process?
@jwcesign it's not, nope!
/triage accepted