serving Prevent a PodAutoscaler's DesiredScale from turning to -1

Fixes #14669

Proposed Changes

This changes the Knative PodAutoscaler logic so that the DesiredScale never changes from a non-negative to a negative value.
This happens when the leading autoscaler has restarted and has no metrics yet. Moments later, once it has metrics, it changes the value back.
While this has no functional impact, it causes such a change for every revision, which leads to four Kubernetes API calls (update PodAutoscaler in autoscaler, update Revision in controller, and then again update PodAutoscaler in autoscaler, update Revision in controller). This has a massive impact because QPS limits slow down overall progress inside these two controllers.
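
The guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the function name `nextDesiredScale` and the constant `scaleUnknown` are hypothetical, and the only thing taken from the description is the rule that a non-negative value is never replaced by a negative one.

```go
package main

import "fmt"

// scaleUnknown is a hypothetical name for the sentinel value -1,
// meaning "no scale recommendation is available".
const scaleUnknown int32 = -1

// nextDesiredScale returns the value to store as the desired scale.
// If the new recommendation is negative (for example, because the
// autoscaler has just restarted and has no metrics yet) and the current
// value is non-negative, the current value is kept.
func nextDesiredScale(current, recommended int32) int32 {
	if recommended < 0 && current >= 0 {
		return current
	}
	return recommended
}

func main() {
	fmt.Println(nextDesiredScale(11, scaleUnknown))           // keeps 11
	fmt.Println(nextDesiredScale(11, 3))                      // real recommendation wins: 3
	fmt.Println(nextDesiredScale(scaleUnknown, scaleUnknown)) // nothing known yet: -1
}
```

With a guard like this, a restart without metrics leaves the stored value unchanged, so no status update (and none of the follow-up API calls) is triggered for that revision.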

Release Note

The autoscaler now keeps the desiredScale of a PodAutoscaler at its current value while it initializes and therefore does not yet have metrics.

Feb 05 '24 10:02 SaschaSchwarze0

Hi @SaschaSchwarze0. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Feb 05 '24 10:02 knative-prow[bot]

/ok-to-test

cc @dprotaso @psschwei

Feb 05 '24 12:02 skonto

It seems that the failing test Name: "sks does not exist", expects to set pa.DesiredScale to -1 instead of the original value (11).

Feb 05 '24 13:02 skonto

> It seems that the failing test Name: "sks does not exist", expects to set pa.DesiredScale to -1 instead of the original value (11).

Yep, now that I don't have this change guarded by a flag anymore, I have to look at the results of all unit tests. Will check.

Feb 05 '24 13:02 SaschaSchwarze0

I changed the test case. It was actually running into the code path where there are no metrics, which now means that it keeps the desiredScale rather than setting it to -1.

Feb 06 '24 07:02 SaschaSchwarze0

/easycla

Feb 07 '24 14:02 SaschaSchwarze0

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 84.05%. Comparing base (17df219) to head (87a5c84). Report is 6 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #14866      +/-   ##
==========================================
+ Coverage   84.02%   84.05%   +0.03%     
==========================================
  Files         213      213              
  Lines       16796    16803       +7     
==========================================
+ Hits        14112    14124      +12     
+ Misses       2329     2323       -6     
- Partials      355      356       +1

Feb 07 '24 14:02 codecov[bot]

I've been reviewing https://github.com/knative/serving/pull/14607 & https://github.com/knative/serving/pull/14573 which address bugs in the same part of the codebase as this PR.

Those PRs fix the problem but they would introduce regressions if merged as is - so it's not solved yet. I'm creating e2e tests to catch these regressions - it would be good to run them against this PR when they land.

It sorta feels like there might be a 'holistic' solution for those PRs where we want to distinguish "I have no data for this revision" and "I have no data for this revision and I haven't tried to sample any"

The latter scenario seems to match what this PR is trying to fix.
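
To make the distinction between the two "no data" cases concrete, here is a rough sketch. Everything in it is an assumption for illustration: the type `metricState`, its values, and the function `decideScale` do not exist in the codebase, and the thread does not settle what each state should lead to, so the mapping below is only one possible reading.

```go
package main

import "fmt"

// metricState is a hypothetical type that separates the two "no data"
// situations mentioned above from the normal case.
type metricState int

const (
	notSampledYet metricState = iota // no data, and no sampling attempted so far
	sampledNoData                    // sampling was attempted but yielded no data
	hasData                          // metrics are available
)

// decideScale is an illustrative mapping, not an agreed design:
//   - hasData:       use the computed scale
//   - sampledNoData: report -1 (assumed here to mean "unknown")
//   - notSampledYet: keep the current value, as this PR does
func decideScale(state metricState, current, computed int32) int32 {
	switch state {
	case hasData:
		return computed
	case sampledNoData:
		return -1
	default:
		return current
	}
}

func main() {
	fmt.Println(decideScale(notSampledYet, 11, 0)) // 11
	fmt.Println(decideScale(sampledNoData, 11, 0)) // -1
	fmt.Println(decideScale(hasData, 11, 4))       // 4
}
```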

I'm going to hold this PR for now until I wrap up the other reviews and the e2e tests

/hold

Feb 15 '24 18:02 dprotaso

@dprotaso can you give an update on where we stand here?

Mar 19 '24 14:03 SaschaSchwarze0

/hold cancel
/lgtm

Mar 27 '24 20:03 dprotaso

/approve

Mar 27 '24 20:03 dprotaso

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, SaschaSchwarze0

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dprotaso]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Mar 27 '24 20:03 knative-prow[bot]

linter failed - I just updated the PR to use the right ptr package

/lgtm

Mar 27 '24 21:03 dprotaso

kourier-tls flakes - https://github.com/knative/serving/issues/15052

/retest

Mar 27 '24 21:03 dprotaso