Prevent a PodAutoscaler's DesiredScale from turning to -1
Fixes #14669
Proposed Changes
- This changes the Knative PodAutoscaler logic to never change the DesiredScale from a non-negative to a negative value.
- The change to -1 happens when the leading autoscaler restarts and does not yet have metrics. Moments later, once it has metrics, it changes the value back.
- While this has no functional impact, it causes such a change for every revision, which leads to four Kubernetes API calls each time (update the PodAutoscaler in the autoscaler, update the Revision in the controller, then the PodAutoscaler again in the autoscaler, and the Revision again in the controller). Because client QPS limits throttle these calls, this massively slows down overall progression inside these two controllers.
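The guard described above can be sketched roughly as follows. This is an illustrative, simplified version, not the actual knative/serving reconciler code; the function and parameter names are hypothetical.

```go
package main

import "fmt"

// guardDesiredScale is a hypothetical sketch of the proposed logic:
// -1 means "unknown" (the autoscaler has no metrics yet). Never regress
// from a known non-negative scale back to unknown; keep the last known
// value instead and avoid the spurious status update.
func guardDesiredScale(current, proposed int32) int32 {
	if proposed < 0 && current >= 0 {
		return current
	}
	return proposed
}

func main() {
	// After a restart the autoscaler would propose -1; the guard keeps 11.
	fmt.Println(guardDesiredScale(11, -1))
	// A genuinely new PodAutoscaler with no prior value stays at -1.
	fmt.Println(guardDesiredScale(-1, -1))
}
```

Keeping the status unchanged means neither the PodAutoscaler nor the dependent Revision needs an API update, which is where the QPS savings come from.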
Release Note
The autoscaler now keeps the desiredScale of a PodAutoscaler at its current value while it initializes and therefore does not yet have metrics.
Hi @SaschaSchwarze0. Thanks for your PR.
I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
cc @dprotaso @psschwei
It seems that the failing test Name: "sks does not exist", expects to set pa.DesiredScale to -1 instead of the original value (11).
> It seems that the failing test Name: "sks does not exist", expects to set pa.DesiredScale to -1 instead of the original value (11).
Yep, now that I don't have this change guarded by a flag anymore, I have to look at the results of all unit tests. Will check.
I changed the test case. It was actually running into that code path where there are no metrics and now that means that it keeps the desiredScale rather than setting it to -1.
/easycla
Codecov Report
All modified and coverable lines are covered by tests ✅
Project coverage is 84.05%. Comparing base (17df219) to head (87a5c84). Report is 6 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #14866 +/- ##
==========================================
+ Coverage 84.02% 84.05% +0.03%
==========================================
Files 213 213
Lines 16796 16803 +7
==========================================
+ Hits 14112 14124 +12
+ Misses 2329 2323 -6
- Partials 355 356 +1
I've been reviewing https://github.com/knative/serving/pull/14607 & https://github.com/knative/serving/pull/14573 which address bugs in the same part of the codebase as this PR.
Those PRs fix the problem but they would introduce regressions if merged as is - so it's not solved yet. I'm creating e2e tests to catch these regressions - it would be good to run them against this PR when they land.
It sorta feels like there might be a 'holistic' solution for those PRs where we want to distinguish "I have no data for this revision" from "I have no data for this revision and I haven't tried to sample any yet".
The latter scenario seems to match what this PR is trying to fix.
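The distinction being discussed could be modeled as an explicit state rather than overloading -1. The following is a hypothetical sketch, assuming illustrative names (metricState, desiredScaleFor) that do not exist in knative/serving:

```go
package main

import "fmt"

// metricState is a hypothetical enum separating the two "no metrics" cases.
type metricState int

const (
	notYetSampled metricState = iota // autoscaler restarted; hasn't tried to sample yet
	noData                           // sampling was attempted but returned nothing
	hasData                          // metrics are available
)

// desiredScaleFor shows how the two cases could diverge: only a confirmed
// absence of data should drive desiredScale to -1, while a freshly
// restarted autoscaler keeps the last known value.
func desiredScaleFor(state metricState, lastKnown int32) int32 {
	switch state {
	case notYetSampled:
		return lastKnown // don't regress to -1 just because we haven't sampled
	case noData:
		return -1 // genuinely unknown
	default:
		return lastKnown
	}
}

func main() {
	fmt.Println(desiredScaleFor(notYetSampled, 11))
	fmt.Println(desiredScaleFor(noData, 11))
}
```

With such a state, the "keep the current value during initialization" behavior from this PR falls out naturally instead of being a special case on -1.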
I'm going to hold this PR for now until I wrap up the other reviews and the e2e tests
/hold
@dprotaso can you give an update on where we stand here?
/hold cancel
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dprotaso, SaschaSchwarze0
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [dprotaso]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
The linter failed - I just updated the PR to use the right ptr package.
/lgtm
kourier-tls flakes - https://github.com/knative/serving/issues/15052
/retest