
Prevent a PodAutoscaler's DesiredScale to turn to -1

Open SaschaSchwarze0 opened this issue 1 year ago • 9 comments

Fixes #14669

Proposed Changes

  • This changes the Knative PodAutoscaler logic so that the DesiredScale never changes from a non-negative to a negative value.
  • Such a change happens when the leading autoscaler restarts and does not yet have metrics. Moments later, once it has metrics, it changes the value back.
  • While this has no functional impact, the change occurs for every revision and leads to four Kubernetes API calls (update PodAutoscaler in the autoscaler, update Revision in the controller, then again update PodAutoscaler in the autoscaler and update Revision in the controller). This has a massive impact, because the QPS limit slows down overall progress inside these two controllers.
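The guard described in the bullets above can be sketched in a few lines of Go. This is a minimal illustration and not the code from this PR: the function name `nextDesiredScale` and the constant `noDecision` are hypothetical, and the sketch only assumes that -1 is the value written when the autoscaler has no metrics, as the thread describes.

```go
package main

import "fmt"

// noDecision is a hypothetical name for the -1 value that the thread
// describes being written while the autoscaler has no metrics.
const noDecision int32 = -1

// nextDesiredScale returns the value to store as the desired scale.
// A negative computed value never overwrites a non-negative current
// value, so a restart without metrics does not trigger an update.
func nextDesiredScale(current, computed int32) int32 {
	if computed < 0 && current >= 0 {
		return current
	}
	return computed
}

func main() {
	fmt.Println(nextDesiredScale(11, noDecision)) // prints 11: the current value is kept
	fmt.Println(nextDesiredScale(-1, noDecision)) // prints -1: nothing to preserve yet
	fmt.Println(nextDesiredScale(11, 3))          // prints 3: a real decision is applied
}
```

Because the stored value does not change in the first case, no PodAutoscaler update is issued, which is what avoids the follow-up Revision updates mentioned above.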

Release Note

The autoscaler now keeps the desiredScale of a PodAutoscaler at its current value while it initializes and therefore does not yet have metrics

SaschaSchwarze0 avatar Feb 05 '24 10:02 SaschaSchwarze0

Hi @SaschaSchwarze0. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow[bot] avatar Feb 05 '24 10:02 knative-prow[bot]

/ok-to-test

cc @dprotaso @psschwei

skonto avatar Feb 05 '24 12:02 skonto

It seems that the failing test Name: "sks does not exist", expects to set pa.DesiredScale to -1 instead of the original value (11).

skonto avatar Feb 05 '24 13:02 skonto

> It seems that the failing test Name: "sks does not exist", expects to set pa.DesiredScale to -1 instead of the original value (11).

Yep, now that I don't have this change guarded by a flag anymore, I have to look at the results of all unit tests. Will check.

SaschaSchwarze0 avatar Feb 05 '24 13:02 SaschaSchwarze0

I changed the test case. It was actually running into that code path where there are no metrics and now that means that it keeps the desiredScale rather than setting it to -1.

SaschaSchwarze0 avatar Feb 06 '24 07:02 SaschaSchwarze0

/easycla

SaschaSchwarze0 avatar Feb 07 '24 14:02 SaschaSchwarze0

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 84.05%. Comparing base (17df219) to head (87a5c84). Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14866      +/-   ##
==========================================
+ Coverage   84.02%   84.05%   +0.03%     
==========================================
  Files         213      213              
  Lines       16796    16803       +7     
==========================================
+ Hits        14112    14124      +12     
+ Misses       2329     2323       -6     
- Partials      355      356       +1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 07 '24 14:02 codecov[bot]

I've been reviewing https://github.com/knative/serving/pull/14607 & https://github.com/knative/serving/pull/14573 which address bugs in the same part of the codebase as this PR.

Those PRs fix the problem but they would introduce regressions if merged as is - so it's not solved yet. I'm creating e2e tests to catch these regressions - it would be good to run them against this PR when they land.

It sorta feels like there might be a 'holistic' solution for those PRs where we want to distinguish "I have no data for this revision" and "I have no data for this revision and I haven't tried to sample any"

The latter scenario seems to match what this PR is trying to fix.
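To illustrate the distinction drawn above, here is a small Go sketch that keeps the two "no data" cases apart. It is only an illustration under stated assumptions: the type `metricState`, its constants, and `classify` with its `attempts` and `samples` counters are hypothetical names and do not come from the Knative codebase.

```go
package main

import "fmt"

// metricState is a hypothetical tri-state separating the two
// "no data" situations from the normal case.
type metricState int

const (
	notSampledYet metricState = iota // no data and no sampling attempt so far
	sampledNoData                    // sampling was attempted but returned nothing
	hasData                          // at least one sample is available
)

// String makes the states readable when printed.
func (s metricState) String() string {
	switch s {
	case notSampledYet:
		return "notSampledYet"
	case sampledNoData:
		return "sampledNoData"
	default:
		return "hasData"
	}
}

// classify derives the state from two assumed counters: how often
// sampling was attempted and how many samples were collected.
func classify(attempts, samples int) metricState {
	switch {
	case samples > 0:
		return hasData
	case attempts > 0:
		return sampledNoData
	default:
		return notSampledYet
	}
}

func main() {
	fmt.Println(classify(0, 0)) // a freshly restarted autoscaler
	fmt.Println(classify(5, 0)) // sampling tried, nothing collected
	fmt.Println(classify(5, 2)) // metrics are available
}
```

With such a split, only the `notSampledYet` case would correspond to the freshly restarted autoscaler that this PR targets.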

I'm going to hold this PR for now until I wrap up the other reviews and the e2e tests

/hold

dprotaso avatar Feb 15 '24 18:02 dprotaso

@dprotaso can you give an update on where we stand here?

SaschaSchwarze0 avatar Mar 19 '24 14:03 SaschaSchwarze0

/hold cancel
/lgtm

dprotaso avatar Mar 27 '24 20:03 dprotaso

/approve

dprotaso avatar Mar 27 '24 20:03 dprotaso

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, SaschaSchwarze0

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

knative-prow[bot] avatar Mar 27 '24 20:03 knative-prow[bot]

linter failed - I just updated the PR to use the right ptr package

/lgtm

dprotaso avatar Mar 27 '24 21:03 dprotaso

kourier-tls flakes - https://github.com/knative/serving/issues/15052
/retest

dprotaso avatar Mar 27 '24 21:03 dprotaso