[wip] [feedback wanted] Do not scrape pods when activator in path
Fixes #7324
Proposed Changes
- Pause scraping pods when activator in data path (excess burst capacity < 0)
- Resume when excess burst capacity >= 0
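A minimal sketch of the pause/resume rule listed above, for discussion only. `scrapeGate` and `update` are hypothetical names, not identifiers from knative/serving; the only thing taken from this PR is the threshold (pause when excess burst capacity < 0, resume when it is >= 0).

```go
package main

import "fmt"

// scrapeGate is a hypothetical helper (not a knative/serving type) that
// tracks whether pod scraping is currently paused.
type scrapeGate struct {
	paused bool
}

// update records the latest excess burst capacity and reports whether pods
// should be scraped. A negative value means the activator is in the data
// path, so scraping is paused; zero or more resumes it.
func (g *scrapeGate) update(excessBurstCapacity float64) bool {
	g.paused = excessBurstCapacity < 0
	return !g.paused
}

func main() {
	g := &scrapeGate{}
	for _, ebc := range []float64{10, -1, 0} {
		fmt.Printf("ebc=%v scrape=%v\n", ebc, g.update(ebc))
	}
}
```

Note that this gate keys only on excess burst capacity, so it does not cover the case raised in the first feedback item, where SKS forces "serve" mode.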
Feedback needed on the following items (more emphasis on the first):
- In the current implementation, because of https://github.com/knative/serving/pull/13027, when excess burst capacity < 0 AND there are no activator endpoints, metrics might be missed, since SKS forces "serve" mode. Is there a way to float the SKS mode ("proxy" or "serve") to the autoscaler, or another way to account for this?
- Writing unit tests for this situation
Release Note
NONE
Hi @Alexander-Kita. Thanks for your PR.
I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
Codecov Report
:x: Patch coverage is 57.35294% with 29 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 79.92%. Comparing base (6d9d1b6) to head (dd3f55d).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/autoscaler/metrics/collector.go | 36.66% | 18 Missing and 1 partial :warning: |
| pkg/reconciler/metric/metric.go | 28.57% | 9 Missing and 1 partial :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #16254 +/- ##
==========================================
- Coverage 80.09% 79.92% -0.17%
==========================================
Files 215 215
Lines 13361 13427 +66
==========================================
+ Hits 10701 10732 +31
- Misses 2300 2334 +34
- Partials 360 361 +1
/ok-to-test
The e2e failures seem legit
Generally, after a quick look, I like the abstractions used, though I'm guessing something more nuanced is causing this change to break the e2e tests.
I believe I found what is causing these e2e failures. The activator (concurrency_reporter) appears to stop collecting metrics for a revision when it sees zero concurrency.
From pkg/activator/handler/concurrency_reporter.go:
// This is only 0 if we have seen no activity for the entire reporting
// period at all.
if report.AverageConcurrency == 0 {
	toDelete = append(toDelete, key)
}
This appears to trigger too early: once no concurrency is seen, the activator stops sending metrics, and because pods are no longer scraped, the service can never scale to zero. To test this, I added a buffer (the reporter has to see zero three times before it stops) and the e2e test passed when I ran it. This behavior was probably hidden before because we were still scraping pods while the activator was in the path. Maybe the activator should keep tracking revision stats until the revision is gone or scaled to zero? @dprotaso, how do you recommend I approach a solution, if one is still wanted?
Even better, I could float the SKS mode to the metric reconciler so that scraping can be paused from there instead.
Edit: I took the second approach as of the latest commit.
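A rough sketch of what gating on the SKS mode could look like, assuming the mode is available to the reconciler as "proxy" or "serve". Everything here is hypothetical (`sksMode`, `pauser`, `reconcileScraping`, `fakeCollector`); it is not the API of the real collector or reconciler, only an illustration of the idea.

```go
package main

import "fmt"

// sksMode is a hypothetical stand-in for the SKS operation mode.
type sksMode string

const (
	modeServe sksMode = "serve"
	modeProxy sksMode = "proxy"
)

// pauser is a hypothetical interface for something that can pause and
// resume pod scraping for a given revision key.
type pauser interface {
	Pause(key string)
	Resume(key string)
}

// reconcileScraping pauses scraping while the activator is in the data path
// ("proxy") and resumes it otherwise ("serve").
func reconcileScraping(c pauser, key string, mode sksMode) {
	if mode == modeProxy {
		c.Pause(key)
		return
	}
	c.Resume(key)
}

// fakeCollector records the paused state per key, for demonstration only.
type fakeCollector struct {
	paused map[string]bool
}

func (f *fakeCollector) Pause(key string)  { f.paused[key] = true }
func (f *fakeCollector) Resume(key string) { f.paused[key] = false }

func main() {
	f := &fakeCollector{paused: map[string]bool{}}
	for _, m := range []sksMode{modeProxy, modeServe} {
		reconcileScraping(f, "default/rev-1", m)
		fmt.Printf("mode=%s paused=%v\n", m, f.paused["default/rev-1"])
	}
}
```

Because the decision follows the actual SKS mode rather than excess burst capacity, scraping stays on when SKS forces "serve" mode with no activator endpoints.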
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Alexander-Kita
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Code Review Process.