pkg
pkg copied to clipboard
Expose queue proxy request metrics reporting period
Changes
- Exposes the reporting period used by Prometheus and Opencensus backend destinations.
- Required for fixing https://github.com/knative/serving/issues/12069
/kind api-change
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: skonto
To complete the pull request process, please assign yanweiguo after the PR has been reviewed.
You can assign the PR to them by writing /assign @yanweiguo in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Codecov Report
Patch coverage: 57.69% and project coverage change: -0.07 :warning:
Comparison is base (
511b394) 81.27% compared to head (17762d5) 81.21%.
Additional details and impacted files
@@ Coverage Diff @@
## main #2362 +/- ##
==========================================
- Coverage 81.27% 81.21% -0.07%
==========================================
Files 163 163
Lines 9908 9932 +24
==========================================
+ Hits 8053 8066 +13
- Misses 1612 1621 +9
- Partials 243 245 +2
| Impacted Files | Coverage Δ | |
|---|---|---|
| metrics/config_observability.go | 69.51% <54.16%> (-6.35%) |
:arrow_down: |
| metrics/config.go | 84.03% <100.00%> (ø) |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
This Pull Request is stale because it has been open for 90 days with
no activity. It will automatically close after 30 more days of
inactivity. Reopen with /reopen. Mark as fresh by adding the
comment /remove-lifecycle stale.
/remove-lifecycle stale
/retest
Gentle ping @julz @dprotaso @nak3
It is probably better to create a PoC against serving with this change? At least it will be easier to review.
@nak3 ok I will do one to show how this is going to be used, the idea is to have that attribute configurable so we can fix that bug.
Regarding codecov:
codegen/cmd/injection-gen/generators 1.51% injection 32% tracing 57.99% test 52.23%
Could we skip test and codegen, not sure what the settings are (dont have access) and open two issues for improving tracing and injection? Is anyone looking at the low coverage already? /cc @pierDipi @nak3 @dprotaso @evankanderson
@nak3 @pierDipi here is the poc, setting the new attribute influences the default reporting at the queue proxy exporter. I verified it again playing with the OpenTelemtry collector.
Note: The default exporter reporting-period values are defined per backend here. Prometheus: 5min, opencensus 1min. If the user sets metrics.reporting-period-seconds the new attribute we introduce in this PR will override that value at the queue proxy side. However, metrics.reporting-period-seconds will be set for other components like activator, autoscaler etc if the user adds that in the configmap.
For some reason btw we are not exposing in the config-observability the following properties, I dont see them in the fixed url of the configmap:
// The following keys are used to configure metrics reporting.
// See https://github.com/knative/serving/blob/main/config/config-observability.yaml
// for details.
collectorAddressKey = "metrics.opencensus-address"
collectorSecureKey = "metrics.opencensus-require-tls"
reportingPeriodKey = "metrics.reporting-period-seconds"
Thank you.
Note: The default exporter reporting-period values are defined per backend here. Prometheus: 5min, opencensus 1min. If the user sets metrics.reporting-period-seconds the new attribute we introduce in this PR will override that value at the queue proxy side. However, metrics.reporting-period-seconds will be set for other components like activator, autoscaler etc if the user adds that in the configmap.
Umm... I think it is a little bit complicated to understand. I understood as the following table:
| component | default | when metrics.reporting-period-seconds in cm is 5s |
|---|---|---|
| queue-proxy | 5min or 1min | 5s |
| activator | ? | ? |
But then, what is the value for defaultRequestReportingPeriodSeconds? activator (and other)'s default?
@nak3 what I meant is that the new attribute (metrics.request-metrics-reporting-period-seconds) does not affect the other components only QP:
| component | default | when metrics.reporting-period-seconds in cm is 5s | when metrics.request-metrics-reporting-period-seconds in cm is 20s | when metrics.reporting-period-seconds = 5s and metrics.request-metrics-reporting-period-seconds in cm is 20s |
|---|---|---|---|---|
| queue-proxy | 10s | 10s | 20s | 20s |
| activator | 5min or 1min | 5s | 5min or 1min | 5s |
metrics.request-metrics-reporting-period-seconds <- affects only QP metrics exporter and overrides anything else, this PR does that.
metrics.reporting-period-seconds <- currently an internal config key used for setting up any metrics exporter. In this PR it sets anything other than QP.
Note in this PR 10s is the default value for metrics.request-metrics-reporting-period-seconds and is set even that attribute is not in the cm. Given the bug of losing metrics (linked in the description) I thought 10s is more appropriate to avoid the current UX. In any case users can configure it if they see it adds overhead and need a longer period.
The concept of having a default even if we dont have a key in the cm is similar to the metrics.backend-destination key which defaults to prometheus.
If I dont set this 10s default period for metrics.request-metrics-reporting-period-seconds then in the table the following changes (rest is the same):
| component | default | when metrics.reporting-period-seconds in cm is 5s |
|---|---|---|
| queue-proxy | 5min or 1min | 5s |
| activator | 5min or 1min | 5s |
I understand. Thank you. (request-metrics-reporting-period-seconds vs reporting-period-seconds is bad for eyes, though :sweat_smile: )
So, this PR has changes, to be precise:
- exposing
request-metrics-reporting-period-secondsto set queue-proxy's reporting period via configmap. - decreasing the queue-proxy's default reporting period from 1min or 5min to 10s.
Is it alright without having two separated PRs? I am asking because I'm not sure how much the default change affects the current users and so I'm wondering the separated PRs will help troubleshooting/reverting/etc..
Is it alright without having two separated PRs?
I dont mind to do so.
request-metrics-reporting-period-seconds vs reporting-period-seconds is bad for eyes,
I am open to suggestions :)
/lgtm
Alright, then LGTM. (though I think I don't have the power to merge it. :sweat_smile: )
@nak3 will do the two PRs.
This Pull Request is stale because it has been open for 90 days with
no activity. It will automatically close after 30 more days of
inactivity. Reopen with /reopen. Mark as fresh by adding the
comment /remove-lifecycle stale.
/remove-lifecycle stale
/retest
@skonto is there anything blocking this change?
@dprotaso no just need some cycles to update. Will do.
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: skonto
Once this PR has been reviewed and has the lgtm label, please assign mdemirhan for approval by writing /assign @mdemirhan in a comment. For more information see:The Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Actually there was a mistake above the default reporting periods are: Prometheus: 5 sec, opencensus 1min
@dprotaso @nak3 updated the PR so that we dont change the default values. When no value is set we pick what ever value exists for metrics.reporting-period-seconds.
@nak3 seens this before?
2022/07/27 13:06:48 http: TLS handshake error from 127.0.0.1:54210: tls: no certificates configured
--- FAIL: TestConversionValidResponse (0.21s)
logger.go:130: 2022-07-27T13:06:48.345Z ERROR webhook/webhook.go:148 failed to fetch secret {"error": "secret \"webhook-certs\" not found"}
conversion_integration_test.go:117: Failed to get response Get "https://0.0.0.0:34381/bazinga": remote error: tls: unrecognized name
logger.go:130: 2022-07-27T13:06:48.345Z INFO webhook/webhook.go:240 Starting to fail readiness probes...
Will retest this passes locally:
metrics: TestMetricsExport/OpenCensus expand_less 5s {Failed === RUN TestMetricsExport/OpenCensus e2e_test.go:194: Created exporter at localhost:12345 {"level":"info","ts":1658927027.88901,"logger":"fallback","caller":"metrics/opencensus_exporter.go:56","msg":"Created OpenCensus exporter with config:","config":{}} e2e_test.go:232: Timeout reading input e2e_test.go:238: Unexpected OpenCensus exports (-want +got): []metrics.metricExtract(Inverse(Sort, []string{ "knative.dev/project/testComponent/global_export_counts<>:2", "knative.dev/project/testComponent/resource_global_export_count<>:2",
knative.dev/project/testComponent/testing/value<project="p1",rev..., -knative.dev/project/testComponent/testing/value<project="p1",revision="r2">:1, })) --- FAIL: TestMetricsExport/OpenCensus (5.05s) }
Tekton tests need an update downstream :shrug: :
# github.com/tektoncd/pipeline/pkg/pod
Error: pkg/pod/pod.go:293:18: assignment mismatch: 2 variables but changeset.Get returns 1 value
/retest
/retest
you're hitting flakes
@dprotaso gentle ping.
This Pull Request is stale because it has been open for 90 days with
no activity. It will automatically close after 30 more days of
inactivity. Reopen with /reopen. Mark as fresh by adding the
comment /remove-lifecycle stale.
This issue or pull request is stale because it has been open for 90 days with no activity.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity,
lifecycle/staleis applied - After 30d of inactivity since
lifecycle/stalewas applied, the issue is closed
You can:
- Mark this issue or PR as fresh with
/remove-lifecycle rotten - Close this issue or PR with
/close
/lifecycle stale