pkg icon indicating copy to clipboard operation
pkg copied to clipboard

Expose queue proxy request metrics reporting period

Open skonto opened this issue 4 years ago • 30 comments

Changes

  • Exposes the reporting period used by Prometheus and Opencensus backend destinations.
  • Required for fixing https://github.com/knative/serving/issues/12069

/kind api-change

skonto avatar Nov 26 '21 10:11 skonto

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skonto To complete the pull request process, please assign yanweiguo after the PR has been reviewed. You can assign the PR to them by writing /assign @yanweiguo in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

knative-prow-robot avatar Nov 26 '21 10:11 knative-prow-robot

Codecov Report

Patch coverage: 57.69% and project coverage change: -0.07 :warning:

Comparison is base (511b394) 81.27% compared to head (17762d5) 81.21%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2362      +/-   ##
==========================================
- Coverage   81.27%   81.21%   -0.07%     
==========================================
  Files         163      163              
  Lines        9908     9932      +24     
==========================================
+ Hits         8053     8066      +13     
- Misses       1612     1621       +9     
- Partials      243      245       +2     
Impacted Files Coverage Δ
metrics/config_observability.go 69.51% <54.16%> (-6.35%) :arrow_down:
metrics/config.go 84.03% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Nov 26 '21 10:11 codecov[bot]

This Pull Request is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen with /reopen. Mark as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Feb 28 '22 01:02 github-actions[bot]

/remove-lifecycle stale

skonto avatar Feb 28 '22 11:02 skonto

/retest

skonto avatar Mar 04 '22 12:03 skonto

Gentle ping @julz @dprotaso @nak3

skonto avatar Mar 04 '22 12:03 skonto

It is probably better to create a PoC against serving with this change? At least it will be easier to review.

nak3 avatar Mar 07 '22 07:03 nak3

@nak3 ok I will do one to show how this is going to be used, the idea is to have that attribute configurable so we can fix that bug.

skonto avatar Mar 08 '22 10:03 skonto

Regarding codecov:

codegen/cmd/injection-gen/generators 1.51% injection 32% tracing 57.99% test 52.23%

Could we skip test and codegen, not sure what the settings are (dont have access) and open two issues for improving tracing and injection? Is anyone looking at the low coverage already? /cc @pierDipi @nak3 @dprotaso @evankanderson

skonto avatar Mar 08 '22 11:03 skonto

@nak3 @pierDipi here is the poc, setting the new attribute influences the default reporting at the queue proxy exporter. I verified it again playing with the OpenTelemtry collector.

Note: The default exporter reporting-period values are defined per backend here. Prometheus: 5min, opencensus 1min. If the user sets metrics.reporting-period-seconds the new attribute we introduce in this PR will override that value at the queue proxy side. However, metrics.reporting-period-seconds will be set for other components like activator, autoscaler etc if the user adds that in the configmap. For some reason btw we are not exposing in the config-observability the following properties, I dont see them in the fixed url of the configmap:

// The following keys are used to configure metrics reporting.
// See https://github.com/knative/serving/blob/main/config/config-observability.yaml
// for details.
collectorAddressKey = "metrics.opencensus-address"
collectorSecureKey  = "metrics.opencensus-require-tls"
reportingPeriodKey  = "metrics.reporting-period-seconds"

skonto avatar Mar 08 '22 14:03 skonto

Thank you.

Note: The default exporter reporting-period values are defined per backend here. Prometheus: 5min, opencensus 1min. If the user sets metrics.reporting-period-seconds the new attribute we introduce in this PR will override that value at the queue proxy side. However, metrics.reporting-period-seconds will be set for other components like activator, autoscaler etc if the user adds that in the configmap.

Umm... I think it is a little bit complicated to understand. I understood as the following table:

component default when metrics.reporting-period-seconds in cm is 5s
queue-proxy 5min or 1min 5s
activator ? ?

But then, what is the value for defaultRequestReportingPeriodSeconds? activator (and other)'s default?

nak3 avatar Mar 09 '22 04:03 nak3

@nak3 what I meant is that the new attribute (metrics.request-metrics-reporting-period-seconds) does not affect the other components only QP:

component default when metrics.reporting-period-seconds in cm is 5s when metrics.request-metrics-reporting-period-seconds in cm is 20s when metrics.reporting-period-seconds = 5s and metrics.request-metrics-reporting-period-seconds in cm is 20s
queue-proxy 10s 10s 20s 20s
activator 5min or 1min 5s 5min or 1min 5s

metrics.request-metrics-reporting-period-seconds <- affects only QP metrics exporter and overrides anything else, this PR does that. metrics.reporting-period-seconds <- currently an internal config key used for setting up any metrics exporter. In this PR it sets anything other than QP. Note in this PR 10s is the default value for metrics.request-metrics-reporting-period-seconds and is set even that attribute is not in the cm. Given the bug of losing metrics (linked in the description) I thought 10s is more appropriate to avoid the current UX. In any case users can configure it if they see it adds overhead and need a longer period. The concept of having a default even if we dont have a key in the cm is similar to the metrics.backend-destination key which defaults to prometheus.

If I dont set this 10s default period for metrics.request-metrics-reporting-period-seconds then in the table the following changes (rest is the same):

component default when metrics.reporting-period-seconds in cm is 5s
queue-proxy 5min or 1min 5s
activator 5min or 1min 5s

skonto avatar Mar 09 '22 08:03 skonto

I understand. Thank you. (request-metrics-reporting-period-seconds vs reporting-period-seconds is bad for eyes, though :sweat_smile: )

So, this PR has changes, to be precise:

  1. exposing request-metrics-reporting-period-seconds to set queue-proxy's reporting period via configmap.
  2. decreasing the queue-proxy's default reporting period from 1min or 5min to 10s.

Is it alright without having two separated PRs? I am asking because I'm not sure how much the default change affects the current users and so I'm wondering the separated PRs will help troubleshooting/reverting/etc..

nak3 avatar Mar 09 '22 12:03 nak3

Is it alright without having two separated PRs?

I dont mind to do so.

request-metrics-reporting-period-seconds vs reporting-period-seconds is bad for eyes,

I am open to suggestions :)

skonto avatar Mar 10 '22 11:03 skonto

/lgtm

Alright, then LGTM. (though I think I don't have the power to merge it. :sweat_smile: )

nak3 avatar Mar 10 '22 11:03 nak3

@nak3 will do the two PRs.

skonto avatar Mar 17 '22 09:03 skonto

This Pull Request is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen with /reopen. Mark as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jun 22 '22 01:06 github-actions[bot]

/remove-lifecycle stale

skonto avatar Jun 22 '22 10:06 skonto

/retest

@skonto is there anything blocking this change?

dprotaso avatar Jul 27 '22 00:07 dprotaso

@dprotaso no just need some cycles to update. Will do.

skonto avatar Jul 27 '22 09:07 skonto

New changes are detected. LGTM label has been removed.

knative-prow[bot] avatar Jul 27 '22 09:07 knative-prow[bot]

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skonto Once this PR has been reviewed and has the lgtm label, please assign mdemirhan for approval by writing /assign @mdemirhan in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

knative-prow[bot] avatar Jul 27 '22 09:07 knative-prow[bot]

Actually there was a mistake above the default reporting periods are: Prometheus: 5 sec, opencensus 1min

skonto avatar Jul 27 '22 09:07 skonto

@dprotaso @nak3 updated the PR so that we dont change the default values. When no value is set we pick what ever value exists for metrics.reporting-period-seconds.

skonto avatar Jul 27 '22 12:07 skonto

@nak3 seens this before?

2022/07/27 13:06:48 http: TLS handshake error from 127.0.0.1:54210: tls: no certificates configured
--- FAIL: TestConversionValidResponse (0.21s)
    logger.go:130: 2022-07-27T13:06:48.345Z	ERROR	webhook/webhook.go:148	failed to fetch secret	{"error": "secret \"webhook-certs\" not found"}
    conversion_integration_test.go:117: Failed to get response Get "https://0.0.0.0:34381/bazinga": remote error: tls: unrecognized name
    logger.go:130: 2022-07-27T13:06:48.345Z	INFO	webhook/webhook.go:240	Starting to fail readiness probes...

skonto avatar Jul 27 '22 13:07 skonto

Will retest this passes locally:

metrics: TestMetricsExport/OpenCensus expand_less 5s {Failed === RUN TestMetricsExport/OpenCensus e2e_test.go:194: Created exporter at localhost:12345 {"level":"info","ts":1658927027.88901,"logger":"fallback","caller":"metrics/opencensus_exporter.go:56","msg":"Created OpenCensus exporter with config:","config":{}} e2e_test.go:232: Timeout reading input e2e_test.go:238: Unexpected OpenCensus exports (-want +got): []metrics.metricExtract(Inverse(Sort, []string{ "knative.dev/project/testComponent/global_export_counts<>:2", "knative.dev/project/testComponent/resource_global_export_count<>:2", knative.dev/project/testComponent/testing/value<project="p1",rev..., - knative.dev/project/testComponent/testing/value<project="p1",revision="r2">:1, })) --- FAIL: TestMetricsExport/OpenCensus (5.05s) }

skonto avatar Jul 27 '22 13:07 skonto

Tekton tests need an update downstream :shrug: :

# github.com/tektoncd/pipeline/pkg/pod
Error: pkg/pod/pod.go:293:18: assignment mismatch: 2 variables but changeset.Get returns 1 value

skonto avatar Jul 27 '22 13:07 skonto

/retest

skonto avatar Jul 27 '22 13:07 skonto

/retest

you're hitting flakes

dprotaso avatar Jul 27 '22 14:07 dprotaso

@dprotaso gentle ping.

skonto avatar Jul 29 '22 12:07 skonto

This Pull Request is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen with /reopen. Mark as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Oct 28 '22 01:10 github-actions[bot]

This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

knative-prow-robot avatar Nov 27 '22 02:11 knative-prow-robot