queue-proxy does not appear to flush metrics on shutdown

Open jasonaliyetti opened this issue 4 years ago • 10 comments

What version of Knative?

0.25.0

Expected Behavior

When configured to use opencensus, the queue-proxy should attempt to flush metrics to the collector endpoint on shutdown.

Actual Behavior

The queue-proxy does not appear to flush metrics on shutdown. This leads to missing telemetry.

Steps to Reproduce the Problem

  • Set up knative-serving to use opencensus for request metrics (see the config sketch below)
  • Send a request and wait for the metric to show up in the otel collector's prometheus exporter endpoint (the counter value should be 1)
  • Send a request to the service and then delete the pod
  • Observe the prometheus exporter endpoint on the otel collector and verify that the request counter does not increase even after the pod is gone (the counter value stays at 1 but should be 2)

Side note: It would be very nice to have metrics.reporting-period-seconds actually propagate to the queue-proxy if set to give some control over the interval as well.
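For anyone reproducing this, here is a minimal sketch of the config-observability settings the setup step assumes; the collector address is a placeholder and the exact key set may differ between Knative versions.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  # Route both control-plane and request metrics through opencensus.
  metrics.backend-destination: opencensus
  metrics.request-metrics-backend-destination: opencensus
  # Address of the otel collector's opencensus (ocagent) receiver (placeholder).
  metrics.opencensus-address: otel-collector.observability:55678
```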

jasonaliyetti avatar Sep 30 '21 13:09 jasonaliyetti

@jasonaliyetti I see two separate issues here:

a) The flush metrics bug. I see that we do call flush in queue-proxy main and in other places to make sure metrics are flushed before main exits. This part needs debugging; do you have any logs showing the second request being handled at all (so the counter is actually increased)? As a side note, I was wondering if we should also stop the exporter after flushing it. I would assume that stopping makes any metric recording fail if some request is being handled at the same time; at least this way we know some error happened. However, it seems that we drain all connections and serve all requests before we run the flush part, as soon as we get SIGTERM. The default reporting period is defaultReportingDuration = 10 * time.Second (I didn't see a default one for the opencensus exporter), but we do flush metrics at the end of main, which should export them and ship them to the collector (I took a quick look at the oc exporter code). Any errors on the collector side?

I assumed the deletion was not done with --force --grace-period=0, because after SIGTERM we would then get SIGKILL pretty quickly (and the latter cannot be caught).
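For illustration, here is a minimal, self-contained sketch of the flush-then-stop shutdown pattern discussed above, using the opencensus ocagent exporter. This is not the actual queue-proxy shutdown code; the collector address and service name are placeholders.

```go
package main

import (
	"log"
	"time"

	ocagent "contrib.go.opencensus.io/exporter/ocagent"
	"go.opencensus.io/stats/view"
)

func main() {
	// Placeholder address for the otel collector's opencensus (ocagent) receiver.
	exporter, err := ocagent.NewExporter(
		ocagent.WithInsecure(),
		ocagent.WithAddress("otel-collector.observability:55678"),
		ocagent.WithServiceName("queue-proxy-sketch"),
	)
	if err != nil {
		log.Fatalf("failed to create ocagent exporter: %v", err)
	}
	view.RegisterExporter(exporter)
	// Mirrors the defaultReportingDuration mentioned above.
	view.SetReportingPeriod(10 * time.Second)

	defer func() {
		// Flush sends whatever data the exporter has already buffered to the collector.
		exporter.Flush()
		// Stop then shuts the exporter down; any recording attempted after this
		// point is not exported (the trade-off discussed above).
		if err := exporter.Stop(); err != nil {
			log.Printf("error stopping exporter: %v", err)
		}
	}()

	// ... serve traffic, drain connections on SIGTERM, then return from main ...
}
```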

b) Propagating metrics.reporting-period-seconds on the queue-proxy side. Straightforward to do.

Would you like to work on a PR to fix the second one? (I could review it.)

As a note of caution, monitoring is not meant for accurate counting.

/cc @evankanderson

skonto avatar Sep 30 '21 15:09 skonto

@skonto I did get a response for my second request, so I know it went through. I did not notice any errors logged that looked out of place as part of the shutdown, but I will double-check this.

I will try to find some time in the next week to look at the second issue.

jasonaliyetti avatar Oct 01 '21 18:10 jasonaliyetti

Also, I did not force delete or use an ungraceful termination. I did, however, have a small timeout set for the revision; I increased this from 5 seconds to 60 seconds. It looks like the issue still occurs (no metrics), but now the queue-proxy logs show a fully graceful shutdown.

Looking through the code, it looks like the reporting period may be set to 1 minute. This seems to match what I experience: there's usually about a minute of lag between the request and the counter being incremented on the prometheus exporter in the otel collector.

jasonaliyetti avatar Oct 01 '21 19:10 jasonaliyetti

Cool, I missed that config part (I guess 1 min is set for performance reasons, so that a batch is sent). I am wondering, though, why flushing does not do the job; I will try to reproduce it next week as well. I would expect it to override the reporting period and ship the metrics. Will get back to you here.

skonto avatar Oct 01 '21 22:10 skonto

@jasonaliyetti I spent some time debugging this and verified that metrics are not shipped because of the long reporting period (1 min). Unfortunately, when flush is called (the deferred one) in queue main, it does not mean that the views' data will be uploaded (normally sent via ocagent); they are only added for processing here. These calls do not directly send any metrics. What worked for me was to set a small reporting period of 1 sec (this can be tuned) on the queue-proxy side. This does come with the downside/overhead of making more requests over the lifetime of the pod. I am not sure what the impact is at scale for the collector, but at least it seems that metrics are sent via gRPC using an rpc stream. I will create a PR to make this configurable. When I tested the new config value I did see (via logging) batches being sent (I also verified this via the prometheus exporter endpoint on the collector side and via the collector's debug logs).
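For illustration, here is a minimal sketch of the workaround described above, using the opencensus Go API; the one-second period is the value from the comment and is what the upcoming PR would make configurable.

```go
package main

import (
	"time"

	"go.opencensus.io/stats/view"
)

func main() {
	// With a short reporting period, the opencensus worker uploads view data
	// frequently, so a pod that is deleted shortly after serving a request
	// still has its last measurements exported before shutdown. Flush() alone
	// only queues the views' data for processing; the periodic report is what
	// actually ships them to the registered exporter.
	view.SetReportingPeriod(1 * time.Second)

	// ... register an exporter and views, serve traffic ...
}
```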

skonto avatar Nov 25 '21 14:11 skonto

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Feb 24 '22 01:02 github-actions[bot]

/remove-lifecycle stale

skonto avatar Mar 04 '22 10:03 skonto

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jun 03 '22 01:06 github-actions[bot]

/remove-lifecycle stale

skonto avatar Jun 14 '22 08:06 skonto

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Sep 14 '22 01:09 github-actions[bot]

This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

knative-prow-robot avatar Oct 14 '22 02:10 knative-prow-robot

/lifecycle frozen

deleted comments were ones related to the stale bot

dprotaso avatar Mar 29 '23 19:03 dprotaso

/triage accepted

The linked PR is merged - https://github.com/knative/pkg/pull/2362

@skonto is there anything else to configure in the serving repo? Also, we should update the docs.

dprotaso avatar Mar 29 '23 19:03 dprotaso

bump @skonto

dprotaso avatar Apr 26 '23 13:04 dprotaso

@dprotaso Afaik no, this should be fixed, assuming the reporting period is set accordingly. I will create a PR for the docs side. @jasonaliyetti is this something you want to try with the latest Knative version?
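For anyone who wants to try it, here is a sketch of setting the reporting period via the config-observability key mentioned earlier; the value 5 is just an example.

```sh
# Example only: set the metrics reporting period to 5 seconds so the
# queue-proxy exports its last batch well before the pod is torn down.
kubectl patch configmap config-observability -n knative-serving \
  --type merge -p '{"data":{"metrics.reporting-period-seconds":"5"}}'
```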

skonto avatar Apr 29 '23 14:04 skonto

@skonto were you able to update the docs? If so I'll close this issue

/assign @skonto

dprotaso avatar May 25 '23 14:05 dprotaso

Not yet, I was off for the past two weeks; I will take a look now.

skonto avatar May 26 '23 09:05 skonto

@dprotaso I have a new PR; I will update the docs next.

skonto avatar May 26 '23 16:05 skonto

@skonto can you add some docs to the website about these new options?

dprotaso avatar May 30 '23 02:05 dprotaso

@dprotaso done: https://github.com/knative/docs/pull/5582.

skonto avatar May 30 '23 09:05 skonto