
Metrics gap due to "Treat failure to collect metric as fatal"

Open pracucci opened this issue 4 years ago • 4 comments

We're experiencing small gaps in the metrics exported by the stackdriver exporter. After some investigation, they turned out to be caused by the exporter exiting each time an API request times out:

level=error ts=2020-06-30T12:54:18.862Z caller=monitoring_collector.go:194 msg="Error while getting Google Stackdriver Monitoring metrics" err="Get \"REDACTED\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

This behavioural change was introduced in https://github.com/prometheus-community/stackdriver_exporter/pull/83 and I'm wondering how we could improve it. A few ideas/options:

  1. Revert PR #83
  2. Exit with a fatal error only for specific errors (in this case I'd need to better understand the specific error returned in the #66 use case)
  3. Add a CLI flag to enable/disable the "exit on metric collection error" behaviour
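Option 3 could be as small as a boolean flag gating the fatal path. A minimal stdlib-only sketch (the flag name is hypothetical, and the real exporter uses kingpin rather than the standard library flag package):

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"log"
)

// Hypothetical flag name for illustration only.
var exitOnError = flag.Bool("collector.fatal-on-error", false,
	"Exit the exporter when a metric collection request fails.")

// collect stands in for the real Stackdriver API call.
func collect() error {
	return errors.New("net/http: request canceled (Client.Timeout exceeded while awaiting headers)")
}

func handleCollectError(err error) {
	if *exitOnError {
		// Current (post-#83) behaviour: kill the process.
		log.Fatalf("Error while getting Google Stackdriver Monitoring metrics: %v", err)
	}
	// Proposed default: log and keep serving, so one timed-out API
	// request does not take down every other metric.
	fmt.Printf("non-fatal collection error: %v\n", err)
}

func main() {
	flag.Parse()
	if err := collect(); err != nil {
		handleCollectError(err)
	}
}
```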

Any other ideas to fix it?

Example of gap

[Screenshot: example of a metrics gap, 2020-06-30 17:36:52]

pracucci avatar Jun 30 '20 15:06 pracucci

Yes, I've had some conversations with @gouthamve about this. We should probably extend the collector to have a stackdriver_up metric. I'm thinking the best option for now is to simply revert the Fatal.

SuperQ avatar Jul 01 '20 16:07 SuperQ

Ping @cpick. It turns out the error handling change introduces too many problems. I think you would be better off not crashing the exporter and fixing your Vault automation to restart on credential rotation.

SuperQ avatar Jul 03 '20 09:07 SuperQ

I figured out a simple way to reproduce the issue.

It looks like the error is: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden.

This seems to come from google.golang.org/api/googleapi/googleapi.go.

SuperQ avatar Jul 03 '20 09:07 SuperQ

It makes sense to me to specifically identify authentication issues and only exit when one is detected.

@pracucci:

We're experiencing small gaps in the metrics exported by the stackdriver exporter. After some investigation, they turned out to be caused by the exporter exiting each time an API request times out:

Just out of curiosity: if the underlying request to the Stackdriver API times out, that will still result in a gap in metrics (whether or not stackdriver-exporter restarts), right?

@SuperQ:

We should probably extend the collector to have a stackdriver_up metric. I'm thinking the best option for now is to simply revert the Fatal.

Being able to continue publishing this proposed stackdriver_up metric (and alert on it) sounds like a good reason not to exit on all failures. (My original patch in #83 did not try to distinguish between error types, in the hope that it might also catch other transient errors. We also used alerts about stackdriver-exporter restarting frequently as a sign that the Stackdriver API might be unhealthy, but I can certainly understand that others may not want that.)

@SuperQ:

It looks like the error is: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden.

Exactly, that's one example and here's another:

FATA[0003] Error while getting Google Stackdriver Monitoring metrics: Get https://monitoring.googleapis.com/v3/projects/example/metricDescriptors?alt=json&filter=metric.type+%3D+starts_with%28%22pubsub.googleapis.com%2Fsubscription%2Fnum_undelivered_messages%22%29: oauth2: cannot fetch token: 400 Bad Request Response: {"error":"invalid_grant","error_description":"Invalid JWT Signature."} source="monitoring_collector.go:138"

If we exit on those two cases, I think that'd be a great middle ground.

cpick avatar Jul 09 '20 18:07 cpick