
Google HTTP API Failed

elebioda opened this issue 2 years ago • 11 comments

We're using the frontend UI for our Google Managed Prometheus and are periodically seeing these errors in the pod:

requesting GCM failed" err="Get \"https://monitoring.googleapis.com/v1/projects/{{}}/location/global/prometheus/api/v1/query_range?end=1646328750&query=sum%28kube_deployment_status_replicas_available%7Bdeployment%3D%22{{}}%22%7D%29&start=1646327850&step=30\": context canceled"

Any curly braces represent data I stripped out.
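
For readability, the URL-decoded parameters of that query_range request are roughly as follows (the {{}} placeholders correspond to the values I stripped):

query: sum(kube_deployment_status_replicas_available{deployment="{{}}"})
start: 1646327850
end: 1646328750
step: 30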

elebioda avatar Mar 03 '22 17:03 elebioda

A context cancellation error would typically imply that the request was aborted from the client side. Does this happen for any particular query type or is it fairly random?

fabxc avatar Mar 03 '22 17:03 fabxc

This happens for any query for replicas available

elebioda avatar Mar 03 '22 17:03 elebioda

Sorry, correction: this happens sporadically for queries on the same metric type in Grafana.

elebioda avatar Mar 03 '22 18:03 elebioda

Is it possible that this is a rather expensive query, which may run into a timeout sporadically? When you run this query via the frontend UI or the Cloud console, what kind of timing do you see?

fabxc avatar Mar 09 '22 10:03 fabxc

I've been experiencing this with some regularity in one of my projects. The query is not very demanding and generally returns rather quickly. However, my Grafana alert rules periodically error and the frontend logs show "context canceled" as above.

Here's a screenshot of a graph showing the errors: [screenshot: Screenshot_20220929_113037]

The query is (probe_ssl_earliest_cert_expiry{job="blackbox"} - time()) / 86400 with a timeframe of now-10m to now. This is the only alert I have configured on a graph (from GMP). However, I see plenty of other similar "context canceled" logs for the frontend. My data source is configured with a 120-second timeout, a 5m scrape interval, and a 2m query timeout. The frontend (and other components) are deployed on a GKE cluster of 2 e2-micro nodes, which, though limited in resources, do not look to be oversubscribed.
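
For reference, here's roughly how a data source with those settings can be provisioned; this is a sketch of the standard Grafana Prometheus provisioning format, the frontend URL is illustrative rather than copied from my setup, and the HTTP timeout field name is an assumption:

apiVersion: 1
datasources:
  - name: Google Managed Prometheus
    type: prometheus
    access: proxy
    # points at the GMP frontend proxy Service; adjust name/namespace/port to your deployment
    url: http://frontend.default.svc.cluster.local:9090
    jsonData:
      timeInterval: 5m   # scrape interval
      queryTimeout: 2m   # query timeout
      timeout: 120       # HTTP request timeout in seconds (field name assumed)
      httpMethod: GET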

mhoran avatar Sep 29 '22 15:09 mhoran

I set up the frontend outside of the GKE cluster to rule that out as a factor. I'm no longer getting timeouts from the GCM request, but I did get two timeouts just below it, in the copy response step. This suggests to me that the Google Managed Prometheus endpoint sometimes takes quite a bit of time for various queries. (Edit: shortly after posting this I did get a timeout from the GCM request, so there does seem to be some latency with the Google Managed Prometheus endpoint.)

In my case, it would be nice if Grafana could query Google Managed Prometheus directly, without the frontend proxy. This would require either the Grafana Prometheus plugin to support OAuth 2, or the Grafana Cloud Monitoring plugin to support PromQL.

mhoran avatar Sep 30 '22 14:09 mhoran

After digging through my ingress logs, it seems these requests are indeed being canceled by Grafana due to a timeout. Despite setting my query timeout to 2m, the alert timeout is 30s and is not configurable. For some reason Google Managed Prometheus periodically has a hard time with queries and takes a bit longer than usual. I see this when loading my graphs as well.

Hopefully the performance of GMP will improve over time. In addition, better integration of GMP into Grafana would potentially help in my case.

mhoran avatar Sep 30 '22 21:09 mhoran

Hi @mhoran,

Are the timeouts happening just for Grafana's managed alerts?

pintohutch avatar Oct 10 '22 16:10 pintohutch

I would see timeouts in the Grafana UI for various graphs as well until I added the following to my Ingress:

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_503
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "3"

The key addition is http_503. Prior to this, I would see a lot of 500/502/503 errors in addition to Grafana-canceled requests after 30 seconds (the latter for alerts only). The ingress-nginx-controller was also timing out at 5 seconds and not retrying, so I lowered the connect timeout to 3 seconds and enabled retries by adding http_503 to proxy-next-upstream. This has helped a lot, but sometimes alert queries still take more than 30 seconds and Grafana cancels them.
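
For context, here's roughly how those annotations sit on the frontend Ingress; the host, resource names, and backend service/port are illustrative, not copied from my setup:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gmp-frontend
  annotations:
    # retry the next upstream on errors, timeouts, and 503s instead of failing the request
    nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_503
    # fail fast on connect so the retry fires sooner
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "3"
spec:
  ingressClassName: nginx
  rules:
    - host: prometheus.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend   # the GMP frontend Service
                port:
                  number: 9090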

mhoran avatar Oct 10 '22 16:10 mhoran

Hey Matt, could you share what project you're using as the metrics scope? You can email it to me at (my github username) at (the company I work for).com

Re: the frontend proxy, we also want to get rid of it, but as we want to keep using the Prometheus data source, you're right that Grafana would need to add OAuth support to it first. We've asked, but haven't gotten anywhere. FWIW, all the frontend proxy does is add OAuth creds and forward the request to the GCP APIs; it shouldn't be the source of any latency, and if it is, that's a problem.

P.S. Good to see you on here, haven't talked with you since we both worked at Pivotal ;-)

lyanco avatar Oct 10 '22 17:10 lyanco

:wave: hey Lee!

I sent my project details over to you via email.

I am also fairly certain the frontend is not to blame here. I tried running the queries against the monitoring endpoint directly with my own OAuth token and they had similar latency. I did not see any queries taking 30+ seconds, though I did not perform a very scientific analysis. I did look through the frontend code and didn't see anything suspicious.

I had thought about poking at the Grafana code to see how difficult it would be to crib the existing JWT auth code from the Stackdriver integration for the Prometheus integration. But I'm juggling too many personal projects at the moment, so I haven't made any progress on that yet. Hopefully since the code is mostly written at this point it won't be too bad, but I'm not sure.

Great running into you and thanks for the help!

mhoran avatar Oct 10 '22 21:10 mhoran

@mhoran IIUC, this is related more to the nginx ingress configuration than to the actual GMP query API, correct?

If so, I'm tempted to close this issue as we don't have any other concrete data to go on to debug.

pintohutch avatar Nov 22 '22 23:11 pintohutch

I saw timeouts when running the frontend outside of Google Cloud as well. However, with the above configuration this happens very rarely with the frontend I have deployed on GKE. I never see timeouts on two subsequent requests, so I've configured Grafana to tolerate the failures. In practice, I don't see timeouts when browsing in the Grafana UI either since making the ingress change. Sometimes the requests do take a minute or so, but the retry ensures they succeed.

So while I think the GMP endpoint does sometimes take an unnecessarily long time to respond to a simple query, this is probably only noticeable to those with very low QPS through the API. As such, I think it's fine to close this out, since any further diagnosis is likely to be difficult unless a high-volume user can reliably reproduce it.

Thanks!

mhoran avatar Nov 22 '22 23:11 mhoran

Yea makes sense.

Thanks for the response! I'm going to close this. Feel free to re-open or file a new issue if there are any other issues and we can take a look.

Cheers

pintohutch avatar Nov 22 '22 23:11 pintohutch