
Rate query failing from Grafana

Open Aransh opened this issue 4 months ago • 2 comments

Thanos, Prometheus and Golang version used: Thanos 0.34.0, Prometheus 2.51.0

Object Storage Provider: Linode

What happened: I have Grafana deployed to my k8s cluster as part of the kube-prometheus-stack Helm chart. It is connected to my Thanos querier as its main datasource (which is connected to various Thanos sidecars).

One of our performance engineers brought to my attention an issue that shows up specifically in Grafana, with the following query (note this uses custom metrics from our apps): sum (irate(starlord_http_requests_total{container="starlord-cyber-feed",namespace="app", cluster="qa-1"}[1m])) by (cluster)

The problem: in the local Prometheus UI or the Thanos Querier UI, this query works with no problems at all. But in Grafana (as part of a dashboard, or in Explore), as soon as we increase the time range to more than 12h, the graph flattens down to 0. Since the query works just fine on both Prometheus and Thanos Querier, I was left to believe the issue must be with Grafana (Thanos Querier is its datasource, so why would it get a different response?)

Some example screenshots. Here is the query set to 1 hour, in both Grafana and the Thanos Querier; all looks good: [screenshots: Thanos-1, Grafana-1]

Now, here it is in both, set to 24 hours: [screenshots: Thanos-2, Grafana-2]

I've tried debugging this and haven't found much. What I did try:

  • Tried using "rate" instead of "irate": same issue
  • Tried changing the datasource's "scrape interval" to 30s (from the default 15s): same issue
  • Tried updating Prometheus + Grafana + Thanos to the latest versions

Only lead I did find is this log line, with http status 400, matching my query: logger=context userId=3 orgId=1 uname=<my-email> t=2024-04-01T16:07:47.268619112Z level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.2.1.129 time_ms=16 duration=16.278802ms size=13513 referer="https://<my-domain>/explore?orgId=1&panes=%7B%22r5m%22%3A%7B%22datasource%22%3A%22P5DCFC7561CCDE821%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22sum+%28rate%28starlord_http_requests_total%7Bcontainer%3D%5C%22starlord-cyber-feed%5C%22%2Cnamespace%3D%5C%22app%5C%22%2C+cluster%3D%5C%22qa-1%5C%22%7D%5B1m%5D%29%29+by+%28cluster%29%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22P5DCFC7561CCDE821%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-24h%22%2C%22to%22%3A%22now%22%7D%7D%7D&schemaVersion=1" handler=/api/ds/query status_source=downstream
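For reference, the panes parameter in that referer can be URL-decoded to see exactly what Explore sent; from the encoded payload it appears to be the same sum (rate(...)) by (cluster) expression, with both "range":true and "instant":true, over now-24h to now. A minimal Python sketch to decode it yourself (the referer string below is a placeholder to paste the full value into):

# Minimal sketch: URL-decode the "panes" parameter from the Grafana log line
# above to see the exact query payload Explore sent to the datasource.
from urllib.parse import urlparse, parse_qs

referer = "https://<my-domain>/explore?orgId=1&panes=..."  # paste the full referer value from the log here
panes_json = parse_qs(urlparse(referer).query)["panes"][0]  # parse_qs percent-decodes the value
print(panes_json)  # shows the expr, "range":true / "instant":true, and the now-24h..now window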

So, possibly the Thanos Querier is failing to handle the query from Grafana for some reason? Thanos itself isn't logging anything about this...
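One way to narrow this down could be to hit the Thanos Querier's Prometheus-compatible range API directly with the same expression over a 24h window, bypassing Grafana entirely; if that also returns a 400 (or all-zero series), the problem would be on the querier side. A rough sketch, assuming an in-cluster querier address of http://thanos-querier:9090 and an illustrative 60s step (Grafana computes its own step from the range and panel width):

# Rough sketch: query the Thanos Querier's Prometheus-compatible range API
# directly, with the same expression Grafana sends, over a 24h window.
# THANOS_URL and the 60s step are assumptions; adjust for your setup.
import time
import requests

THANOS_URL = "http://thanos-querier:9090"  # assumed querier address
QUERY = ('sum(irate(starlord_http_requests_total{container="starlord-cyber-feed",'
         'namespace="app",cluster="qa-1"}[1m])) by (cluster)')

end = int(time.time())
start = end - 24 * 3600  # the 24h window where the graph flattens in Grafana
resp = requests.get(
    f"{THANOS_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": 60},
    timeout=30,
)
print(resp.status_code)  # a 400 here would point at the querier rather than Grafana
body = resp.json()
print(body.get("status"), body.get("error", ""))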

What you expected to happen: I expected queries run in the Thanos Querier UI and in Grafana (which queries the Thanos Querier) to return the same results.

How to reproduce it (as minimally and precisely as possible): Not sure how to reproduce without our specific metrics, but the general setup is kube-prometheus-stack + Thanos Querier + Thanos sidecar(s).

Aransh · Apr 01 '24 16:04

Got a response from Grafana support; the issue was on their side.

Aransh · Apr 01 '24 17:04

Never mind, Grafana support actually said this might indeed be a Thanos bug; would appreciate it if you could take a look: https://community.grafana.com/t/rate-query-failing-only-on-grafana/118325/4

Aransh · Apr 01 '24 21:04