Ruler: v0.25.2 no query API server reachable

Open bwplotka opened this issue 3 years ago • 16 comments

One user shared that our Rulers were having hiccups finding the right Querier endpoints, resulting in gaps:

image

Apparently reverting to v0.24.0 resolved the issue. This seems to be a stateful Ruler.

We will need more information, e.g.:

  • What was reverted: only the Ruler version, or anything else?
  • What is the configuration of the mentioned Ruler?

bwplotka avatar May 02 '22 09:05 bwplotka

The configuration of Thanos Ruler:

- args:
  - rule
  - --log.level=debug
  - --log.format=logfmt
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --objstore.config=$(OBJSTORE_CONFIG)
  - --data-dir=/thanos/data
  - --eval-interval=2m
  - --label=rule_replica="$(NAME)"
  - --alert.label-drop=rule_replica
  - --remote-write.config-file=/etc/thanos/conf/rw-config.yaml
  - --query=dnssrv+_http._tcp.observatorium-thanos-query-frontend.monitoring.svc.cluster.local
  - --rule-file=/etc/thanos/rules/*/*.yaml

There was no change in the config. Just changing the version from v0.25.2 to v0.24.0 fixed the problem.

sharathfeb12 avatar May 18 '22 20:05 sharathfeb12

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Jul 31 '22 04:07 stale[bot]

I have been encountering a similar issue since upgrading to v0.28.1. Many rules fail to be evaluated by the Ruler with the error "no query API server reachable". Was this issue ever resolved? @bwplotka @yeya24

RohitKochhar avatar Jan 18 '23 14:01 RohitKochhar

I'm seeing the same issue after upgrading to v0.29.0. A couple of findings: with around 4k+ targets it works fine, whereas when the targets are increased to 24k we run into the error "No query API server reachable".

Additional info from the logs:

LabelSets: Mint: -62167219200000 Maxt: 9223372036854775807: rpc error: code = Unknown desc = query Prometheus: request failed with code 503 Service Unavailable; msg Service Unavailable\"}

Also, can someone help me understand whether all rules are executed simultaneously?

daganibhanu avatar Feb 10 '23 11:02 daganibhanu

Hi Team, 5903 as per the suggestion, we have upgraded to 0.29.0, and since then we have been seeing this issue. Is there any workaround, or could you please help with how to deal with this issue? Thanks in advance!

daganibhanu avatar Feb 14 '23 05:02 daganibhanu

@bwplotka Is this issue addressed in version 0.30.0? Any pointers on this issue would be helpful. Thanks in advance!!

daganibhanu avatar Feb 23 '23 10:02 daganibhanu

@bwplotka we have the same problem with the 0.30.0 Ruler. We deploy it with the ThanosRuler CRD and use dnssrv record discovery in Kubernetes.

Cellebyte avatar Mar 02 '23 16:03 Cellebyte

@bwplotka it looks like partial_response_strategy now needs to be set for Ruler rules. Without it, it is not possible to query with missing stores, because the query returns errors.

Cellebyte avatar Mar 03 '23 10:03 Cellebyte

Hey @Cellebyte, I'm having the same issue. Could you clarify how you fixed it?

As per Thanos documentation: "It is recommended to keep partial response as abort for alerts and that is the default as well."

What exactly did you enable, and how? I'm using the ThanosRuler CRD, if that helps.

Migueljfs avatar Mar 03 '23 11:03 Migueljfs

@Migueljfs you need to set partial_response_strategy: "warn", because the Ruler will fail if one of your Querier's StoreAPIs is not reachable or does not answer the Ruler's rule request.
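For reference, a minimal sketch of where this setting could go in a Thanos rule file. The group name, alert name, and expression are hypothetical placeholders; only the partial_response_strategy field is taken from the discussion above. With the ThanosRuler CRD the same field is, as far as I know, set per group in the PrometheusRule object, but please verify that against your operator version.

```yaml
# Hypothetical rule file: group name, alert name and expression are placeholders.
groups:
  - name: example-alerts
    # Thanos-specific, set per rule group: "abort" (the default) or "warn".
    # "warn" lets a rule evaluate on partial data when a StoreAPI is missing.
    partial_response_strategy: "warn"
    rules:
      - alert: ExampleHighErrorRate
        expr: sum(rate(http_requests_total{code=~"5.."}[5m])) > 1
        for: 10m
        labels:
          severity: warning
```

Keep in mind the documentation quote above: with "warn", alerts can be evaluated on incomplete data, so this is a trade-off rather than a pure fix.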

Cellebyte avatar Mar 03 '23 11:03 Cellebyte

We cover the problem mentioned above with an additional alert that checks whether our remote query is reachable, using vector(0) or the up metric for the remote cluster.
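A rough sketch of what such a guard alert could look like when built on the up metric. This is an assumption about the approach, not the exact rule used here: the cluster label, its value, the alert name, and the timings are all hypothetical.

```yaml
# Hypothetical guard alert: the label name/value and timings are assumptions.
groups:
  - name: remote-query-reachability
    # "warn" so this group still evaluates when a store is missing.
    partial_response_strategy: "warn"
    rules:
      - alert: RemoteClusterUnreachable
        # Fires when no `up` series at all is returned for the remote cluster.
        expr: absent(up{cluster="remote-cluster"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No up series returned for the remote cluster, its StoreAPI may be unreachable."
```

The idea is that partial responses no longer fail rule evaluation, so a separate alert is needed to notice when a whole cluster's data has gone missing.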

Cellebyte avatar Mar 03 '23 11:03 Cellebyte

We have identified the issue. In our case, the problem was with one of the Prometheus shards, which had used up all its memory and was not responding. After cleaning up its data (removing the WAL, head_chunks and TSDB, which may cause data loss) and bringing the shards up clean, it started working.

daganibhanu avatar Mar 06 '23 05:03 daganibhanu

Did anyone get a fix for the above issue? I have set partial_response_strategy: "warn" in my rules file, but I still get the same error, "no query API server reachable". Below is the command I used to bring up my Ruler:

/bin/thanos rule \
  --data-dir /var/lib/prometheus-ruler/ \
  --eval-interval 30s \
  --rule-file /etc/prometheus/alert/*.yml \
  --alert.query-url http:/<prom-server-1>:9090 \
  --alertmanagers.url http://localhost:9093 \
  --objstore.config-file /etc/prometheus/bucket.yml \
  --query http://<prom-server-1>:129090 \
  --query http://<prom-server-2>:29090 \
  --label 'monitor_cluster="eu1"' \
  --label 'replica="prom-server101"'

Can someone help with this issue? Or is there another version of Thanos that handles this error?

sunilnerella avatar Mar 17 '23 06:03 sunilnerella

Having a similar issue running Thanos v0.31.0 via the ThanosRuler CRD (Prometheus Operator).

zbialik avatar Nov 21 '23 18:11 zbialik

I changed the --query value from a load balancer (with Thanos Query instances as endpoints) to the direct Thanos Query endpoint names. The problem disappeared immediately :)
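As an illustration only, the change might look like this in the Ruler's container args. All hostnames are hypothetical, and port 10902 is assumed from the HTTP address used in the configuration earlier in this thread. The --query flag can be repeated, once per endpoint.

```yaml
# Hypothetical hostnames; adjust to your own Thanos Query instances.
- args:
  - rule
  # Before: a single load-balanced address in front of all Thanos Query instances.
  # - --query=http://thanos-query-lb.example.internal:10902
  # After: each Thanos Query instance listed directly.
  - --query=http://thanos-query-0.example.internal:10902
  - --query=http://thanos-query-1.example.internal:10902
```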

LukaszWasko avatar Jan 29 '24 16:01 LukaszWasko