0.35: Panic with query mode distributed
Thanos, Prometheus and Golang version used: 0.35.0
Object Storage Provider: Azure Storage Account
What happened:
After enable query.mode=distributed my querier get a lot of panics. Removing --query.mode=distributed stops all pancis
What you expected to happen:
No panics
How to reproduce it (as minimally and precisely as possible):
At the moment, I'm unable to provide a minimal reproducible environment. However, according ruler logs, all queries like absent(up{job=\"kube-proxy\"} == 1) (job can have any label) seems affected.
We are using stateless rulers and thanos receive, no sidecars.
Thanos querier arguments:
query --log.level=info --log.format=json --grpc-address=0.0.0.0:10901 --http-address=0.0.0.0:10902 --query.replica-label=replica --query.replica-label=prometheus_replica --query.replica-label=thanos_receive_replica --query.replica-label=thanos_ruler_replica --endpoint=opsstack-thanos-storegateway.opsstack.svc.cluster.local.:10901 --endpoint=opsstack-thanos-receive.opsstack.svc.cluster.local.:10901 --alert.query-url=http://opsstack-thanos-query.opsstack.svc.cluster.local:10902 --enable-auto-gomemlimit --query.promql-engine=thanos --query.mode=distributed --query.auto-downsampling --query.default-tenant-id=opsstack --web.disable-cors --web.prefix-header=X-Forwarded-Prefix
Full logs to relevant components:
Uncomment if you would like to post collapsible logs:
Ruler Logs
Just a few, they are repeating
{"caller":"rule.go:968","component":"rules","err":"rpc error: code = Internal desc = runtime error: index out of range [0] with length 0","level":"error","query":"absent(up{job=\"kube-proxy\"} == 1)","ts":"2024-05-02T13:03:09.954559928Z"}
{"caller":"rule.go:938","component":"rules","err":"read query instant response: perform POST request against http://opsstack-thanos-query.opsstack.svc.cluster.local:10902/api/v1/query: Post \"http://opsstack-thanos-query.opsstack.svc.cluster.local:10902/api/v1/query\": EOF","level":"error","query":"absent(up{job=\"apiserver\"} == 1)","ts":"2024-05-02T12:59:26.372712478Z"}
Querier Logs
https://gist.github.com/jkroepke/9fc58319bf819866138a8dae4f1c8d92
Anything else we need to know:
Environment:
- OS (e.g. from /etc/os-release): Linux (Offical Thanos Images)
- Version: quay.io/thanos/thanos:v0.35.0
The issue here was that the querier is not configured to point at query APIs but at store APIs ( we should guard against that better probably ); this leads to the promql-engine distributing to 0 other engines, which exposes a bug where we dont guard against that!
Edit: see https://cloud-native.slack.com/archives/CK5RSSC10/p1714654267669009