
Query: upstream query endpoint reclassified as sidecar unexpectedly

Open erhudy opened this issue 2 years ago • 2 comments

Thanos, Prometheus and Golang version used: Thanos v0.25.0, Prometheus v2.34.0

Object Storage Provider: N/A

What happened: I have a hybrid Thanos setup that joins Thanos Query instances in AWS with on-prem Thanos Query instances. Our Grafana setup reads from on-premises Thanos Query, which in turn reads from a combination of on-prem sidecar instances and cloud query instances.

In the specific setup that is causing problems, our non-production Thanos Query setup reads from two cloud query instances. One of them is for a QA environment and currently announces only 2 labelsets. The other is for a dev environment and announces 24 labelsets. Each cloud query instance consists of 4 Thanos Query replicas running in EKS, fronted by an NLB provisioned by the AWS LB controller. The NLB forwards 10901/TCP through to the replicas.

What you expected to happen:

I expected the cloud dev query instance to stay classified as a query endpoint. What happens regularly instead is that it is unexpectedly reclassified as a sidecar announcing a single labelset. The two on-prem query instances that read from the cloud one do not always agree on this; sometimes one of them shows it as a sidecar while the other shows it as a query. Occasionally it fixes itself and goes back to being classified as a query endpoint, but more often it stays stuck that way and I have to restart the on-prem instances to get them to see it as a query endpoint again.
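One way to capture the disagreement between the two on-prem instances when it occurs is to compare what each one reports for its stores. The following is only a rough sketch, not something taken from this setup: it assumes the Query HTTP API exposes `/api/v1/stores` and that the response groups entries by component type under `data`, with each entry carrying `name` and `labelSets` fields. The helper names `fetch_stores`, `classify`, and `diff_views` are hypothetical.

```python
import json
import urllib.request


def fetch_stores(base_url, timeout=10):
    """Fetch the store list from one Thanos Query instance.

    Assumption: the instance serves GET <base_url>/api/v1/stores as JSON.
    """
    with urllib.request.urlopen(f"{base_url}/api/v1/stores", timeout=timeout) as resp:
        return json.load(resp)


def classify(stores_response):
    """Map endpoint name -> (component type, number of announced labelsets).

    Assumption: the response looks like
    {"data": {"query": [{"name": ..., "labelSets": [...]}], "sidecar": [...]}}.
    """
    result = {}
    for component, entries in stores_response.get("data", {}).items():
        for entry in entries or []:
            labelsets = entry.get("labelSets") or []
            result[entry["name"]] = (component, len(labelsets))
    return result


def diff_views(view_a, view_b):
    """Return the endpoints that two Query instances classify differently."""
    return {
        name: (view_a.get(name), view_b.get(name))
        for name in sorted(set(view_a) | set(view_b))
        if view_a.get(name) != view_b.get(name)
    }
```

Run periodically against both on-prem instances (for example `diff_views(classify(fetch_stores(url_a)), classify(fetch_stores(url_b)))`, with `url_a` and `url_b` standing in for the real addresses), a non-empty result would record when one instance starts reporting the cloud endpoint as a sidecar with a single labelset while the other still reports it as a query.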

How to reproduce it (as minimally and precisely as possible):

I have not worked out what triggers this yet. My initial suspicion was that the Thanos instances in EKS were being restarted in quick succession and the NLB would stay unhealthy for too long while new target registration was in progress, so I did some work to reduce how often Thanos was getting restarted in EKS, but that doesn't seem to have made a difference so far.

Full logs to relevant components:

I don't have logs at the moment but will post them the next time I see the problem occur.

Anything else we need to know:

erhudy, Apr 13 '22 12:04

Hi, thanks for the intro. It would be nice to have more information, perhaps some examples, logs, or screenshots, when you get the chance.

wiardvanrij, Apr 15 '22 00:04

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot], Sep 21 '22 06:09