
Thanos connection errors

Open bsamsom opened this issue 2 years ago • 4 comments

Thanos, Prometheus and Golang version used: Thanos: 0.18.0 and 0.27.0 (I tried updating to 0.27.0 to see if it would fix the issue). Prometheus: 2.25.0.

Object Storage Provider: GCP

What happened: We started getting the following errors on random queries:

Mint: 1660132800030 Maxt: 9223372036854775807: rpc error: code = Canceled desc = grpc: the client connection is closing

Mint: 1660132800030 Maxt: 9223372036854775807: rpc error: code = Internal desc = unexpected EOF

What you expected to happen: No errors occur and the query returns a result.

How to reproduce it (as minimally and precisely as possible): Hit Thanos with random queries and you will eventually get the errors (usually every 7-8 requests).
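A rough way to measure how often the errors show up is to send the same instant query repeatedly to the querier's Prometheus-compatible HTTP API (`/api/v1/query`) and count failures and warnings. This is only a sketch: the base URL is passed on the command line, the `up` query and the 50 attempts are arbitrary choices, and whether the gRPC errors surface as a failed response or as entries in `warnings` likely depends on the partial-response settings, so both are counted.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
	"time"
)

// apiResponse holds only the fields of the query response inspected here.
type apiResponse struct {
	Status   string   `json:"status"`
	Error    string   `json:"error"`
	Warnings []string `json:"warnings"`
}

// queryURL builds an instant-query URL for a Prometheus-compatible API.
func queryURL(base, promQL string) string {
	v := url.Values{}
	v.Set("query", promQL)
	return base + "/api/v1/query?" + v.Encode()
}

// runOnce sends one query and returns any warnings, or an error if the
// request failed or the API reported a non-success status.
func runOnce(ctx context.Context, client *http.Client, target string) ([]string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var body apiResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, fmt.Errorf("HTTP %d, undecodable body: %w", resp.StatusCode, err)
	}
	if body.Status != "success" {
		return nil, fmt.Errorf("HTTP %d: %s", resp.StatusCode, body.Error)
	}
	return body.Warnings, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: repro <querier-base-url>, e.g. http://localhost:10902 (example address)")
		return
	}
	client := &http.Client{Timeout: 2 * time.Minute} // mirrors --query.timeout=2m from the issue
	target := queryURL(os.Args[1], "up")             // "up" is an arbitrary example query

	const attempts = 50 // arbitrary; the issue reports an error every 7-8 requests
	failed, warned := 0, 0
	for i := 1; i <= attempts; i++ {
		warnings, err := runOnce(context.Background(), client, target)
		switch {
		case err != nil:
			failed++
			fmt.Printf("request %d failed: %v\n", i, err)
		case len(warnings) > 0:
			warned++
			fmt.Printf("request %d warnings: %v\n", i, warnings)
		}
	}
	fmt.Printf("%d failed, %d with warnings, out of %d\n", failed, warned, attempts)
}
```

Running it against each layer in turn (for example the global querier, then a local querier) may help narrow down which hop drops the connection.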

Anything else we need to know:

We tried setting the following args to tune Thanos and fix the errors, but they did not work and were removed:

thanos-sidecar: --grpc-grace-period=5s
thanos-querier: --store.response-timeout=10s --grpc-grace-period=5s
thanos-store-gateway: --store.response-timeout=10s --grpc-grace-period=5s
thanos-global-query: --query.timeout=5m --store.response-timeout=10s --grpc-grace-period=5s

We already had the following local query timeout set:

thanos-querier: --query.timeout=2m

bsamsom avatar Aug 25 '22 18:08 bsamsom

We are also getting a similar error in one of our Thanos environments.

Can someone from the Thanos dev team provide some insight into how to get around the issue?

Thx!

aarontams avatar Sep 20 '22 00:09 aarontams

Hi,

It looks like a TCP connection reset. It could come from the gRPC client connections or from a proxy sitting between the client and the server. I remember someone reporting and fixing a similar issue in their setup, so looking through past issues might help. @jmichalek132 or @GiedriusS, do you remember that?

bwplotka avatar Sep 20 '22 11:09 bwplotka

Additionally, did you check whether the server was restarting at the time of those queries? E.g. because of OOMs?

bwplotka avatar Sep 20 '22 11:09 bwplotka

@bwplotka Yes, I've checked whether the server was restarting, and it wasn't. I did look at past issues and found this one: https://github.com/thanos-io/thanos/issues/2286#issuecomment-601072887

The commenter there seems to have had the same issue and was able to fix it with some TLS config, but did not mention which setting it was or what value it needed.

bsamsom avatar Sep 20 '22 16:09 bsamsom