Thanos connection errors
Thanos, Prometheus and Golang version used: Thanos: 0.18.0 and 0.27.0 (I tried updating to 0.27.0 to see if it would fix the issue), Prometheus: 2.25.0
Object Storage Provider: GCP
What happened:
We started getting the following errors on random queries:
Mint: 1660132800030 Maxt: 9223372036854775807: rpc error: code = Canceled desc = grpc: the client connection is closing
Mint: 1660132800030 Maxt: 9223372036854775807: rpc error: code = Internal desc = unexpected EOF
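For anyone triaging these messages, here is a minimal sketch that splits such a line into its parts. It assumes the usual Prometheus/Thanos convention that Mint and Maxt are millisecond Unix timestamps, in which case a Maxt of 9223372036854775807 (the largest int64) means the request had no upper time bound. The helper name `parse_store_error` and the regular expression are hypothetical and only match the shape of the two lines above.

```python
import re
from datetime import datetime, timezone

# Largest int64; assumed here to mean "no upper time bound" when used as Maxt.
MAX_INT64 = 2**63 - 1

# Matches only the message shape quoted in this report (an assumption).
ERROR_RE = re.compile(
    r"Mint: (?P<mint>-?\d+) Maxt: (?P<maxt>-?\d+): "
    r"rpc error: code = (?P<code>\w+) desc = (?P<desc>.*)"
)


def parse_store_error(line):
    """Return the parsed fields of an error line, or None if it does not match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    mint_ms = int(match.group("mint"))
    maxt_ms = int(match.group("maxt"))
    return {
        # Mint is treated as milliseconds since the Unix epoch (assumption).
        "mint": datetime.fromtimestamp(mint_ms / 1000, tz=timezone.utc),
        "maxt_unbounded": maxt_ms == MAX_INT64,
        "code": match.group("code"),
        "desc": match.group("desc").strip(),
    }
```

Grouping failures by the `code` field (Canceled versus Internal here) can show whether one failure mode dominates.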
What you expected to happen: No errors to occur and a result to be returned by the query.
How to reproduce it (as minimally and precisely as possible): Hit Thanos with random queries and you will eventually get the errors (usually every 7-8 requests).
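The reproduction step can be scripted roughly as follows. This is a sketch, not the reporter's actual tooling: it assumes the querier exposes the Prometheus-compatible `/api/v1/query` HTTP endpoint, and the base URL and query list in the commented usage are placeholders. The function names `run_instant_query` and `collect_failures` are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def run_instant_query(base_url, promql, timeout=30.0):
    """Send one instant query; return None on success or an error string."""
    url = (
        base_url.rstrip("/")
        + "/api/v1/query?"
        + urllib.parse.urlencode({"query": promql})
    )
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
    except urllib.error.HTTPError as exc:
        # Error responses normally carry a JSON body with an "error" field.
        try:
            body = json.load(exc)
        except ValueError:
            return f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, connection resets, and socket timeouts.
        return str(exc)
    if body.get("status") == "success":
        return None
    return body.get("error", "unknown error")


def collect_failures(queries, run_query):
    """Run every query and return a list of (query, error) for the failures."""
    failures = []
    for promql in queries:
        err = run_query(promql)
        if err is not None:
            failures.append((promql, err))
    return failures


# Example usage (placeholder URL and queries, adjust to your setup):
# queries = ["up"] * 50
# failed = collect_failures(
#     queries, lambda q: run_instant_query("http://localhost:10902", q)
# )
# print(f"{len(failed)} of {len(queries)} queries failed")
```

Passing the query function as a parameter keeps the counting logic separate from the HTTP call, so the same loop can be pointed at each querier in turn to see which hop produces the errors.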
Anything else we need to know:
We tried setting the following args to tune Thanos and fix the errors, but they did not help and were removed:
thanos-sidecar: --grpc-grace-period=5s
thanos-querier: --store.response-timeout=10s --grpc-grace-period=5s
thanos-store-gateway: --store.response-timeout=10s --grpc-grace-period=5s
thanos-global-query: --query.timeout=5m --store.response-timeout=10s --grpc-grace-period=5s
We already had the following local query timeout set:
thanos-querier: --query.timeout=2m
We are also getting a similar error in one of our Thanos environments.
Can someone from the Thanos dev team provide some insight into how to get around the issue?
Thx!
Hi,
It looks like a TCP connection reset. It could come from the gRPC client connections or from a proxy you use between the client and the server. I remember someone reporting and fixing a similar issue in their setup, so looking through past issues might help. @jmichalek132 or @GiedriusS, perhaps you remember that?
Additionally, did you check whether the server was restarting at the time of those queries, e.g. because of OOMs?
@bwplotka Yes, I've checked whether the server was restarting and it wasn't. I did look at past issues and found this one: https://github.com/thanos-io/thanos/issues/2286#issuecomment-601072887
The reporter there had what seems like the same issue and was able to fix it with some TLS config, but did not mention which setting it was or what value it needed.