thanos
thanos copied to clipboard
query: error DeadlineExceeded missing error details
Thanos, Prometheus and Golang version used: The Docker images: Thanos: 0.23.0 and newer go: go1.16.8
What happened: When the query fails to connect to a store/sidecar it does not display certificates issues like before. This makes debugging difficulty.
level=warn
ts=2021-12-03T16:11:21.556102388Z
caller=endpointset.go:500
component=endpointset
msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901
What you expected to happen: The error should display more information if the certificate is invalid like in Thanos v0.22.0
level=warn ts=2021-12-03T16:30:14.55260649Z
caller=endpointset.go:512
component=endpointset
msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901 after err: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901
How to reproduce it (as minimally and precisely as possible):
- Create an invalid Cert for the Store/Sidecar
- Run Thanos 0.23.0 or newer with TLS enabled at the Store/Sidecar
The error message says the cert is not valid, please check cert. the diff is .scrape-blockstorage
vs cpe-scrape-blockstorage
.
transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local
Yes, I know what the error message is complaining about but only because it showed the Cert error, which is only displayed in Thanos Versions v0.22.0 and bellow. Thats is my issue. In Thanos v0.23.0 does NOT display the cert error.
Ran into a similar issue and had to temporarily install v0.22.0 to see the full error message Thanos v0.24.0 still does not display the full error message
I've faced the same issue today, I've missed adding the store port and I could only see the errors after rollback to version 0.22
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind
command if you wish to be reminded at some point in future.
Closing for now as promised, let us know if you need this to be reopened! 🤗
Still reproducible and needed.
Is this also reproducible with the latest version?
I have yet to ask our internal Thanos maintainers about whether the issue persists on the latest version or not.
It looks like we've held our deploy back at v0.22.0
so we can have the additional context in error messages.
I did some digging and found the following:
- thanos commit
2bf7e1ca06e60cd1dc722c97640b639b9d8ce704
between tagsv0.22.0
andv0.23.0
upgraded a few dependencies, includinggrpc
:
ahlex@galette:~/repos/thanos% git diff 2bf7e1ca06e60cd1dc722c97640b639b9d8ce704{~,} -- go.mod | rg -C 2 -w grpc
- google.golang.org/api v0.50.0
- google.golang.org/genproto v0.0.0-20210713002101-d411969a0d9a
- google.golang.org/grpc v1.39.0
+ google.golang.org/api v0.56.0
+ google.golang.org/genproto v0.0.0-20210903162649-d08c68adba83
+ google.golang.org/grpc v1.40.0
gopkg.in/alecthomas/kingpin.v2 v2.2.6
gopkg.in/fsnotify.v1 v1.4.7
--
+ github.com/prometheus/prometheus => github.com/prometheus/prometheus v1.8.2-0.20210914090109-37468d88dce8
github.com/sercand/kuberesolver => github.com/sercand/kuberesolver v2.4.0+incompatible
google.golang.org/grpc => google.golang.org/grpc v1.29.1
- inspecting differences between grpc versions
v1.39.0
andv1.40.0
show a simplification in the errors returned:
ahlex@galette:~/repos/grpc-go% git diff v1.39.0..v1.40.0 -- internal/status/status.go | rg -C 5 -F 'Error()'
type Error struct {
- e *spb.Status
+ s *Status
}
func (e *Error) Error() string {
- return fmt.Sprintf("rpc error: code = %s desc = %s", codes.Code(e.e.GetCode()), e.e.GetMessage())
+ return e.s.String()
}
// GRPCStatus returns the Status represented by se.
- with what I think is the oversimplification of going from
e.e.GetMessage()
toe.s.String()
, introduced inba41bbac225e6e1a9b822fe636c40c3b7d977894
ofgrpc
:
ahlex@galette:~/repos/grpc-go% git diff ba41bbac225e6e1a9b822fe636c40c3b7d977894{~,} -- internal/status/status.go | rg -C 5 -F 'Error()'
type Error struct {
- e *spb.Status
+ s *Status
}
func (e *Error) Error() string {
- return fmt.Sprintf("rpc error: code = %s desc = %s", codes.Code(e.e.GetCode()), e.e.GetMessage())
+ return e.s.String()
}
// GRPCStatus returns the Status represented by se.
To help answer:
Is this also reproducible with the latest version?
- getting the version of
grpc
in the latest tagged thanos versionv0.26.0
:
ahlex@galette:~/repos/thanos% git show v0.26.0:go.mod | rg -F google.golang.org/grpc
google.golang.org/grpc v1.43.0
google.golang.org/grpc/examples v0.0.0-20211119005141-f45e61797429
google.golang.org/grpc => google.golang.org/grpc v1.40.1
- checking for the simplified
Error()
response ingrpc
versionv1.43.0
:
ahlex@galette:~/repos/grpc-go% git show v1.43.0:internal/status/status.go | rg -C 5 -F 'Error()'
// and a nil *Error should never be returned by this package.
type Error struct {
s *Status
}
func (e *Error) Error() string {
return e.s.String()
}
// GRPCStatus returns the Status represented by se.
func (e *Error) GRPCStatus() *Status {
Yes, I believe the latest Thanos can repro the issue, but I would like someone to also validate my digging.
As for a solution forward, if y'all think my research is on the right track, I'm more than happy to open a PR into grpc-go
, if no one else beats me to it.
Thanks!
I think that makes sense. The only thing I'm wondering if we shouldn't call something else to get the full info, but at a first simple look, I could not find such method. Perhaps we should make an issue / PR there indeed. I.e. best time/effort would perhaps be to create an issue there. Just to double check we are on the right path.
Anyhow, good spot/debugging ;) thank you
Hello! So I spent some time trying to PR a fix, to find out that I was a bit off the mark.
While I still think the problem is between grpc-go versions v1.39.0
and v1.40.0
, upon closer inspection the changes to internal/status/status.go
look fine.
I did some more digging and found out I'm not proficient enough in all of the changes to make a definitive call, so I filed https://github.com/grpc/grpc-go/issues/5342 to have the experts look at it. I did find some solid anchors to what I think needs to be bubbled up, and I've dropped those hints onto the upstream bug report.
Sorry I wasn't more helpful, and hopefully we get a fix soon!
All good, you helped a lot :)
Just wanted to add this here, I ran into this same issue during an upgrade:
level=warn ts=2022-06-09T03:31:27.679924722Z caller=endpointset.go:517 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from metrics.mycluster.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=metrics.mycluster.com:443
I'm not sure if others are experiencing this using the bitnami helm chart, but that is how I installed Thanos. The issue was that the querier parameter --grpc-client-server-name
was missing after upgrading from the 9.x to 10.x version of the chart. At some point they changed the values structure from query.grpc.client.servername
to query.grpc.client.serverName
(capitalized N
).
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind
command if you wish to be reminded at some point in future.
Still relevant! Problem persists in version 0.28.0.