thanos icon indicating copy to clipboard operation
thanos copied to clipboard

query: error DeadlineExceeded missing error details

Open Jakob3xD opened this issue 2 years ago • 15 comments

Thanos, Prometheus and Golang version used: The Docker images: Thanos: 0.23.0 and newer go: go1.16.8

What happened: When the query fails to connect to a store/sidecar it does not display certificates issues like before. This makes debugging difficulty.

level=warn
ts=2021-12-03T16:11:21.556102388Z
caller=endpointset.go:500
component=endpointset
msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901

What you expected to happen: The error should display more information if the certificate is invalid like in Thanos v0.22.0

level=warn ts=2021-12-03T16:30:14.55260649Z
caller=endpointset.go:512
component=endpointset
msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901 after err: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901

How to reproduce it (as minimally and precisely as possible):

  • Create an invalid Cert for the Store/Sidecar
  • Run Thanos 0.23.0 or newer with TLS enabled at the Store/Sidecar

Jakob3xD avatar Dec 03 '21 16:12 Jakob3xD

The error message says the cert is not valid, please check cert. the diff is .scrape-blockstorage vs cpe-scrape-blockstorage. transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local

hanjm avatar Dec 04 '21 03:12 hanjm

Yes, I know what the error message is complaining about but only because it showed the Cert error, which is only displayed in Thanos Versions v0.22.0 and bellow. Thats is my issue. In Thanos v0.23.0 does NOT display the cert error.

Jakob3xD avatar Dec 04 '21 04:12 Jakob3xD

Ran into a similar issue and had to temporarily install v0.22.0 to see the full error message Thanos v0.24.0 still does not display the full error message

RoryDoherty avatar Feb 01 '22 15:02 RoryDoherty

I've faced the same issue today, I've missed adding the store port and I could only see the errors after rollback to version 0.22

nicolastakashi avatar Feb 11 '22 14:02 nicolastakashi

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Apr 16 '22 03:04 stale[bot]

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale[bot] avatar Apr 30 '22 10:04 stale[bot]

Still reproducible and needed.

ahrex avatar May 09 '22 19:05 ahrex

Is this also reproducible with the latest version?

wiardvanrij avatar May 09 '22 19:05 wiardvanrij

I have yet to ask our internal Thanos maintainers about whether the issue persists on the latest version or not.

It looks like we've held our deploy back at v0.22.0 so we can have the additional context in error messages.

I did some digging and found the following:

  • thanos commit 2bf7e1ca06e60cd1dc722c97640b639b9d8ce704 between tags v0.22.0 and v0.23.0 upgraded a few dependencies, including grpc:
ahlex@galette:~/repos/thanos% git diff 2bf7e1ca06e60cd1dc722c97640b639b9d8ce704{~,} -- go.mod | rg -C 2 -w grpc
-       google.golang.org/api v0.50.0
-       google.golang.org/genproto v0.0.0-20210713002101-d411969a0d9a
-       google.golang.org/grpc v1.39.0
+       google.golang.org/api v0.56.0
+       google.golang.org/genproto v0.0.0-20210903162649-d08c68adba83
+       google.golang.org/grpc v1.40.0
        gopkg.in/alecthomas/kingpin.v2 v2.2.6
        gopkg.in/fsnotify.v1 v1.4.7
--
+       github.com/prometheus/prometheus => github.com/prometheus/prometheus v1.8.2-0.20210914090109-37468d88dce8
        github.com/sercand/kuberesolver => github.com/sercand/kuberesolver v2.4.0+incompatible
        google.golang.org/grpc => google.golang.org/grpc v1.29.1
  • inspecting differences between grpc versions v1.39.0 and v1.40.0 show a simplification in the errors returned:
ahlex@galette:~/repos/grpc-go% git diff v1.39.0..v1.40.0 -- internal/status/status.go | rg -C 5 -F 'Error()'
 type Error struct {
-       e *spb.Status
+       s *Status
 }
 
 func (e *Error) Error() string {
-       return fmt.Sprintf("rpc error: code = %s desc = %s", codes.Code(e.e.GetCode()), e.e.GetMessage())
+       return e.s.String()
 }
 
 // GRPCStatus returns the Status represented by se.
  • with what I think is the oversimplification of going from e.e.GetMessage() to e.s.String(), introduced in ba41bbac225e6e1a9b822fe636c40c3b7d977894 of grpc:
ahlex@galette:~/repos/grpc-go% git diff ba41bbac225e6e1a9b822fe636c40c3b7d977894{~,} -- internal/status/status.go | rg -C 5 -F 'Error()'
 type Error struct {
-       e *spb.Status
+       s *Status
 }
 
 func (e *Error) Error() string {
-       return fmt.Sprintf("rpc error: code = %s desc = %s", codes.Code(e.e.GetCode()), e.e.GetMessage())
+       return e.s.String()
 }
 
 // GRPCStatus returns the Status represented by se.

To help answer:

Is this also reproducible with the latest version?

  • getting the version of grpc in the latest tagged thanos version v0.26.0:
ahlex@galette:~/repos/thanos% git show v0.26.0:go.mod | rg -F google.golang.org/grpc
        google.golang.org/grpc v1.43.0
        google.golang.org/grpc/examples v0.0.0-20211119005141-f45e61797429
        google.golang.org/grpc => google.golang.org/grpc v1.40.1
  • checking for the simplified Error() response in grpc version v1.43.0:
ahlex@galette:~/repos/grpc-go% git show v1.43.0:internal/status/status.go | rg -C 5 -F 'Error()'
// and a nil *Error should never be returned by this package.
type Error struct {
        s *Status
}

func (e *Error) Error() string {
        return e.s.String()
}

// GRPCStatus returns the Status represented by se.
func (e *Error) GRPCStatus() *Status {

Yes, I believe the latest Thanos can repro the issue, but I would like someone to also validate my digging.


As for a solution forward, if y'all think my research is on the right track, I'm more than happy to open a PR into grpc-go, if no one else beats me to it.

Thanks!

ahrex avatar May 09 '22 21:05 ahrex

I think that makes sense. The only thing I'm wondering if we shouldn't call something else to get the full info, but at a first simple look, I could not find such method. Perhaps we should make an issue / PR there indeed. I.e. best time/effort would perhaps be to create an issue there. Just to double check we are on the right path.

Anyhow, good spot/debugging ;) thank you

wiardvanrij avatar May 09 '22 22:05 wiardvanrij

Hello! So I spent some time trying to PR a fix, to find out that I was a bit off the mark.

While I still think the problem is between grpc-go versions v1.39.0 and v1.40.0, upon closer inspection the changes to internal/status/status.go look fine.

I did some more digging and found out I'm not proficient enough in all of the changes to make a definitive call, so I filed https://github.com/grpc/grpc-go/issues/5342 to have the experts look at it. I did find some solid anchors to what I think needs to be bubbled up, and I've dropped those hints onto the upstream bug report.

Sorry I wasn't more helpful, and hopefully we get a fix soon!

ahrex avatar May 10 '22 05:05 ahrex

All good, you helped a lot :)

wiardvanrij avatar May 10 '22 09:05 wiardvanrij

Just wanted to add this here, I ran into this same issue during an upgrade:

level=warn ts=2022-06-09T03:31:27.679924722Z caller=endpointset.go:517 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from metrics.mycluster.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=metrics.mycluster.com:443

I'm not sure if others are experiencing this using the bitnami helm chart, but that is how I installed Thanos. The issue was that the querier parameter --grpc-client-server-name was missing after upgrading from the 9.x to 10.x version of the chart. At some point they changed the values structure from query.grpc.client.servername to query.grpc.client.serverName (capitalized N).

mbrancato avatar Jun 09 '22 04:06 mbrancato

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Aug 13 '22 07:08 stale[bot]

Still relevant! Problem persists in version 0.28.0.

erikgb avatar Sep 14 '22 20:09 erikgb