query: Error with "dnssrv+" service discovery with intermediate CNAME
Thanos, Prometheus and Golang version used: thanos:v0.28.0
What happened:
When Thanos Query is deployed in a Kubernetes cluster with CoreDNS configured to avoid superfluous DNS requests (with autopath @kubernetes and pods verified), a CNAME may be returned in the DNS response, resulting in an error in storeAPI address resolution.
This happens when using a "relative" DNS name such as service.namespace, without the full cluster DNS domain.
The StoreAPI endpoint is properly discovered when Thanos Query starts, but a few seconds later the resolution fails, removing the endpoint.
It looks like this is because only records of type SRV in the response are handled in https://github.com/thanos-io/thanos/blob/v0.28.0/pkg/discovery/dns/miekgdns/resolver.go#L37
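For illustration, here is a minimal sketch of that behaviour, paraphrased from the linked code and from the error in the logs below (this is not the exact Thanos source): any answer record that is not of type SRV, such as the intermediate CNAME returned with autopath enabled, aborts the whole lookup.

package main

import (
	"fmt"
	"net"

	"github.com/miekg/dns"
)

// srvFromAnswer paraphrases the v0.28.0 behaviour: only *dns.SRV answers are
// accepted, so an intermediate CNAME record fails the whole lookup.
func srvFromAnswer(answer []dns.RR) ([]*net.SRV, error) {
	var recs []*net.SRV
	for _, record := range answer {
		switch rr := record.(type) {
		case *dns.SRV:
			recs = append(recs, &net.SRV{
				Target:   rr.Target,
				Port:     rr.Port,
				Priority: rr.Priority,
				Weight:   rr.Weight,
			})
		default:
			// Source of the "invalid SRV response record" error seen in the logs below.
			return nil, fmt.Errorf("invalid SRV response record %s", record)
		}
	}
	return recs, nil
}

func main() {
	// The CNAME answer produced by CoreDNS autopath (names taken from the manifests below).
	cname := &dns.CNAME{
		Hdr:    dns.RR_Header{Name: "_grpc._tcp.query-test.test.test.svc.cluster.local.", Rrtype: dns.TypeCNAME, Class: dns.ClassINET, Ttl: 5},
		Target: "_grpc._tcp.query-test.test.svc.cluster.local.",
	}
	_, err := srvFromAnswer([]dns.RR{cname})
	fmt.Println(err) // invalid SRV response record ...
}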
What you expected to happen:
The CNAME should be followed to obtain the actual SRV record.
How to reproduce it (as minimally and precisely as possible):
Update the Kubernetes CoreDNS config to include both autopath @kubernetes and pods verified:
kubernetes cluster.local in-addr.arpa ip6.arpa {
    pods verified
    fallthrough in-addr.arpa ip6.arpa
}
autopath @kubernetes
Restart the CoreDNS pods, then deploy the following manifests:
---
apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: query-test
  name: query-test
  namespace: test
spec:
  ports:
  - name: grpc
    port: 10901
  selector:
    app: query-test
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: query-main
  namespace: test
spec:
  selector:
    matchLabels:
      app: query-main
  template:
    metadata:
      labels:
        app: query-main
    spec:
      containers:
      - name: query
        args:
        - query
        - --store=dnssrv+_grpc._tcp.query-test.test
        image: quay.io/thanos/thanos:v0.28.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: query-test
  namespace: test
spec:
  selector:
    matchLabels:
      app: query-test
  template:
    metadata:
      labels:
        app: query-test
    spec:
      containers:
      - args:
        - query
        - --grpc-address=0.0.0.0:10901
        image: quay.io/thanos/thanos:v0.28.0
        name: query
        ports:
        - containerPort: 10901
          name: grpc
Full logs to relevant components:
Query logs:
level=info ts=2022-09-07T12:39:27.90183412Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-09-07T12:39:27.903174142Z caller=query.go:724 msg="starting query node"
level=info ts=2022-09-07T12:39:27.907765004Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-09-07T12:39:27.908028331Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2022-09-07T12:39:27.908649481Z caller=tls_config.go:195 service=http/server component=query msg="TLS is disabled." http2=false
level=info ts=2022-09-07T12:39:27.911832627Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2022-09-07T12:39:27.912201467Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=info ts=2022-09-07T12:39:32.920763992Z caller=endpointset.go:381 component=endpointset msg="adding new query with [storeAPI rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]" address=10.43.60.110:10901 extLset=
level=error ts=2022-09-07T12:39:57.914719788Z caller=query.go:555 msg="failed to resolve addresses for storeAPIs" err="lookup SRV records \"_grpc._tcp.query-test.test\": invalid SRV response record _grpc._tcp.query-test.test.test.svc.cluster.local.\t5\tIN\tCNAME\t_grpc._tcp.query-test.test.svc.cluster.local."
Anything else we need to know:
Without autopath @kubernetes and pods verified:
root@debug-pod:/# grep search /etc/resolv.conf
search test.svc.cluster.local svc.cluster.local cluster.local
root@debug-pod:/# dig +search +noall +answer +additional SRV _grpc._tcp.query-test.test
_grpc._tcp.query-test.test.svc.cluster.local. 5 IN SRV 0 100 10901 query-test.test.svc.cluster.local.
query-test.test.svc.cluster.local. 5 IN A 10.43.60.110
tcpdump:
13:00:42.364131 eth0 Out ifindex 3 86:5d:56:72:50:08 ethertype IPv4 (0x0800), length 141: 10.42.2.20.40814 > 10.43.0.10.53: 53282+ [1au] SRV? _grpc._tcp.query-test.test.default.svc.cluster.local. (93)
13:00:42.368655 eth0 In ifindex 3 ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 234: 10.43.0.10.53 > 10.42.2.20.40814: 53282 NXDomain*- 0/1/1 (186)
13:00:42.369493 eth0 Out ifindex 3 86:5d:56:72:50:08 ethertype IPv4 (0x0800), length 133: 10.42.2.20.41211 > 10.43.0.10.53: 25643+ [1au] SRV? _grpc._tcp.query-test.test.svc.cluster.local. (85)
13:00:42.371152 eth0 In ifindex 3 ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 279: 10.43.0.10.53 > 10.42.2.20.41211: 25643*- 1/0/2 SRV query-test.test.svc.cluster.local.:10901 0 100 (231)
With autopath @kubernetes and pods verified:
root@debug-pod:/# dig +search +noall +answer +additional SRV _grpc._tcp.query-test.test
_grpc._tcp.query-test.test.test.svc.cluster.local. 5 IN CNAME _grpc._tcp.query-test.test.svc.cluster.local.
_grpc._tcp.query-test.test.svc.cluster.local. 5 IN SRV 0 100 10901 query-test.test.svc.cluster.local.
query-test.test.svc.cluster.local. 5 IN A 10.43.60.110
tcpdump:
13:01:33.400013 eth0 Out ifindex 3 86:5d:56:72:50:08 ethertype IPv4 (0x0800), length 141: 10.42.2.20.58526 > 10.43.0.10.53: 41154+ [1au] SRV? _grpc._tcp.query-test.test.default.svc.cluster.local. (93)
13:01:33.404428 eth0 In ifindex 3 ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 397: 10.43.0.10.53 > 10.42.2.20.58526: 41154*- 2/0/2 CNAME _grpc._tcp.query-test.test.svc.cluster.local., SRV query-test.test.svc.cluster.local.:10901 0 100 (349)
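For completeness, the mixed CNAME+SRV answer can also be reproduced outside Thanos with a small Go program using github.com/miekg/dns; the server address 10.43.0.10 and the fully expanded name are taken from the capture and dig output above, so adjust them for your cluster:

package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	c := new(dns.Client)
	m := new(dns.Msg)
	// Fully expanded name, as produced by the pod's search path with autopath in play.
	m.SetQuestion(dns.Fqdn("_grpc._tcp.query-test.test.test.svc.cluster.local"), dns.TypeSRV)

	// 10.43.0.10:53 is the cluster DNS service address from the tcpdump above.
	resp, _, err := c.Exchange(m, "10.43.0.10:53")
	if err != nil {
		log.Fatal(err)
	}
	for _, ans := range resp.Answer {
		switch rr := ans.(type) {
		case *dns.CNAME:
			fmt.Printf("CNAME %s -> %s\n", rr.Hdr.Name, rr.Target)
		case *dns.SRV:
			fmt.Printf("SRV   %s -> %s:%d\n", rr.Hdr.Name, rr.Target, rr.Port)
		default:
			fmt.Printf("other %s\n", ans.String())
		}
	}
}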
I saw that a way to follow CNAMEs was previously added to the LookupIPAddr function via https://github.com/thanos-io/thanos/pull/5271
Could we get the same for SRV lookups?
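In case it helps, here is a rough sketch of one possible approach for the SRV path, assuming the same miekg/dns client: keep the SRV records from the answer, and if the answer only contains CNAMEs, re-issue the SRV query against the CNAME target (with a small depth limit to avoid loops). Function and parameter names are made up for illustration; this is not the actual Thanos code.

package dnssketch

import (
	"context"
	"fmt"
	"net"

	"github.com/miekg/dns"
)

// lookupSRVFollowingCNAME is a hypothetical helper: it accepts SRV answers as
// today, but instead of failing on a CNAME it re-queries the CNAME target.
func lookupSRVFollowingCNAME(ctx context.Context, client *dns.Client, server, name string, depth int) ([]*net.SRV, error) {
	if depth > 8 {
		return nil, fmt.Errorf("too many CNAME indirections resolving %q", name)
	}

	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeSRV)
	resp, _, err := client.ExchangeContext(ctx, m, server)
	if err != nil {
		return nil, err
	}

	var srvs []*net.SRV
	var cnames []string
	for _, record := range resp.Answer {
		switch rr := record.(type) {
		case *dns.SRV:
			srvs = append(srvs, &net.SRV{Target: rr.Target, Port: rr.Port, Priority: rr.Priority, Weight: rr.Weight})
		case *dns.CNAME:
			// Remember the canonical name instead of erroring out.
			cnames = append(cnames, rr.Target)
		default:
			return nil, fmt.Errorf("invalid SRV response record %s", record)
		}
	}

	// CoreDNS returns the SRV for the canonical name in the same answer in the
	// dig output above; only chase the CNAME if it did not.
	if len(srvs) == 0 {
		for _, target := range cnames {
			followed, err := lookupSRVFollowingCNAME(ctx, client, server, target, depth+1)
			if err != nil {
				return nil, err
			}
			srvs = append(srvs, followed...)
		}
	}
	return srvs, nil
}

An even simpler variant might be to just skip CNAME records when the same response already carries the SRV for the canonical name, as it does in the dig output above.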
Thanks.
Thanks for the detailed report @Tassatux! This sounds like a reasonable request to me. Are you happy to take this or shall I open this for others to work on? :slightly_smiling_face:
I'm not sure how to properly fix it, so I prefer that someone with more Go knowledge take a look. :slightly_smiling_face:
Can I work on this one??
Sure @h20220025 go for it :rocket:
Are you still on it @h20220025 ?
Hello👋 I'd like to work on this issue :)
Hey @Atharva-Shinde I'd say go for it :rocket:, maybe @h20220025 didn't get a chance to pick this up after all :upside_down_face: