Trying to add observee cluster, but observer thanos query cannot discover external thanos-discovery sidecar
Thanos, Prometheus and Golang version used: docker.io/bitnami/thanos:0.31.0-scratch-r8
Object Storage Provider: Amazon S3
What happened: I am trying to add a Thanos sidecar from another EKS cluster (Cluster B) to the Thanos query store (Cluster A).
In Cluster A, I used the Helm chart (kube-prometheus-stack:47.3.0) and exposed the Thanos sidecar through an ALB ingress (AWS Load Balancer Controller).
# Ingress exposes thanos sidecar outside the cluster
thanosIngress:
  enabled: true
  ingressClassName: alb
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    alb.ingress.kubernetes.io/load-balancer-name: alpha-prometheus-alb-ingress
    alb.ingress.kubernetes.io/backend-protocol: HTTP
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    alb.ingress.kubernetes.io/group.name: prometheus-alpha
    alb.ingress.kubernetes.io/target-type: 'ip'
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/security-groups: sg-xxxxxxxxx
    alb.ingress.kubernetes.io/manage-backend-security-group-rules: "true"
    alb.ingress.kubernetes.io/subnets: subnet-xxxxxxxx, subnet-xxxxxxxx
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/healthcheck-path: /-/healthy
    alb.ingress.kubernetes.io/certificate-arn: <ACM ARN>
  labels: {}
  # servicePort: 10901
  ## Port to expose on each node
  ## Only used if service.type is 'NodePort'
  ##
  nodePort: 30901
  ## Hosts must be provided if Ingress is enabled.
  ##
  hosts:
    - thanos-sc-alpha.alpha.example.in
    # - thanos-gateway.domain.com
  ## Paths to use for ingress rules
  ##
  paths:
    - /*
    # - /
  ## For Kubernetes >= 1.18 you should specify the pathType (determines how Ingress paths should be matched)
  ## See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#better-path-matching-with-path-types
  pathType: ImplementationSpecific
  ## TLS configuration for Thanos Ingress
  ## Secret must be manually created in the namespace
  ##
  tls:
    - secretName: thanos-gateway-tls
      hosts:
        - thanos-sc-alpha.alpha.example.in
#
After the installation, I was able to reach the gRPC endpoint with grpcurl.
$ grpcurl thanos-sc-alpha.alpha.example.in:443 list
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
thanos.Exemplars
thanos.Metadata
thanos.Rules
thanos.Store
thanos.Targets
thanos.info.Info
$ grpcurl thanos-sc-alpha.alpha.example.in:443 list grpc.health.v1.Health
grpc.health.v1.Health.Check
grpc.health.v1.Health.Watch
$ grpcurl thanos-sc-alpha.alpha.example.in:443 grpc.health.v1.Health.Check
{
  "status": "SERVING"
}
However, my thanos-query in Cluster B cannot discover the sidecar.
query:
  replicaCount: 1
  extraFlags: []
  stores:
    - prometheus-inhouse-kube-pr-thanos-discovery:10901  # local thanos sidecar (in Cluster A): works
    - dns+thanos-sc-alpha.alpha.example.in:443           # external thanos sidecar: not discovered
level=info ts=2023-12-04T10:33:56.058563173Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2023-12-04T10:33:56.058957254Z caller=query.go:840 msg="starting query node"
level=info ts=2023-12-04T10:33:56.059286219Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2023-12-04T10:33:56.059301679Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2023-12-04T10:33:56.059411941Z caller=tls_config.go:232 service=http/server component=query msg="Listening on" address=[::]:10902
level=info ts=2023-12-04T10:33:56.059442163Z caller=tls_config.go:235 service=http/server component=query msg="TLS is disabled." http2=false address=[::]:10902
level=info ts=2023-12-04T10:33:56.059466435Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2023-12-04T10:33:56.059483879Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=warn ts=2023-12-04T10:34:06.064023703Z caller=endpointset.go:451 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from x.x.x.x:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=x.x.x.x:443
level=warn ts=2023-12-04T10:34:06.064142552Z caller=endpointset.go:451 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from x.x.x.x:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=3.36.218.184:443
......
FYI, the ALB security group is open to thanos query, as well as to my local laptop.
What you expected to happen: Thanos query should be able to discover the external sidecar, which is exposed over gRPC through an AWS ALB.
How to reproduce it (as minimally and precisely as possible):
as above.
Full logs to relevant components: as above.
Anything else we need to know:
For more information, thanos query's args:
args:
  - query
  - '--log.level=info'
  - '--log.format=logfmt'
  - '--grpc-address=0.0.0.0:10901'
  - '--http-address=0.0.0.0:10902'
  - '--query.replica-label=replica'
  - >-
    --endpoint=dnssrv+_grpc._tcp.thanos-query-xxxx-storegateway.monitoring.svc.cluster.local
  - >-
    --endpoint=dnssrv+_grpc._tcp.thanos-query-xxxx-ruler.monitoring.svc.cluster.local
  - '--endpoint=prometheus-inhouse-kube-pr-thanos-discovery:10901'
  - '--endpoint=dns+thanos-sc-alpha.alpha.example.in:443'
  - '--grpc-client-server-name=thanos-sc-alpha.alpha.example.in'
I have tried setting grpc.server.tls.enable: true, grpc.client.tls.enable: true, or both, but none of these worked.
I have also gone through similar issues and tried what they suggest (e.g. --grpc-client-tls-secure), also without success ;(
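For readers following along, a client-side TLS attempt like the one mentioned above would look roughly like the fragment below in the query container args. This is only a sketch of the kind of configuration that was tried, not a confirmed fix: the post reports that it did not resolve the DeadlineExceeded error. It assumes the ALB terminates TLS on 443 with a publicly trusted certificate, so no custom CA flag is shown. To my knowledge, in this Thanos version the --grpc-client-tls-* flags apply to every --endpoint, so enabling them may break plaintext in-cluster endpoints such as the local sidecar.

```yaml
# Sketch only: client-side TLS towards a TLS-terminating ALB.
# The thread reports that this did NOT fix the DeadlineExceeded error.
args:
  - query
  - '--grpc-address=0.0.0.0:10901'
  - '--http-address=0.0.0.0:10902'
  - '--endpoint=dns+thanos-sc-alpha.alpha.example.in:443'
  # Dial endpoints over TLS instead of plaintext gRPC.
  - '--grpc-client-tls-secure'
  # Server name used to verify the returned certificate;
  # it should match the certificate served by the ALB.
  - '--grpc-client-server-name=thanos-sc-alpha.alpha.example.in'
```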
Hello @nessa829, were you able to fix that?
@KM3dd Hi, I changed it to create an NLB (Service type: LoadBalancer) instead of an ALB, and it worked.
@nessa829 thank you for your response. That's what I am trying to do, but I am new to this and got stuck. Did you keep using nginx with the Service type set to LoadBalancer, or did you expose the service directly and use the external IP address? Thank you again.
@KM3dd I disabled thanosIngress and enabled thanosServiceExternal instead:
thanosServiceExternal:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-name: "thanos-sc-lb"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-subnets: {{prometheus.subnet}}
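Put together, the change described here could look like the values sketch below. It is an illustration under assumptions, not the exact configuration used: the `prometheus:` parent key and the `enabled` switches follow the kube-prometheus-stack values layout, the subnet ID is a placeholder, the NLB DNS name in the query endpoint is hypothetical, and port 10901 is assumed to be the port the external service exposes. An NLB forwards TCP without terminating TLS, so the querier can presumably talk plaintext gRPC to the sidecar, just as it does for the local endpoint. With the scheme set to "internal", the two clusters also need private network connectivity to each other.

```yaml
# Cluster exposing the sidecar: kube-prometheus-stack values (sketch)
prometheus:
  thanosIngress:
    enabled: false
  thanosServiceExternal:
    enabled: true
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-name: "thanos-sc-lb"
      service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
      service.beta.kubernetes.io/aws-load-balancer-type: "external"
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
      # Placeholder subnet ID
      service.beta.kubernetes.io/aws-load-balancer-subnets: "subnet-xxxxxxxx"
---
# Cluster running thanos query: query values (sketch)
query:
  stores:
    # Local sidecar, unchanged
    - prometheus-inhouse-kube-pr-thanos-discovery:10901
    # Hypothetical: replace with the DNS name AWS assigns to the NLB;
    # port 10901 is an assumption
    - "dns+<NLB DNS name>:10901"
```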
@nessa829 thank you very much for your help
@nessa829 would you mind writing a brief description of how you solved the issue?
- I understand that you used a LoadBalancer Service instead of an Ingress?
- Also, when do you use thanosService vs thanosServiceExternal? If Cluster A is the "master" cluster, wouldn't it be appropriate to define thanosService there and set up thanosServiceExternal only on the observer (slave) clusters?
- I am especially curious about your thanos query's args.