Thanos sidecar is in READY state when Prometheus is unavailable
Hello everyone,
TL;DR: We are using the traditional combination like many others: Prometheus with a Thanos sidecar. However, we ran into an issue when our Prometheus got OOM-killed: we expected Thanos Query to stop routing traffic to the unavailable Prometheus pod, but in fact traffic was still forwarded as usual. I guess the main reason is that the Thanos sidecar was still in the READY state at that moment.
I assumed this PR had been included since v0.23, and the version we are currently using is v0.25, so it should not be a problem.
The manifest details
Thanos-Query
Container ID: containerd://942e9480c2430a3328d9478598db27601f0c91f83526e0006245d9566401b6ee
Image: quay.io/thanos/thanos:v0.25.0
Image ID: quay.io/thanos/thanos@sha256:bc3657f2b793f2f482991807e5e5a637f1ae4f1c75fb58d563c18a447ea61b8b
Ports: 10901/TCP, 9090/TCP
Host Ports: 0/TCP, 0/TCP
Args:
query
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:9090
--log.level=info
--log.format=logfmt
--query.replica-label=prometheus_replica
--query.replica-label=rule_replica
--store=prometheus-k8s.prometheus-operator.svc.cluster.local:10901 #short term
--store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local:10901 #long term
Thanos-Store
thanos-store:
Container ID: containerd://fe2a3839ff2356f937b0b5f6cf056d45ab8feb7988bb08a11c62dbd9035d3c36
Image: quay.io/thanos/thanos:v0.25.0
Image ID: quay.io/thanos/thanos@sha256:bc3657f2b793f2f482991807e5e5a637f1ae4f1c75fb58d563c18a447ea61b8b
Ports: 10901/TCP, 10902/TCP
Host Ports: 0/TCP, 0/TCP
Args:
store
--log.level=info
--log.format=logfmt
--data-dir=/var/thanos/store
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--objstore.config=$(OBJSTORE_CONFIG)
--ignore-deletion-marks-delay=24h
--index-cache-size=20GB
--index-cache.config=
  "config":
    "addr": "REDACTED:6379"
    "db": 0
  "type": "REDIS"
--store.caching-bucket.config=
  "config":
    "addr": "REDACTED:6379"
    "db": 1
  "type": "REDIS"
K8s Service for Prometheus-K8s
spec:
  clusterIP: 100.70.78.65
  clusterIPs:
    - 100.70.78.65
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: metrics
      port: 9090
      protocol: TCP
      targetPort: metrics
    - name: grpc
      port: 10901
      protocol: TCP
      targetPort: grpc
  selector:
    app.kubernetes.io/name: prometheus
  sessionAffinity: None
  type: ClusterIP
Prometheus Pod (4 containers inside):
Containers:
prometheus:
Container ID: containerd://48401e685253283d0b967e9b433eb9fb9c5b9b1e4208efbe71d06d4bad81f257
Image: prom/prometheus:v2.33.5
..
Port: 9090/TCP
Host Port: 0/TCP
Args:
--web.console.templates=/etc/prometheus/consoles
--web.console.libraries=/etc/prometheus/console_libraries
--config.file=/etc/prometheus/config_out/prometheus.env.yaml
--storage.tsdb.path=/prometheus
--storage.tsdb.retention.time=24h
--web.enable-lifecycle
--web.enable-admin-api
--web.external-url=https://prometheus-main.domain
--web.route-prefix=/
--web.config.file=/etc/prometheus/web_config/web-config.yaml
--storage.tsdb.max-block-duration=2h
--storage.tsdb.min-block-duration=2h
State: Running
....
config-reloader:
Container ID: containerd://c5a1e0dd480cca7ce4be2b132e461dc3f0df6f8b3951b272fdd2e227452d862f
....
thanos-sidecar:
Container ID: containerd://85efbd7813b37f7c653c128238d0486d4db1487ee4a91115c2038e81a3879413
Image: quay.io/thanos/thanos:v0.25.0
Image ID: quay.io/thanos/thanos@sha256:bc3657f2b793f2f482991807e5e5a637f1ae4f1c75fb58d563c18a447ea61b8b
Ports: 10902/TCP, 10901/TCP
Host Ports: 0/TCP, 0/TCP
Args:
sidecar
--prometheus.url=http://localhost:9090/
--grpc-address=:10901
--http-address=:10902
--objstore.config=$(OBJSTORE_CONFIG)
--tsdb.path=/prometheus
--log.level=info
State: Running
Started: Mon, 23 May 2022 13:22:31 +0200
Ready: True
...
vault-agent:
Container ID: containerd://bfdc4bb9b055a83ec1e2cfe28fd8687e6cfadd3590424ad5cdf157bc438dfe7d
....
Mounts:
/etc/vault/config from vault-config (ro)
/home/vault from vault-token (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c7fd2 (ro)
The actual problem: when Prometheus gets OOM-killed, the Thanos sidecar is still Ready and traffic is still forwarded to that pod.
What I expect: if Prometheus is killed, the Thanos sidecar state should change to NOT READY to prevent traffic from being sent there.
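A possible mitigation, sketched below (this is not from the Thanos or Prometheus Operator docs; it assumes the sidecar's /-/ready endpoint on its HTTP port 10902 actually reflects Prometheus availability, which is exactly the behavior requested in this issue), would be to give the thanos-sidecar container its own readiness probe through the Prometheus custom resource, so the pod drops out of the prometheus-k8s Service endpoints when the sidecar reports not ready:

# Sketch: strategic-merge patch in the Prometheus custom resource,
# adding a readiness probe to the operator-generated thanos-sidecar container.
spec:
  containers:
    - name: thanos-sidecar
      readinessProbe:
        httpGet:
          path: /-/ready      # Thanos components expose /-/ready on the HTTP port
          port: 10902
        periodSeconds: 5
        failureThreshold: 3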
Logs from Thanos Query when Prometheus was down:
As you can see, Thanos Query forwarded traffic to the prometheus-main-1 pod, which was OOM-killed at that time.
Versions: Prometheus Operator v0.53.0, Prometheus v2.33.5, Thanos Query / Sidecar v0.25.0, Kubernetes v1.21.1
In such a situation, what happens if you curl /api/v1/status/config on Prometheus when it goes down?
@GiedriusS I have not checked that yet, but I remember the situation at that moment:
In total, 3 of the 4 containers were available; the Prometheus container was NOT READY.
The Prometheus container was replaying the WAL at that point, which I believe is a well-known issue.
Over the past two weeks we have raised the readiness and startup probes to a failure threshold of 1000, because WAL replay takes quite long for us (usually more than 20 minutes). That lets Prometheus restart successfully, but I don't know whether it affects the Thanos sidecar flow. The rendered probes look like this:
Readiness: http-get http://:metrics/-/ready delay=0s timeout=3s period=5s #success=1 #failure=1000
Startup: http-get http://:metrics/-/ready delay=0s timeout=3s period=15s #success=1 #failure=1000
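In plain pod-spec YAML, that corresponds roughly to the following (a sketch of what the operator renders for the prometheus container):

readinessProbe:
  httpGet:
    path: /-/ready
    port: metrics            # the container port named "metrics" (9090)
  timeoutSeconds: 3
  periodSeconds: 5
  failureThreshold: 1000     # tolerate long WAL replay (usually >20 minutes)
startupProbe:
  httpGet:
    path: /-/ready
    port: metrics
  timeoutSeconds: 3
  periodSeconds: 15
  failureThreshold: 1000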
Hmm, I just realized that a Kubernetes Service actually has no load-balancing capability for gRPC traffic, as mentioned there.
So if I want traffic to be distributed across the pods, with health checks ensuring only READY pods receive it, I may need to handle it at the ingress layer (e.g. Linkerd or the NGINX ingress).
I was also thinking of using a traditional headless Service (clusterIP: None) in the usual way, but that would not help when one of the Prometheus pods is unavailable; see the sketch below.
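For reference, the headless variant being discussed would look roughly like this (a sketch only; the Service name prometheus-k8s-grpc is made up here). Thanos Query would then be pointed at it with SRV-based DNS discovery, so it resolves and health-checks each sidecar endpoint itself rather than going through a single ClusterIP:

apiVersion: v1
kind: Service
metadata:
  name: prometheus-k8s-grpc          # hypothetical name, for illustration only
  namespace: prometheus-operator
spec:
  clusterIP: None                    # headless: DNS resolves to the individual pod IPs
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
  selector:
    app.kubernetes.io/name: prometheus

# Thanos Query flag, replacing the ClusterIP-based --store entry:
#   --store=dnssrv+_grpc._tcp.prometheus-k8s-grpc.prometheus-operator.svc.cluster.local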
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
I think the sidecar now becomes not ready in such a case. This should be fixed by https://github.com/thanos-io/thanos/pull/4939