thanos
Routing-receive does not fail or perform new DNS resolution on receive instance IP change
Thanos version used: 0.24
Object Storage Provider: N/A
What happened: Thanos-routing-receive resolves DNS once on startup and then caches the result for the remainder of the runtime. To make it resolve the domain names again the pod has to be redeployed.
This affects the communication between thanos-routing-receive and thanos-receive. The routing-receive resolves the DNS names of the thanos-receive instances listed in the hashring.json file (managed by the receive-controller). When the receives are restarted they get new IP addresses, but thanos-routing-receive still tries to communicate with the old ones, so requests to the receive instances fail.
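To make the data flow concrete, the sketch below parses a hashring document of the shape shown in the config further down and resolves each endpoint host. The helper names (hashring_endpoints, resolve_all) and the injectable resolve parameter are hypothetical and are not Thanos APIs; Thanos itself is written in Go, so this Python is only an illustration.

```python
import json
import socket


def hashring_endpoints(hashrings_json):
    """Return (host, port) pairs for every endpoint in a hashrings.json document."""
    pairs = []
    for ring in json.loads(hashrings_json):
        for endpoint in ring.get("endpoints", []):
            # Split on the last colon so the port is separated from the host name.
            host, _, port = endpoint.rpartition(":")
            pairs.append((host, int(port)))
    return pairs


def resolve_all(pairs, resolve=socket.gethostbyname):
    """Map each host to its current IP; `resolve` is injectable (an assumption for testing)."""
    return {host: resolve(host) for host, _ in pairs}
```

Running resolve_all before and after a receive restart would show whether the name itself now points at a different address.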
What you expected to happen:
When the receive handler used by routing-receive hits an errUnavailable error, the expected behavior would be to either:
- Perform another name resolution, since a DNS change is one possible cause of the error, or
- Exit with a non-zero code, so that Kubernetes restarts the pod, which triggers a new name resolution
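The two options above can be sketched roughly as follows. This is not how Thanos implements forwarding; the Unavailable exception stands in for errUnavailable, and forward, forward_or_exit, and the injected resolve/send callables and cache dict are all hypothetical names chosen for illustration.

```python
import sys


class Unavailable(Exception):
    """Stand-in for the handler's errUnavailable condition (assumption)."""


def forward(host, payload, resolve, send, cache, max_attempts=2):
    """Send payload to host, resolving the name again if the cached IP fails."""
    for _ in range(max_attempts):
        ip = cache.get(host)
        if ip is None:
            ip = cache[host] = resolve(host)
        try:
            return send(ip, payload)
        except Unavailable:
            # Option 1: drop the cached address so the next attempt resolves again.
            cache.pop(host, None)
    raise Unavailable(f"{host} unreachable after {max_attempts} attempts")


def forward_or_exit(host, payload, resolve, send, cache):
    """Option 2: exit non-zero so the orchestrator restarts the process."""
    try:
        return forward(host, payload, resolve, send, cache)
    except Unavailable as err:
        print(f"giving up: {err}", file=sys.stderr)
        sys.exit(1)
```

Option 1 keeps the process alive and only costs one extra lookup per failure; option 2 is cruder but relies on nothing beyond the pod restart policy.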
How to reproduce it (as minimally and precisely as possible):
Restart a thanos receive instance to give it a new IP, then observe the behavior of the corresponding routing-receive instance.
Full logs to relevant components:
routing-receive log after restarting the corresponding receive component:
level=debug ts=2022-04-04T09:40:44.358335368Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-xxx-0.thanos-receive-xxx.thanos.svc.cluster.local:10901: target not available"
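The "backing off forward request" wording suggests that the router keeps per-endpoint backoff state and rejects forwards until a deadline has passed, without attempting a new connection in the meantime. The sketch below illustrates that general idea only; the class name, parameters, and delays are assumptions and do not reflect the actual Thanos code.

```python
import time


class EndpointBackoff:
    """Per-endpoint exponential backoff; calls are rejected until the deadline passes."""

    def __init__(self, base=0.1, cap=30.0, now=time.monotonic):
        self.base, self.cap, self.now = base, cap, now
        self.state = {}  # endpoint -> (consecutive_failures, next_allowed)

    def allowed(self, endpoint):
        _, next_allowed = self.state.get(endpoint, (0, 0.0))
        return self.now() >= next_allowed

    def record_failure(self, endpoint):
        failures, _ = self.state.get(endpoint, (0, 0.0))
        failures += 1
        # Double the delay on every consecutive failure, up to the cap.
        delay = min(self.cap, self.base * 2 ** (failures - 1))
        self.state[endpoint] = (failures, self.now() + delay)

    def record_success(self, endpoint):
        self.state.pop(endpoint, None)
```

With a scheme like this, a log line such as the one above can be emitted while no request is actually being sent to the endpoint, which makes the failure look like a resolution problem from the outside.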
Thanks for the report. Could you share how you configured the components? It can be just the basics and omit any 'personal' details.
Sure, here is the config for the 2 routing-receive and receive components. Thanks!
# Source: templates/receive.yaml
# the headless service is needed so thanos-query can reach all individual pods of the thanos-receive statefulSet on the GRPC port
apiVersion: v1
kind: Service
metadata:
  name: thanos-receive-headless
  namespace: thanos
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - port: 10901
      protocol: TCP
      name: grpc
  selector:
    app: thanos-receive
---
# Source: templates/receive.yaml
# the normal service is needed so the thanos-routing-receives can discover the thanos-receives via DNS
# ie. thanos-receive-0.thanos-receive.thanos.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: thanos-receive
  namespace: thanos
  labels:
    product: thanos
    team: teamname
    app: thanos-receive
    auth-app: thanos-receive
    controller.receive.thanos.io: thanos-receive-controller
    tenant: xxx
    controller.receive.thanos.io/hashring: xxx
spec:
  ports:
    - port: 10901
      protocol: TCP
      name: grpc
    - port: 10908
      protocol: TCP
      name: http-receive
    - port: 10909
      protocol: TCP
      name: http
  selector:
    app: thanos-receive
    controller.receive.thanos.io: thanos-receive-controller
    tenant: xxx
    controller.receive.thanos.io/hashring: xxx
    auth-app: thanos-receive
---
# Source: templates/receive.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: thanos-receive
    controller.receive.thanos.io: thanos-receive-controller
    tenant: xxx
    controller.receive.thanos.io/hashring: xxx
    auth-app: thanos-receive
  name: thanos-receive
  namespace: thanos
spec:
  serviceName: thanos-receive
  replicas: 1
  selector:
    matchLabels:
      app: thanos-receive
      controller.receive.thanos.io: thanos-receive-controller
      tenant: xxx
      controller.receive.thanos.io/hashring: xxx
      auth-app: thanos-receive
  template:
    metadata:
      labels:
        app: thanos-receive
        controller.receive.thanos.io: thanos-receive-controller
        tenant: xxx
        controller.receive.thanos.io/hashring: xxx
        auth-app: thanos-receive
    spec:
      containers:
        - name: thanos-receive
          image: "thanos:v0.26.0"
          imagePullPolicy: Always
          ports:
            - containerPort: 10908
              name: remote-write
            - containerPort: 10901
              name: grpc
            - containerPort: 10909
              name: http
          livenessProbe:
            initialDelaySeconds: 5
            failureThreshold: 4
            httpGet:
              port: 10909
              scheme: HTTP
              path: /-/healthy
            periodSeconds: 30
          readinessProbe:
            initialDelaySeconds: 5
            failureThreshold: 20
            httpGet:
              port: 10909
              scheme: HTTP
              path: /-/ready
            periodSeconds: 5
          resources:
            requests:
              memory: 10Gi
              cpu: 2000m
            limits:
              memory: 12Gi
              cpu: 3000m
          args:
            - receive
            - --tsdb.path=/data/tsdb
            - --label=receive_replica="$(NAME)"
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10909
            # each pod needs to find its own domain name in the hashring config
            - --receive.local-endpoint=$(NAME).$(NAMESPACE).svc.cluster.local:10901
            - --remote-write.address=0.0.0.0:10908
            - --receive.default-tenant-id="unknown" # only used when 'THANOS-TENANT' header is not present, should not happen
            - --tsdb.retention=8d
            - |
              --objstore.config=type: AZURE
              config:
                storage_account: "storage"
                storage_account_key: $(STORAGE_ACCOUNT_KEY)
                container: "thanos-gen1"
                endpoint: "" # https://thanos-gen1.blob.core.windows.net must resolve to private ip
                max_retries: 5
            - --log.level=debug
          volumeMounts:
            - name: data
              mountPath: /data/tsdb
          env:
            - name: STORAGE_ACCOUNT_KEY
              valueFrom:
                secretKeyRef:
                  name: keystorage
                  key: key1
            # below env variables allow each pod to identify themselves
            # ie. in the hashring config and to set a label
            - name: NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
      volumes:
      imagePullSecrets:
        - name: registry
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 256Gi
        storageClassName: standard
---
# Source: templates/routing-receive.yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-routing-receive
  namespace: thanos
  labels:
    product: thanos
    app: thanos-routing-receive
spec:
  ports:
    - port: 10901
      protocol: TCP
      name: grpc
    - port: 10908
      protocol: TCP
      name: http-receive
    - port: 10909
      protocol: TCP
      name: http
  selector:
    app: thanos-routing-receive
---
# Source: templates/routing-receive.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: thanos-routing-receive
  name: thanos-routing-receive
  namespace: thanos
spec:
  replicas: 3
  selector:
    matchLabels:
      app: thanos-routing-receive
  template:
    metadata:
      labels:
        app: thanos-routing-receive
    spec:
      containers:
        - name: thanos-routing-receive
          image: "thanos:v0.26.0"
          imagePullPolicy: Always
          ports:
            - containerPort: 10908
              name: remote-write
            - containerPort: 10901
              name: grpc
            - containerPort: 10909
              name: http
          livenessProbe:
            initialDelaySeconds: 5
            failureThreshold: 4
            httpGet:
              port: 10909
              scheme: HTTP
              path: /-/healthy
            periodSeconds: 30
          readinessProbe:
            initialDelaySeconds: 5
            failureThreshold: 20
            httpGet:
              port: 10909
              scheme: HTTP
              path: /-/ready
            periodSeconds: 5
          resources:
            requests:
              memory: 2Gi
              cpu: 2000m
            limits:
              memory: 3Gi
              cpu: 3000m
          args:
            - receive
            - --tsdb.path=/data/tsdb
            - --label=receive_replica="$(NAME)"
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10909
            - --remote-write.address=0.0.0.0:10908
            - --tsdb.retention=8d
            - --receive.hashrings-file=/etc/hashring/hashrings.json
            - --log.level=debug
          volumeMounts:
            - name: thanos-receive-hashring-generated
              mountPath: /etc/hashring
          env:
            # below env variables allow each pod to identify themselves
            # ie. in the hashring config and to set a label
            - name: NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
      volumes:
        - name: thanos-receive-hashring-generated
          configMap:
            defaultMode: 420
            name: thanos-receive-hashring-generated
      imagePullSecrets:
        - name: registry
---
apiVersion: v1
data:
  hashrings.json: '[{"hashring":"xxx","tenants":["xxx"],"endpoints":["thanos-receive-0:10901"]}]'
kind: ConfigMap
metadata:
  creationTimestamp: "2022-05-06T13:46:41Z"
  managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:data:
          .: {}
          f:hashrings.json: {}
        f:metadata:
          f:ownerReferences:
            .: {}
            k:{"uid":"***"}: {}
      manager: Go-http-client
      operation: Update
      time: "2022-05-06T13:46:41Z"
  name: thanos-receive-hashring-generated
  namespace: thanos
  ownerReferences:
    - apiVersion: v1
      kind: ConfigMap
      name: thanos-receive-hashring-base
      uid: "***"
  resourceVersion: "10"
  uid: "***"
Can anyone link to the relevant code that does the IP caching?
Seems to be this function: https://github.com/thanos-io/thanos/blob/main/pkg/receive/handler.go#L815
Hi all, we did some more investigation into this issue. It is not a problem with Thanos, nor with DNS resolution. We run an Istio proxy sidecar next to the Thanos containers. The Istio proxy was not able to detect the gRPC connection termination when the thanos-receive component was restarted. As a result, thanos-routing-receive kept sending requests over the old connection/channel, and they timed out because there was nothing on the other side. By tweaking Istio we were able to solve the problem.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.