Routing-receive does not fail or perform new DNS resolution on receive instance IP change

Open erispoe opened this issue 2 years ago • 6 comments

Thanos version used: 0.24

Object Storage Provider: N/A

What happened: Thanos-routing-receive resolves DNS once on startup and then caches the result for the remainder of the runtime. To make it resolve the domain names again the pod has to be redeployed.

This affects the communication between thanos-routing-receive and thanos-receive. The routing-receive resolves the DNS names of the thanos-receive instances listed in the hashrings.json file (managed by the receive-controller). When the receive instances are restarted they get new IP addresses, but thanos-routing-receive keeps trying to reach the old ones, so requests to the receive instances fail.

What you expected to happen: When the receive handler used by routing-receive hits an errUnavailable error, the expected behavior would be to either:

  • Perform another name resolution, since a DNS change is one possible cause of the error, or
  • Exit with a non-zero code, so that Kubernetes restarts the pod, which triggers a new name resolution
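
The first option could look roughly like the sketch below. It is Python for brevity and is not Thanos code (Thanos is written in Go); `ReResolvingForwarder`, `EndpointUnavailable` and the `send` callback are hypothetical names used only for illustration. The idea is just: keep the addresses resolved at startup, and on an "unavailable" error resolve the name again and retry once.

```python
import socket
from typing import Callable, List


class EndpointUnavailable(Exception):
    """Raised by the (hypothetical) send callback when the target cannot be reached."""


def resolve(host: str, port: int) -> List[str]:
    """Return the IP addresses currently published for host (a real DNS lookup)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class ReResolvingForwarder:
    """Forwards payloads to host:port and re-resolves the name when sending fails."""

    def __init__(self, host: str, port: int,
                 send: Callable[[str, int, bytes], None],
                 resolver: Callable[[str, int], List[str]] = resolve) -> None:
        self.host, self.port = host, port
        self._send = send
        self._resolver = resolver
        self._addrs = resolver(host, port)  # resolved once at startup

    def _try_all(self, payload: bytes) -> str:
        # Try every cached address; return the one that accepted the payload.
        for addr in self._addrs:
            try:
                self._send(addr, self.port, payload)
                return addr
            except EndpointUnavailable:
                continue
        raise EndpointUnavailable(f"{self.host}:{self.port}: target not available")

    def forward(self, payload: bytes) -> str:
        try:
            return self._try_all(payload)
        except EndpointUnavailable:
            # The cached addresses may be stale: resolve again and retry once.
            self._addrs = self._resolver(self.host, self.port)
            return self._try_all(payload)
```

The resolver is injectable so the behavior can be exercised without a real DNS server. If the retry also fails, the error propagates, at which point the second option (exiting non-zero) would still be available to the caller.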

How to reproduce it (as minimally and precisely as possible): Restart a thanos-receive instance so that it gets a new IP, then observe the behavior of the corresponding routing-receive instance.

Full logs to relevant components: routing-receive log after restarting corresponding receive component:

level=debug ts=2022-04-04T09:40:44.358335368Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-xxx-0.thanos-receive-xxx.thanos.svc.cluster.local:10901: target not available" 

erispoe avatar May 02 '22 09:05 erispoe

Thanks for the report. Could you share how you configured the components? It can be just the basics and omit any 'personal' details.

wiardvanrij avatar May 08 '22 02:05 wiardvanrij

Sure, here is the config for the two components, routing-receive and receive. Thanks!

Config:
# Source: templates/receive.yaml
# the headless service is needed so thanos-query can reach all individual pods of the thanos-receive StatefulSet on the gRPC port
apiVersion: v1
kind: Service
metadata:
  name: thanos-receive-headless
  namespace: thanos
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - port: 10901
      protocol: TCP
      name: grpc
  selector:
    app: thanos-receive
---
# Source: templates/receive.yaml
# the normal service is needed so the thanos-routing-receives can discover the thanos-receives via DNS
# i.e. thanos-receive-0.thanos-receive.thanos.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: thanos-receive
  namespace: thanos
  labels: 
    product: thanos
    team: teamname
    app: thanos-receive
    auth-app: thanos-receive
    controller.receive.thanos.io: thanos-receive-controller
    tenant: xxx
    controller.receive.thanos.io/hashring: xxx
spec:
  ports:
    - port: 10901
      protocol: TCP
      name: grpc
    - port: 10908
      protocol: TCP
      name: http-receive
    - port: 10909
      protocol: TCP
      name: http
  selector:
    app: thanos-receive
    controller.receive.thanos.io: thanos-receive-controller
    tenant: xxx
    controller.receive.thanos.io/hashring: xxx
    auth-app: thanos-receive
---
# Source: templates/receive.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: thanos-receive
    controller.receive.thanos.io: thanos-receive-controller
    tenant: xxx
    controller.receive.thanos.io/hashring: xxx
    auth-app: thanos-receive
  name: thanos-receive
  namespace: thanos
spec:
  serviceName: thanos-receive
  replicas: 1
  selector:
    matchLabels:
      app: thanos-receive
      controller.receive.thanos.io: thanos-receive-controller
      tenant: xxx
      controller.receive.thanos.io/hashring: xxx
      auth-app: thanos-receive
  template:
    metadata:
      labels:
        app: thanos-receive
        controller.receive.thanos.io: thanos-receive-controller
        tenant: xxx
        controller.receive.thanos.io/hashring: xxx
        auth-app: thanos-receive
    spec:
      containers:
        - name: thanos-receive
          image: "thanos:v0.26.0"
          imagePullPolicy: Always
          ports:
            - containerPort: 10908
              name: remote-write
            - containerPort: 10901
              name: grpc
            - containerPort: 10909
              name: http
        
          livenessProbe:
            initialDelaySeconds: 5
            failureThreshold: 4
            httpGet:
              port: 10909
              scheme: HTTP
              path: /-/healthy
            periodSeconds: 30
        
          readinessProbe:
            initialDelaySeconds: 5
            failureThreshold: 20
            httpGet:
              port: 10909
              scheme: HTTP
              path: /-/ready
            periodSeconds: 5
          resources:
            requests:
              memory: 10Gi
              cpu: 2000m
            limits:
              memory: 12Gi
              cpu: 3000m
          args:
            - receive
            - --tsdb.path=/data/tsdb
            - --label=receive_replica="$(NAME)"
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10909
            # each pod needs to find its own domain name in the hashring config
            - --receive.local-endpoint=$(NAME).$(NAMESPACE).svc.cluster.local:10901
            - --remote-write.address=0.0.0.0:10908
            - --receive.default-tenant-id="unknown" # only used when 'THANOS-TENANT' header is not present, should not happen
            - --tsdb.retention=8d
            - |
              --objstore.config=type: AZURE
              config:
                storage_account: "storage"
                storage_account_key: $(STORAGE_ACCOUNT_KEY)
                container: "thanos-gen1"
                endpoint: "" # https://thanos-gen1.blob.core.windows.net must resolve to private ip
                max_retries: 5
            - --log.level=debug
          volumeMounts:
            - name: data
              mountPath: /data/tsdb
          env:
            - name: STORAGE_ACCOUNT_KEY
              valueFrom:
                secretKeyRef:
                  name: keystorage
                  key: key1
            # the env variables below allow each pod to identify itself,
            # i.e. in the hashring config and to set a label
            - name: NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
      volumes:
      imagePullSecrets:
        - name: registry
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 256Gi
      storageClassName: standard
---
# Source: templates/routing-receive.yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-routing-receive
  namespace: thanos
  labels: 
    product: thanos
    
    app: thanos-routing-receive
spec:
  ports:
    - port: 10901
      protocol: TCP
      name: grpc
    - port: 10908
      protocol: TCP
      name: http-receive
    - port: 10909
      protocol: TCP
      name: http
  selector: 
    app: thanos-routing-receive
---
# Source: templates/routing-receive.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels: 
    app: thanos-routing-receive
  name: thanos-routing-receive
  namespace: thanos
spec:
  replicas: 3
  selector:
    matchLabels: 
      app: thanos-routing-receive
  template:
    metadata:
      labels: 
        app: thanos-routing-receive
    spec:
      containers:
        - name: thanos-routing-receive
          image: "thanos:v0.26.0"
          imagePullPolicy: Always
          ports:
            - containerPort: 10908
              name: remote-write
            - containerPort: 10901
              name: grpc
            - containerPort: 10909
              name: http
        
          livenessProbe:
            initialDelaySeconds: 5
            failureThreshold: 4
            httpGet:
              port: 10909
              scheme: HTTP
              path: /-/healthy
            periodSeconds: 30
        
          readinessProbe:
            initialDelaySeconds: 5
            failureThreshold: 20
            httpGet:
              port: 10909
              scheme: HTTP
              path: /-/ready
            periodSeconds: 5
          resources:
            requests:
              memory: 2Gi
              cpu: 2000m
            limits:
              memory: 3Gi
              cpu: 3000m
          args:
            - receive
            - --tsdb.path=/data/tsdb
            - --label=receive_replica="$(NAME)"
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10909
            - --remote-write.address=0.0.0.0:10908
            - --tsdb.retention=8d
            - --receive.hashrings-file=/etc/hashring/hashrings.json
            - --log.level=debug
          volumeMounts:
            - name: thanos-receive-hashring-generated
              mountPath: /etc/hashring
          env:
            # the env variables below allow each pod to identify itself,
            # i.e. in the hashring config and to set a label
            - name: NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
      volumes:
        - name: thanos-receive-hashring-generated
          configMap:
            defaultMode: 420
            name: thanos-receive-hashring-generated
      imagePullSecrets:
        - name: registry
---
apiVersion: v1
data:
  hashrings.json: '[{"hashring":"xxx","tenants":["xxx"],"endpoints":["thanos-receive-0:10901"]}]'
kind: ConfigMap
metadata:
  creationTimestamp: "2022-05-06T13:46:41Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:hashrings.json: {}
      f:metadata:
        f:ownerReferences:
          .: {}
          k:{"uid":"***"}: {}
    manager: Go-http-client
    operation: Update
    time: "2022-05-06T13:46:41Z"
  name: thanos-receive-hashring-generated
  namespace: thanos
  ownerReferences:
  - apiVersion: v1
    kind: ConfigMap
    name: thanos-receive-hashring-base
    uid: "***"
  resourceVersion: "10"
  uid: "***"
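
To see which names the routing receiver has to resolve, the endpoints can be read out of `hashrings.json`. A small Python sketch, assuming every endpoint is a plain `host:port` string as in the ConfigMap above (`hashring_endpoints` is a name made up for this example, not part of Thanos):

```python
import json
from typing import List, Tuple

# Content of hashrings.json as it appears in the generated ConfigMap above.
HASHRINGS_JSON = (
    '[{"hashring":"xxx","tenants":["xxx"],'
    '"endpoints":["thanos-receive-0:10901"]}]'
)


def hashring_endpoints(raw: str) -> List[Tuple[str, str, int]]:
    """Return (hashring name, host, port) for every endpoint in the file."""
    result = []
    for ring in json.loads(raw):
        for endpoint in ring.get("endpoints", []):
            # Split on the last colon so the port is separated from the host.
            host, _, port = endpoint.rpartition(":")
            result.append((ring.get("hashring", ""), host, int(port)))
    return result


for ring, host, port in hashring_endpoints(HASHRINGS_JSON):
    print(f"hashring={ring} host={host} port={port}")
```

Note that this ConfigMap lists the short name `thanos-receive-0`, while the log line in the report shows a fully qualified name.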

erispoe avatar May 12 '22 16:05 erispoe

Can anyone link to the relevant code that does the IP caching?

phillebaba avatar Jun 06 '22 22:06 phillebaba

Seems to be this function: https://github.com/thanos-io/thanos/blob/main/pkg/receive/handler.go#L815

fpetkovski avatar Jun 09 '22 08:06 fpetkovski

Hi all, we did some more investigation into this issue. It is not a problem with Thanos, nor with DNS resolution. We run an Istio proxy sidecar next to the Thanos containers. The sidecar was not able to detect the gRPC connection termination when the thanos-receive component was restarted. As a result, thanos-routing-receive kept sending requests over the old connection/channel, and they timed out because there was nothing on the other side. By tweaking Istio we were able to solve the problem.
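
For readers wondering how a peer that disappears without closing the connection can be noticed at all: one general mechanism is keepalive probing on the connection. The Python sketch below only illustrates that idea at the TCP level. It is an assumption for illustration, not the Istio change we made, and `enable_tcp_keepalive` and its default values are made up for the example.

```python
import socket


def enable_tcp_keepalive(sock: socket.socket, idle_s: int = 30,
                         interval_s: int = 10, probes: int = 3) -> None:
    """Ask the kernel to probe an idle connection so a vanished peer is detected."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options are platform-specific (Linux names); skip any that are missing.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_tcp_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With probing enabled, writes on a dead connection eventually fail with an error instead of timing out indefinitely, which gives the client a reason to open a new connection.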

abaguas avatar Jul 12 '22 23:07 abaguas

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Sep 21 '22 02:09 stale[bot]