
external-dns v0.13.5 trying to create CNAME records after upgrading leading to crashloopbackoff

Open rl0nergan opened this issue 2 years ago • 15 comments

What happened: After upgrading external-dns from 0.13.4 to 0.13.5, it began trying to create CNAME records instead of A records like it had been previously. The external-dns pod then went into CrashLoopBackOff due to a "Modification Conflict" error.

What you expected to happen: External-dns would continue to create A records after an upgrade and not crash.

How to reproduce it (as minimally and precisely as possible): Have multiple

Anything else we need to know?:

Environment: Kubernetes cluster on v1.26

  • External-DNS version (use external-dns --version): 0.13.5
  • DNS provider: Akamai
  • Others: Logs:
time="2023-06-15T20:01:45Z" level=info msg="Instantiating new Kubernetes client"
time="2023-06-15T20:01:45Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2023-06-15T20:01:45Z" level=info msg="Created Kubernetes client https://10.233.0.1:443"
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=argocd.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=prometheus.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=loki.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=teleport.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=argocd.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/argocd/argocd-empty-ingress\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-argocd.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/argocd/argocd-empty-ingress\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=prometheus.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/prometheus/kube-prometheus-kube-prome-prometheus\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-prometheus.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/prometheus/kube-prometheus-kube-prome-prometheus\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=loki.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/loki/loki\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-loki.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/loki/loki\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=teleport.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/teleport-cluster/teleport-cluster\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-teleport.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/teleport-cluster/teleport-cluster\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=error msg="Failed to create endpoints for DNS zone example.org. Error: Modification Confict: [Duplicate record set found with name loki.example.org and type TXT]"
time="2023-06-15T20:01:47Z" level=fatal msg="Modification Confict: [Duplicate record set found with name loki.example.org and type TXT]"

rl0nergan avatar Jun 20 '23 13:06 rl0nergan

Please share all args used to start external-dns and the resources that led external-dns to create these records. We also need the ingress status, since it contains the target, and we need to know whether two resources want different targets and what kind of source you use.

szuecs avatar Jun 21 '23 08:06 szuecs

Args used to start external-dns:

    Args:
      --log-level=debug
      --log-format=text
      --interval=1m
      --source=service
      --source=ingress
      --policy=sync
      --registry=txt
      --domain-filter=example.org
      --provider=akamai
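
For completeness, here's a minimal sketch of how those flags sit in the container spec of the external-dns Deployment; names and image tag are illustrative, not our exact manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
spec:
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        # Illustrative image tag; we run the upstream image at v0.13.5
        image: registry.k8s.io/external-dns/external-dns:v0.13.5
        args:
        - --log-level=debug
        - --log-format=text
        - --interval=1m
        - --source=service
        - --source=ingress
        - --policy=sync
        - --registry=txt
        - --domain-filter=example.org
        - --provider=akamai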

Some example resources:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: vault-production
    cert-manager.io/common-name: prometheus.example.com
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls.options: traefik-mtls@kubernetescrd
  creationTimestamp: "2023-06-05T21:46:46Z"
  generation: 1
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 46.4.1
    chart: kube-prometheus-stack-46.4.1
    heritage: Helm
    release: kube-prometheus
  name: kube-prometheus-kube-prome-prometheus
  namespace: prometheus
  resourceVersion: "1806450"
  uid: 8dd5092c-c323-4437-ad24-45dcd2f31cf8
spec:
  ingressClassName: traefik
  rules:
  - host: prometheus.example.org
    http:
      paths:
      - backend:
          service:
            name: kube-prometheus-kube-prome-prometheus
            port:
              number: 9090
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - prometheus.example.org
    secretName: prometheus-tls
status:
  loadBalancer:
    ingress:
    - hostname: 12-34-567-89.example.org
      ip: 12.34.567.89
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: teleport.example.org
  creationTimestamp: "2023-06-05T20:38:56Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app.kubernetes.io/component: proxy
    app.kubernetes.io/instance: teleport-cluster
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: teleport-cluster
    app.kubernetes.io/version: 13.0.0-alpha.2-amd64
    helm.sh/chart: teleport-cluster-13.0.3
    teleport.dev/majorVersion: "13"
  name: teleport-cluster
  namespace: teleport-cluster
  resourceVersion: "1784576"
  uid: d7b0f713-a259-403c-a77b-5286d9afb1cf
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.233.51.128
  clusterIPs:
  - 10.233.51.128
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: tls
    nodePort: 31767
    port: 443
    protocol: TCP
    targetPort: 3080
  selector:
    app.kubernetes.io/component: proxy
    app.kubernetes.io/instance: teleport-cluster
    app.kubernetes.io/name: teleport-cluster
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - hostname: 12-34-34-123.example.org
      ip: 12-34-34-123

We've seen it fail when trying to create records for both ingress and service type objects, without us making any changes other than upgrading the external-dns version.

rl0nergan avatar Jun 26 '23 16:06 rl0nergan

I don't see loki.example.org in the provided example resources. So I don't see how those resources could have created those records.

johngmyers avatar Jun 28 '23 03:06 johngmyers

@johngmyers I see similar behavior.

In my case, I create a KIND cluster with a service that has an annotation (external-dns.alpha.kubernetes.io/hostname: some.example.org) for external-dns to create the record. Then I delete the cluster completely and recreate it with the same service and annotation.

But because external-dns did not have a chance to delete the previously created entry, it goes into a CrashLoopBackOff state.

If I delete the service first, let external-dns delete the entry, and then destroy and recreate the cluster, it works as expected.
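
For reference, the service in question looks roughly like this (names, ports, and hostname are illustrative; the annotation is the standard external-dns hostname annotation):

# Illustrative only: a LoadBalancer Service carrying the external-dns hostname annotation
apiVersion: v1
kind: Service
metadata:
  name: some-app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: some.example.org
spec:
  type: LoadBalancer
  selector:
    app: some-app
  ports:
  - name: https
    port: 443
    targetPort: 8443
    protocol: TCP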

amold1 avatar Jun 28 '23 15:06 amold1

@amold1 Please supply a reproducible test case, complete with server arguments, Kubernetes resources, any other initial conditions, actual behavior, and expected behavior.

johngmyers avatar Jun 28 '23 15:06 johngmyers

@johngmyers I was also affected by this on v0.13.5; here are the steps to reproduce:

  1. Use external-dns v0.11.0 (that's what I used previously; v0.13.4 may work as well, as others have pointed out). To reproduce, we start with an older version and then upgrade to v0.13.5.

    # external-dns args
      - args:
        - --log-level=info
        - --log-format=json
        - --interval=30s
        - --source=service
        - --source=ingress
        - --policy=sync
        - --registry=txt
        - --txt-owner-id=xxxxxxxxxxx
        - --domain-filter=example.com
        - --provider=aws
    
  2. Set ip-address-type: dualstack on the ingress (if you don't have dualstack networking set up, you can first create the ingress without this annotation and, once the LB is provisioned, add it - the alb-controller will fail to reconcile further, but that should be okay)

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      annotations:
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}]'
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip
        alb.ingress.kubernetes.io/ip-address-type: dualstack
        kubernetes.io/ingress.class: alb
      name: external-dns-test-failure
    spec:
      rules:
      - host: external-dns-test.example.com
        http:
          paths:
          - backend:
              service:
                name: external-dns-test-canary
                port:
                  name: http
            path: /*
            pathType: ImplementationSpecific
    
  3. Once the dualstack annotation is added, external-dns v0.11.0 creates three records: TXT, A, and AAAA

    external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com A [Id: /hostedzone/XXXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
    external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com AAAA [Id: /hostedzone/XXXXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
    external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
    
  4. Now upgrade the controller to v0.13.5; it tries to create two TXT records with the cname- prefix, fails, and goes into CrashLoopBackOff

    external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=info msg="Desired change: CREATE cname-external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXXX]"
    external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=info msg="Desired change: CREATE cname-external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXXXX]"
    external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=error msg="Failure in zone example.com. [Id: /hostedzone/XXXXXXXXXX] when submitting change batch: InvalidChangeBatch: [The request contains an invalid set of changes for a resource record set 'TXT cname-external-dns-test.example.com.']\n\tstatus code: 400, request id: d78cd4b1-0514-4eac-bfe1-bae08e3c071d"
    

EKS: 1.23, aws-load-balancer-controller: v2.5.3

nakamume avatar Jul 07 '23 03:07 nakamume

The two TXT records it was trying to CREATE were exactly the same (I tested using a custom image with additional logic) - so maybe some issue with the deduplication logic?

BTW, I tried master (commit: 92824f4f9) and that didn't result in this behavior.

nakamume avatar Jul 07 '23 03:07 nakamume

If this isn't reproducing on master, there's little reason to investigate.

johngmyers avatar Jul 07 '23 04:07 johngmyers

Did a little more digging; it seems the commit https://github.com/kubernetes-sigs/external-dns/commit/1bd38347430ac0dddd8e68e23ecf12c426369892 fixed the issue for me.

nakamume avatar Jul 07 '23 05:07 nakamume

Hey @johngmyers, sorry about that. Here's the loki ingress resource we're using.

apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    annotations:
      cert-manager.io/cluster-issuer: vault-production
      cert-manager.io/common-name: loki.example.org
      traefik.ingress.kubernetes.io/router.entrypoints: websecure
      traefik.ingress.kubernetes.io/router.tls.options: traefik-mtls@kubernetescrd
    creationTimestamp: "2023-05-11T18:17:42Z"
    generation: 1
    labels:
      app.kubernetes.io/instance: loki
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: loki
      app.kubernetes.io/version: 2.8.2
      helm.sh/chart: loki-5.8.4
    name: loki
    namespace: loki
    resourceVersion: "18129523"
    uid: cd8da11a-f2be-418b-b15c-d4e3c1be4eae
  spec:
    ingressClassName: traefik
    rules:
    - host: loki.example.org
      http:
        paths:
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /api/prom/tail
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api/v1/tail
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /api/prom/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api/v1/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /prometheus/api/v1/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /prometheus/api/v1/alerts
          pathType: Prefix
        - backend:
            service:
              name: loki-write
              port:
                number: 3100
          path: /api/prom/push
          pathType: Prefix
        - backend:
            service:
              name: loki-write
              port:
                number: 3100
          path: /loki/api/v1/push
          pathType: Prefix
    tls:
    - hosts:
      - loki.example.org
      secretName: loki-distributed-tls
  status:
    loadBalancer:
      ingress:
      - hostname: 12-34-567-89.example.org
        ip: 12.34.567.89

In our case we're running multiple clusters with workloads provisioned via argocd, and we've seen the same error occur with different resources mentioned, depending on what external-dns tries to reconcile first.

rl0nergan avatar Jul 07 '23 18:07 rl0nergan

Same issue with the google provider on v0.13.5 as well. Downgrading to v0.13.4 helped.
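
If you install via the Helm chart, pinning the older tag looks roughly like this (key names assume the chart's standard image block; adjust to your chart and values):

# Sketch only: pin the image tag to stay on v0.13.4 until this is fixed
image:
  repository: registry.k8s.io/external-dns/external-dns
  tag: v0.13.4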

maxkokocom avatar Aug 08 '23 03:08 maxkokocom

@joaocc That's not a CNAME record, as reported in the initial description. That's a TXT record and is expected behavior.

johngmyers avatar Sep 27 '23 00:09 johngmyers

@johngmyers You are correct. Will remove my comment to avoid future confusion. Sorry for the misunderstanding. Thx

joaocc avatar Sep 27 '23 07:09 joaocc

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 29 '24 06:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 28 '24 06:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 29 '24 07:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 29 '24 07:03 k8s-ci-robot