external-dns v0.13.5 trying to create CNAME records after upgrading, leading to CrashLoopBackOff
What happened: After upgrading external-dns from 0.13.4 to 0.13.5, it began trying to create CNAME records instead of the A records it had been creating previously. The external-dns pod then went into CrashLoopBackOff due to a "Modification Conflict" error.
What you expected to happen: external-dns would continue to create A records after the upgrade and not crash.
How to reproduce it (as minimally and precisely as possible): Have multiple
Anything else we need to know?:
Environment: Kubernetes cluster on v1.26
- External-DNS version (use external-dns --version): 0.13.5
- DNS provider: Akamai
- Others: Logs:
time="2023-06-15T20:01:45Z" level=info msg="Instantiating new Kubernetes client"
time="2023-06-15T20:01:45Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2023-06-15T20:01:45Z" level=info msg="Created Kubernetes client https://10.233.0.1:443"
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=argocd.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=prometheus.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=loki.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=teleport.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=argocd.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/argocd/argocd-empty-ingress\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-argocd.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/argocd/argocd-empty-ingress\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=prometheus.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/prometheus/kube-prometheus-kube-prome-prometheus\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-prometheus.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/prometheus/kube-prometheus-kube-prome-prometheus\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=loki.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/loki/loki\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-loki.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/loki/loki\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=teleport.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/teleport-cluster/teleport-cluster\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-teleport.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/teleport-cluster/teleport-cluster\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=error msg="Failed to create endpoints for DNS zone example.org. Error: Modification Confict: [Duplicate record set found with name loki.example.org and type TXT]"
time="2023-06-15T20:01:47Z" level=fatal msg="Modification Confict: [Duplicate record set found with name loki.example.org and type TXT]"
Please share all args used to start external-dns and the resources that cause external-dns to create these records. We also need the ingress status, as it contains the target, and we need to know whether there are two resources that want different targets and what kind of source you use.
Args used to start external-dns:
Args:
--log-level=debug
--log-format=text
--interval=1m
--source=service
--source=ingress
--policy=sync
--registry=txt
--domain-filter=example.org
--provider=akamai
Some example resources:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: vault-production
    cert-manager.io/common-name: prometheus.example.com
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls.options: traefik-mtls@kubernetescrd
  creationTimestamp: "2023-06-05T21:46:46Z"
  generation: 1
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 46.4.1
    chart: kube-prometheus-stack-46.4.1
    heritage: Helm
    release: kube-prometheus
  name: kube-prometheus-kube-prome-prometheus
  namespace: prometheus
  resourceVersion: "1806450"
  uid: 8dd5092c-c323-4437-ad24-45dcd2f31cf8
spec:
  ingressClassName: traefik
  rules:
  - host: prometheus.example.org
    http:
      paths:
      - backend:
          service:
            name: kube-prometheus-kube-prome-prometheus
            port:
              number: 9090
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - prometheus.example.org
    secretName: prometheus-tls
status:
  loadBalancer:
    ingress:
    - hostname: 12-34-567-89.example.org
      ip: 12.34.567.89
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: teleport.example.org
  creationTimestamp: "2023-06-05T20:38:56Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app.kubernetes.io/component: proxy
    app.kubernetes.io/instance: teleport-cluster
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: teleport-cluster
    app.kubernetes.io/version: 13.0.0-alpha.2-amd64
    helm.sh/chart: teleport-cluster-13.0.3
    teleport.dev/majorVersion: "13"
  name: teleport-cluster
  namespace: teleport-cluster
  resourceVersion: "1784576"
  uid: d7b0f713-a259-403c-a77b-5286d9afb1cf
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.233.51.128
  clusterIPs:
  - 10.233.51.128
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: tls
    nodePort: 31767
    port: 443
    protocol: TCP
    targetPort: 3080
  selector:
    app.kubernetes.io/component: proxy
    app.kubernetes.io/instance: teleport-cluster
    app.kubernetes.io/name: teleport-cluster
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - hostname: 12-34-34-123.example.org
      ip: 12-34-34-123
We've seen it fail when trying to create records for both Ingress and Service objects, without us making any changes other than upgrading the external-dns version.
I don't see loki.example.org in the provided example resources. So I don't see how those resources could have created those records.
@johngmyers I see similar behavior.
In my case, I create a KIND cluster with a service that has the annotation external-dns.alpha.kubernetes.io/hostname: some.example.org so that external-dns creates the record. Then I delete the cluster completely and recreate it with the same service and annotation.
Because external-dns did not get a chance to delete the previously created entry, it goes into a CrashLoopBackOff state.
If I delete the service first, let external-dns delete the entry and then destroy and recreate cluster, then it works as expected.
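For reference, a minimal sketch of the kind of Service described above. The hostname annotation is the one from the comment; the name, namespace, selector, and ports are hypothetical placeholders, and it assumes something in the KIND cluster (e.g. MetalLB) actually assigns the LoadBalancer an address for external-dns to use as the target.
# Hypothetical Service for the repro above; only the hostname annotation
# and the LoadBalancer status matter to external-dns with --source=service.
apiVersion: v1
kind: Service
metadata:
  name: some-service        # placeholder name
  namespace: default        # placeholder namespace
  annotations:
    external-dns.alpha.kubernetes.io/hostname: some.example.org
spec:
  type: LoadBalancer        # assumes the cluster can assign an LB address
  selector:
    app: some-app           # placeholder selector
  ports:
  - port: 443
    targetPort: 8443
    protocol: TCP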
@amold1 Please supply a reproducible test case, complete with server arguments, Kubernetes resources, any other initial conditions, actual behavior, and expected behavior.
@johngmyers I was also affected by this on v0.13.5; here are the steps to reproduce:
- Use external-dns v0.11.0 (that's what I used previously; v0.13.4 may work as well, as others have pointed out). To reproduce, we start with an older version and then upgrade to v0.13.5.
# external-dns args
args:
- --log-level=info
- --log-format=json
- --interval=30s
- --source=service
- --source=ingress
- --policy=sync
- --registry=txt
- --txt-owner-id=xxxxxxxxxxx
- --domain-filter=example.com
- --provider=aws
- Create an Ingress with the annotation alb.ingress.kubernetes.io/ip-address-type: dualstack (if you don't have dualstack networking set up, you can first create the Ingress without this annotation and add it once the LB is provisioned; the alb-controller will then fail to reconcile further, but that should be okay):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}]'
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/ip-address-type: dualstack
    kubernetes.io/ingress.class: alb
  name: external-dns-test-failure
spec:
  rules:
  - host: external-dns-test.example.com
    http:
      paths:
      - backend:
          service:
            name: external-dns-test-canary
            port:
              name: http
        path: /*
        pathType: ImplementationSpecific
- Once the dualstack annotation is added, external-dns v0.11.0 creates three records: TXT, A, and AAAA.
external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com A [Id: /hostedzone/XXXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com AAAA [Id: /hostedzone/XXXXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
- Now upgrade the controller to v0.13.5. It tries to create two TXT records with the cname- prefix, fails, and goes into CrashLoopBackOff.
external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=info msg="Desired change: CREATE cname-external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXXX]"
external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=info msg="Desired change: CREATE cname-external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXXXX]"
external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=error msg="Failure in zone example.com. [Id: /hostedzone/XXXXXXXXXX] when submitting change batch: InvalidChangeBatch: [The request contains an invalid set of changes for a resource record set 'TXT cname-external-dns-test.example.com.']\n\tstatus code: 400, request id: d78cd4b1-0514-4eac-bfe1-bae08e3c071d"
EKS: 1.23
aws-load-balancer-controller: v2.5.3
The two TXT records it was trying to CREATE were exactly the same (I tested using a custom image with additional logic) - so maybe some issue with the deduplication logic?
BTW, I tried master (commit: 92824f4f9) and that didn't result in this behavior.
If this isn't reproducing on master, there's little reason to investigate.
Did a little more digging; it seems the commit https://github.com/kubernetes-sigs/external-dns/commit/1bd38347430ac0dddd8e68e23ecf12c426369892 fixed the issue for me.
Hey @johngmyers, sorry about that. Here's the loki ingress resource we're using.
apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    annotations:
      cert-manager.io/cluster-issuer: vault-production
      cert-manager.io/common-name: loki.example.org
      traefik.ingress.kubernetes.io/router.entrypoints: websecure
      traefik.ingress.kubernetes.io/router.tls.options: traefik-mtls@kubernetescrd
    creationTimestamp: "2023-05-11T18:17:42Z"
    generation: 1
    labels:
      app.kubernetes.io/instance: loki
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: loki
      app.kubernetes.io/version: 2.8.2
      helm.sh/chart: loki-5.8.4
    name: loki
    namespace: loki
    resourceVersion: "18129523"
    uid: cd8da11a-f2be-418b-b15c-d4e3c1be4eae
  spec:
    ingressClassName: traefik
    rules:
    - host: loki.example.org
      http:
        paths:
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /api/prom/tail
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api/v1/tail
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /api/prom/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api/v1/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /prometheus/api/v1/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /prometheus/api/v1/alerts
          pathType: Prefix
        - backend:
            service:
              name: loki-write
              port:
                number: 3100
          path: /api/prom/push
          pathType: Prefix
        - backend:
            service:
              name: loki-write
              port:
                number: 3100
          path: /loki/api/v1/push
          pathType: Prefix
    tls:
    - hosts:
      - loki.example.org
      secretName: loki-distributed-tls
  status:
    loadBalancer:
      ingress:
      - hostname: 12-34-567-89.example.org
        ip: 12.34.567.89
In our case we're running multiple clusters with workloads provisioned via argocd and have seen the same error occur but with different resources mentioned based on what external-dns tries to reconcile first.
Same issue with the Google provider on v0.13.5 as well. Downgrading to v0.13.4 helped.
@joaocc That's not a CNAME record, as reported in the initial description. That's a TXT record and is expected behavior.
@johngmyers You are correct. Will remove my comment to avoid future confusion. Sorry for the misunderstanding. Thx
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".