AWS Route53 and shared service IP address: InvalidChangeBatch: Duplicate Resource Record

Open stephan2012 opened this issue 4 years ago • 39 comments

What happened:

We are using MetalLB’s capability to share the same external IP address across two or more Kubernetes services (through the metallb.universe.tf/allow-shared-ip annotation) and assign the same hostname through the external-dns.alpha.kubernetes.io/hostname annotation. Unfortunately, this results in errors from the AWS Route53 API:

time="2021-01-22T07:35:34Z" level=info msg="Desired change: CREATE loginputs.app.aws.company.com TXT [Id: /hostedzone/Z07039751Q2TM9MKHPKU5]"
time="2021-01-22T07:35:34Z" level=info msg="Desired change: CREATE mq.app.aws.company.com A [Id: /hostedzone/Z07039751Q2TM9MKHPKU5]"
time="2021-01-22T07:35:34Z" level=info msg="Desired change: CREATE mq.app.aws.company.com TXT [Id: /hostedzone/Z07039751Q2TM9MKHPKU5]"
time="2021-01-22T07:35:34Z" level=error msg="Failure in zone app.aws.company.com. [Id: /hostedzone/Z07039751Q2TM9MKHPKU5]"
time="2021-01-22T07:35:34Z" level=error msg="InvalidChangeBatch: [Duplicate Resource Record: '10.160.0.188']\n\tstatus code: 400, request id: 7205efc0-0a63-41e8-a9a4-66a739ac77b8"
time="2021-01-22T07:35:34Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/Z07039751Q2TM9MKHPKU5]"

What you expected to happen:

external-dns should deduplicate the list of changes that it submits in a single batch.
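
For illustration, a minimal Go sketch of the kind of batch-level deduplication meant here. The change type and every name in it are assumptions made for this example, not external-dns's actual types; the real plan/provider code is more involved.

package main

import "fmt"

// change is a simplified stand-in for one Route53 change entry; the real
// external-dns code works with the AWS SDK's ResourceRecordSet types instead.
type change struct {
    Action string // e.g. "CREATE" or "UPSERT"
    Name   string // record name, e.g. "loginputs.app.aws.company.com"
    Type   string // record type, e.g. "A" or "TXT"
    Value  string // record value, e.g. "10.160.0.188"
}

// dedupeChanges drops exact duplicates from a batch, so two services that
// share one IP and hostname yield a single CREATE instead of two.
func dedupeChanges(changes []change) []change {
    seen := make(map[change]struct{}, len(changes))
    out := make([]change, 0, len(changes))
    for _, c := range changes {
        if _, ok := seen[c]; ok {
            continue // exact duplicate: skip it instead of sending it to Route53
        }
        seen[c] = struct{}{}
        out = append(out, c)
    }
    return out
}

func main() {
    batch := []change{
        {"CREATE", "loginputs.app.aws.company.com", "A", "10.160.0.188"}, // from graylog-tcp
        {"CREATE", "loginputs.app.aws.company.com", "A", "10.160.0.188"}, // from graylog-udp
    }
    fmt.Println(dedupeChanges(batch)) // only one change survives
}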

How to reproduce it (as minimally and precisely as possible):

Ensure a valid Route 53 configuration, then create two services with the same loadBalancerIP and identical external-dns.alpha.kubernetes.io/hostname annotations. When testing with MetalLB, the metallb.universe.tf/allow-shared-ip annotation needs to be set on both services, as shown in the docs.

Anything else we need to know?:

More details:

NAMESPACE              NAME                                 TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                                                                                                                       AGE
log                    graylog-tcp                          LoadBalancer   100.66.166.38    10.160.0.188   5044:32071/TCP,12201:30325/TCP,12203:30399/TCP                                                                                10h
log                    graylog-udp                          LoadBalancer   100.66.142.96    10.160.0.188   5410:32120/UDP                                                                                                                10h

Service graylog-udp:

apiVersion: v1
kind: Service
metadata:
  name: graylog-udp
  namespace: log
  annotations:
    external-dns.alpha.kubernetes.io/hostname: loginputs.app.aws.company.com
    metallb.universe.tf/allow-shared-ip: graylog-inputs
[…]
spec:
[…]
  type: LoadBalancer
  loadBalancerIP: 10.160.0.188

Service graylog-tcp:

apiVersion: v1
kind: Service
metadata:
  name: graylog-tcp
  namespace: log
  annotations:
    external-dns.alpha.kubernetes.io/hostname: loginputs.app.aws.company.com
    metallb.universe.tf/allow-shared-ip: graylog-inputs
[…]
spec:
[…]
  type: LoadBalancer
  loadBalancerIP: 10.160.0.188

This issue has been mentioned in #1015 before, but that issue is closed and lacks details.

(By the way, just in case anybody wonders why there is MetalLB on an AWS instance: this is an easy way to manage multiple IP addresses on a single-node cluster.)

Environment:

  • External-DNS version (use external-dns --version): v0.7.5
  • DNS provider: AWS Route53
  • Others:

stephan2012 avatar Jan 22 '21 11:01 stephan2012

We are also seeing the same issue with v0.7.6, with multiple services running on the same EKS cluster. The issue does not occur with v0.7.4.

FlowColwyn avatar Apr 15 '21 11:04 FlowColwyn

We are experiencing the issue today, and I can confirm that switching to v0.7.4 seems to have resolved it. We are not using MetalLB, though, just the standard external-dns pod. The bug showed up in the latest tag, which is what prompted our debugging.

robwithhair avatar Apr 29 '21 08:04 robwithhair

We encountered this issue today using v0.7.6. We tested v0.8.0 and it still has the issue. Rolling back to v0.7.4 fixed it.

miguelgmalpha avatar May 14 '21 10:05 miguelgmalpha

Records are also not being deduplicated with the rfc2136 provider: external-dns constantly removes and re-adds a resource record because the same hostname is assigned to two different services sharing the same IP address. A sketch of the endpoint merging that could avoid this churn follows the manifests below.

Logs:

time="2021-05-14T10:19:03Z" level=info msg="Removing RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:19:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:19:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:19:03Z" level=info msg="Removing RR: loginput.lab.company.com 0 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""
time="2021-05-14T10:19:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""
time="2021-05-14T10:20:03Z" level=info msg="Removing RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:20:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:20:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:20:03Z" level=info msg="Removing RR: loginput.lab.company.com 0 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""
time="2021-05-14T10:20:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""
time="2021-05-14T10:21:03Z" level=info msg="Removing RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:21:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:21:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:21:03Z" level=info msg="Removing RR: loginput.lab.company.com 0 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""
time="2021-05-14T10:21:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""
time="2021-05-14T10:22:03Z" level=info msg="Removing RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:22:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:22:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:22:03Z" level=info msg="Removing RR: loginput.lab.company.com 0 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""
time="2021-05-14T10:22:03Z" level=info msg="Adding RR: loginput.lab.company.com 60 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""
time="2021-05-14T10:23:04Z" level=info msg="Removing RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:23:04Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:23:04Z" level=info msg="Adding RR: loginput.lab.company.com 60 A 192.168.100.118"
time="2021-05-14T10:23:04Z" level=info msg="Removing RR: loginput.lab.company.com 0 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""
time="2021-05-14T10:23:04Z" level=info msg="Adding RR: loginput.lab.company.com 60 TXT \"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/log/graylog-tcp\""

Service 1:

apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: loginput.lab.company.com
    meta.helm.sh/release-name: graylog
    meta.helm.sh/release-namespace: log
    metallb.universe.tf/allow-shared-ip: graylog-inputs
[…]

Service 2:

apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: loginput.lab.company.com
    meta.helm.sh/release-name: graylog
    meta.helm.sh/release-namespace: log
    metallb.universe.tf/allow-shared-ip: graylog-inputs
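
For reference, a minimal Go sketch of merging endpoints that share a hostname, which is the kind of fix hinted at above. The endpoint struct and function names are illustrative assumptions, not external-dns's actual code:

package main

import (
    "fmt"
    "sort"
)

// endpoint is a simplified stand-in for external-dns's endpoint model.
type endpoint struct {
    DNSName    string
    RecordType string
    Targets    []string
}

// mergeEndpoints collapses endpoints with the same (name, type) pair into a
// single endpoint holding the deduplicated union of their targets, so two
// services sharing an IP no longer produce the remove/add churn seen above.
func mergeEndpoints(eps []endpoint) []endpoint {
    type key struct{ name, rtype string }
    sets := map[key]map[string]struct{}{}
    for _, ep := range eps {
        k := key{ep.DNSName, ep.RecordType}
        if sets[k] == nil {
            sets[k] = map[string]struct{}{}
        }
        for _, t := range ep.Targets {
            sets[k][t] = struct{}{}
        }
    }
    out := make([]endpoint, 0, len(sets))
    for k, targets := range sets {
        ep := endpoint{DNSName: k.name, RecordType: k.rtype}
        for t := range targets {
            ep.Targets = append(ep.Targets, t)
        }
        sort.Strings(ep.Targets) // stable order keeps sync cycles idempotent
        out = append(out, ep)
    }
    return out
}

func main() {
    eps := []endpoint{
        {"loginput.lab.company.com", "A", []string{"192.168.100.118"}}, // graylog-tcp
        {"loginput.lab.company.com", "A", []string{"192.168.100.118"}}, // graylog-udp
    }
    fmt.Println(mergeEndpoints(eps)) // one endpoint, one target
}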

stephan2012 avatar May 14 '21 10:05 stephan2012

We have a similar scenario where two different DNS entries point to the same IPs:

time="2021-05-18T12:55:01Z" level=info msg="Desired change: UPSERT external-brokers.tenant.eu.stg.data.company.com A [Id: /hostedzone/XXXXXXXXXXXXXXXXX]"
time="2021-05-18T12:55:01Z" level=info msg="Desired change: UPSERT external-brokers.tenant.eu.stg.data.company.com TXT [Id: /hostedzone/XXXXXXXXXXXXXXXXX]"
time="2021-05-18T12:55:01Z" level=info msg="Desired change: UPSERT external-brokers.stg.data.company.com A [Id: /hostedzone/XXXXXXXXXXXXXXXXX]"
time="2021-05-18T12:55:01Z" level=info msg="Desired change: UPSERT external-brokers.stg.data.company.com TXT [Id: /hostedzone/XXXXXXXXXXXXXXXXX]"
time="2021-05-18T12:55:01Z" level=error msg="Failure in zone stg.data.company.com. [Id: /hostedzone/XXXXXXXXXXXXXXXXX]"
time="2021-05-18T12:55:01Z" level=error msg="InvalidChangeBatch: [Duplicate Resource Record: '10.221.8.236', Duplicate Resource Record: '10.221.8.236', Duplicate Resource Record: '10.221.3.179', Duplicate Resource Record: '10.221.7.87']\n\tstatus code: 400, request id: 925592cf-4805-4f16-b288-9269848b8649"

Service 1:

apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: external-brokers.stg.data.company.com
spec:
  type: NodePort
...

Service 2:

apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: external-brokers.tenant.eu.stg.data.company.com
spec:
  type: NodePort
...

With external-dns:0.7.4 this works fine.
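
The duplicates here appear to sit inside a single record set (the same node IP resolved more than once), so the values list itself needs deduplicating. A minimal Go sketch under that assumption; uniqueTargets is an illustrative name, not external-dns's API:

package main

import "fmt"

// uniqueTargets removes exact duplicates from a record set's value list,
// preserving first-seen order so the record stays stable across syncs.
func uniqueTargets(targets []string) []string {
    seen := make(map[string]struct{}, len(targets))
    out := make([]string, 0, len(targets))
    for _, t := range targets {
        if _, ok := seen[t]; ok {
            continue
        }
        seen[t] = struct{}{}
        out = append(out, t)
    }
    return out
}

func main() {
    // The node IPs rejected by Route53 in the log above.
    fmt.Println(uniqueTargets([]string{
        "10.221.8.236", "10.221.8.236", "10.221.3.179", "10.221.7.87",
    })) // [10.221.8.236 10.221.3.179 10.221.7.87]
}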

miguelgmalpha avatar May 18 '21 13:05 miguelgmalpha

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 16 '21 13:08 k8s-triage-robot

/remove-lifecycle stale

stephan2012 avatar Aug 16 '21 13:08 stephan2012

Issue is not yet resolved.

stephan2012 avatar Aug 16 '21 13:08 stephan2012

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 14 '21 14:11 k8s-triage-robot

/remove-lifecycle stale

stephan2012 avatar Nov 14 '21 16:11 stephan2012

The issue does not disappear just because it is not addressed …

stephan2012 avatar Nov 14 '21 16:11 stephan2012

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 12 '22 17:02 k8s-triage-robot

Still an issue.

/remove-lifecycle stale

stephan2012 avatar Feb 14 '22 17:02 stephan2012

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 15 '22 17:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 14 '22 18:06 k8s-triage-robot

/remove-lifecycle rotten

flokli avatar Jun 23 '22 09:06 flokli

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 21 '22 10:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 21 '22 10:10 k8s-triage-robot

/remove-lifecycle rotten

flokli avatar Oct 21 '22 11:10 flokli

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 19 '23 12:01 k8s-triage-robot

/remove-lifecycle stale

flokli avatar Jan 20 '23 10:01 flokli

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 20 '23 10:04 k8s-triage-robot

/remove-lifecycle stale

flokli avatar Apr 20 '23 12:04 flokli

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 19 '23 13:07 k8s-triage-robot

/remove-lifecycle stale

flokli avatar Jul 19 '23 13:07 flokli

This issue is still occurring occasionally for us. Our workaround is to delete the duplicated service:

# Find the service carrying the duplicated external name:
kubectl get service -n your-namespace | grep duplicated-external-name-service-here

# Then delete it:
kubectl delete service -n your-namespace duplicated-external-name-service-here

sam-som avatar Aug 28 '23 23:08 sam-som

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 27 '24 05:01 k8s-triage-robot

/remove-lifecycle stale

flokli avatar Jan 27 '24 08:01 flokli

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 26 '24 09:04 k8s-triage-robot

/remove-lifecycle stale

❤️

flokli avatar Apr 26 '24 10:04 flokli