
Existing weighted CNAMEs and Route 53 A alias records cause External-DNS to error with InvalidChangeBatch

tejaspbajaj opened this issue

What happened: When a weighted CNAME or A alias record already exists, external-dns does not skip creating an A and a TXT record for the same name; it attempts the change and then errors with

time="2022-06-23T00:25:33Z" level=error msg="InvalidChangeBatch: [RRSet of type A with DNS name <record-name> is not permitted because a conflicting RRSet of type CNAME with the same DNS name already exists in zone <zone-name>, RRSet of type TXT with DNS name <record-name> is not permitted because a conflicting RRSet of type CNAME with the same DNS name already exists in zone <zone-name>]\n\tstatus code: 400, request id: <req-id>"

What you expected to happen: External-DNS should account for the existing weighted CNAME and skip creating a new record with the same name, instead of attempting the change and then erroring.

How to reproduce it (as minimally and precisely as possible):

  1. Create a weighted CNAME record from the Route 53 console; the console will ask for a Record ID (the set identifier). A sketch of the equivalent API call follows these steps.
  2. Create an ingress gateway whose 'hosts' entry uses the same name as the CNAME from step 1.
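
For reference, a minimal sketch of step 1 using the AWS SDK for Go v1 — the hosted zone ID, record names, and set identifier below are placeholders; the console flow described above is equivalent:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

func main() {
	svc := route53.New(session.Must(session.NewSession()))

	// Upsert a weighted CNAME; the SetIdentifier is the "Record ID"
	// the Route 53 console asks for when the routing policy is Weighted.
	_, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String("ZEXAMPLE123"), // placeholder hosted zone ID
		ChangeBatch: &route53.ChangeBatch{
			Changes: []*route53.Change{{
				Action: aws.String(route53.ChangeActionUpsert),
				ResourceRecordSet: &route53.ResourceRecordSet{
					Name:          aws.String("app.example.com"),
					Type:          aws.String(route53.RRTypeCname),
					TTL:           aws.Int64(60),
					SetIdentifier: aws.String("primary"), // non-empty, as Route 53 requires
					Weight:        aws.Int64(100),
					ResourceRecords: []*route53.ResourceRecord{
						{Value: aws.String("primary.app.example.com")},
					},
				},
			}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```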

Anything else we need to know?: I debugged this and figured out why it happens, but I need help finding the correct fix.

When building a plan of changes, external-dns has two lists: current (what is in Route 53) and candidates (what it reads from the ingress endpoints). It then builds a map of dnsName -> {SetIdentifier -> (current, candidate)}, as seen here, where SetIdentifier is the Record ID in Route 53. The Record ID for this dnsName in Route 53 (current) is a non-empty string, while it is empty for the candidate that external-dns builds. Note: Route 53 does not allow empty Record IDs. As a result, the two end up under different keys in the map, so when external-dns later walks the plan to build the list of changes, it cannot match the current record to the candidate. It adds the candidate to the create list, attempts to create it, and errors.
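
To illustrate the keying problem, here is a deliberately simplified sketch — hypothetical types, not the actual external-dns plan code — showing how a current record with a non-empty set identifier and a candidate with an empty one land under different keys, so the candidate falls into the create list:

```go
package main

import "fmt"

type endpoint struct {
	DNSName       string
	SetIdentifier string // Route 53 "Record ID" for weighted/failover records
}

type row struct {
	current   *endpoint
	candidate *endpoint
}

func main() {
	// What already exists in Route 53 vs. what the ingress produces.
	current := []endpoint{{DNSName: "app.example.com", SetIdentifier: "Primary"}}
	candidates := []endpoint{{DNSName: "app.example.com", SetIdentifier: ""}}

	// Key by (dnsName, setIdentifier), mirroring the map described above.
	key := func(e endpoint) string { return e.DNSName + "/" + e.SetIdentifier }

	rows := map[string]*row{}
	for i := range current {
		rows[key(current[i])] = &row{current: &current[i]}
	}
	for i := range candidates {
		r, ok := rows[key(candidates[i])]
		if !ok {
			r = &row{}
			rows[key(candidates[i])] = r
		}
		r.candidate = &candidates[i]
	}

	for k, r := range rows {
		switch {
		case r.current == nil:
			// Candidate with no matching current record: external-dns would CREATE it,
			// which is what triggers the InvalidChangeBatch against the existing CNAME.
			fmt.Println("plan: CREATE", k)
		case r.candidate == nil:
			fmt.Println("plan: no candidate for", k)
		default:
			fmt.Println("plan: match", k)
		}
	}
}
```

Running this prints a CREATE for app.example.com/ even though a record for app.example.com already exists under the "Primary" identifier, because the two never share a key.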

While going over this I noticed a TODO in the comments that could point toward a fix.

cc @Raffo @linki, you guys seem to have worked on this in the past. Would be great to get some ideas on the fix.

Environment:

  • External-DNS version (use external-dns --version): 0.11.0
  • DNS provider: AWS Route 53
  • Others: Istio 1.13.5, Kubernetes 1.21.12

tejaspbajaj avatar Jul 07 '22 00:07 tejaspbajaj

We are seeing the exact same issue with Route53 after attempting to upgrade to 0.12.0.

We are upgrading from v0.5.14 in the same cluster, where this issue did not occur and the related CNAME records were seemingly ignored.

Logs for 0.12.0 and configs:

time="2022-07-20T18:10:09Z" level=info msg="Desired change: CREATE 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net A [Id: /hostedzone/ZBIUQ9J8XHK3M]"
time="2022-07-20T18:10:09Z" level=info msg="Desired change: CREATE 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net TXT [Id: /hostedzone/ZBIUQ9J8XHK3M]"
[RRSet of type A with DNS name 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net. is not permitted because a conflicting RRSet of type CNAME with the same DNS name already exists in zone bar.net.

Ingress:

Rules:
  Host                                                             Path  Backends
  ----                                                             ----  --------
  00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net          
                                                                   /   foo-service:8545 ()
  primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net  
                                                                   /   foo-service:8545 ()

Route53 records (Record name / Type / Routing policy / Differentiator / Value):

  00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net  CNAME  Failover  Primary
      primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net

  00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net  CNAME  Failover  Secondary
      secondary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.dev.gcp.bar.net

  primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net  A  Simple  -
      foobar.elb.us-east-2.amazonaws.com.

  primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net  TXT  Simple  -
      "heritage=external-dns,external-dns/owner=<owner>,external-dns/resource=ingress/00a8982b-3a6d-4f87-b603-5f25090a2c01/foobar"

  _acme-challenge.primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.bsc.foobar.net  TXT  Simple  -
      "foobar"

External-dns should manage only primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net, not 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net.

janz0390 avatar Jul 20 '22 19:07 janz0390

Adding debug logs:

time="2022-07-20T19:59:38Z" level=debug msg="Endpoints generated from ingress: 00a8982b-3a6d-4f87-b603-5f25090a2c01/foo-qt-primary-rpc: [00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [] primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [] 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [] primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com []]"
time="2022-07-20T19:59:38Z" level=debug msg="Endpoints generated from ingress: 00a8982b-3a6d-4f87-b603-5f25090a2c01/foo-qt-primary-health: [00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [] primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [] 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [] primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com []]"
time="2022-07-20T19:59:38Z" level=debug msg="Removing duplicate endpoint 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com []"
time="2022-07-20T19:59:38Z" level=debug msg="Removing duplicate endpoint primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com []"
time="2022-07-20T19:59:38Z" level=debug msg="Removing duplicate endpoint 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com []"
time="2022-07-20T19:59:38Z" level=debug msg="Removing duplicate endpoint primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com []"
time="2022-07-20T19:59:38Z" level=debug msg="Removing duplicate endpoint 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com []"
time="2022-07-20T19:59:38Z" level=debug msg="Removing duplicate endpoint primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com []"
time="2022-07-20T19:59:38Z" level=debug msg="Modifying endpoint: 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [], setting alias=true"
time="2022-07-20T19:59:38Z" level=debug msg="Modifying endpoint: 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [{alias true}], setting aws/evaluate-target-health=true"
time="2022-07-20T19:59:38Z" level=debug msg="Modifying endpoint: primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [], setting alias=true"
time="2022-07-20T19:59:38Z" level=debug msg="Modifying endpoint: primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 0 IN CNAME  REDACTED.elb.us-east-2.amazonaws.com [{alias true}], setting aws/evaluate-target-health=true"
time="2022-07-20T19:59:38Z" level=debug msg="Skipping endpoint 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 5 IN CNAME secondary secondary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.dev.gcp.bar.net [{aws/failover SECONDARY}] because owner id does not match, found: \"\", required: \"dev.aws.us-east-2.data\""
time="2022-07-20T19:59:38Z" level=debug msg="Skipping endpoint 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net 5 IN CNAME primary primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net [{aws/failover PRIMARY} {aws/health-check-id 11fa5260-be39-486a-b798-9f7fe4d015fb}] because owner id does not match, found: \"\", required: \"dev.aws.us-east-2.data\""
time="2022-07-20T19:59:38Z" level=debug msg="Adding 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net. to zone bar.net. [Id: /hostedzone/REDACTED]"
time="2022-07-20T19:59:38Z" level=debug msg="Adding cname-primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net. to zone bar.net. [Id: /hostedzone/REDACTED]"
time="2022-07-20T19:59:38Z" level=debug msg="Adding 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net. to zone bar.net. [Id: /hostedzone/REDACTED]"
time="2022-07-20T19:59:38Z" level=debug msg="Adding cname-00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net. to zone bar.net. [Id: /hostedzone/REDACTED]"
time="2022-07-20T19:59:38Z" level=info msg="Desired change: CREATE 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net A [Id: /hostedzone/REDACTED]"
time="2022-07-20T19:59:38Z" level=info msg="Desired change: CREATE 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net TXT [Id: /hostedzone/REDACTED]"
time="2022-07-20T19:59:38Z" level=info msg="Desired change: CREATE cname-00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net TXT [Id: /hostedzone/REDACTED]"
time="2022-07-20T19:59:38Z" level=info msg="Desired change: CREATE cname-primary.00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net TXT [Id: /hostedzone/REDACTED]"
time="2022-07-20T19:59:39Z" level=error msg="InvalidChangeBatch: [RRSet of type A with DNS name 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net. is not permitted because a conflicting RRSet of type CNAME with the same DNS name already exists in zone bar.net., RRSet of type TXT with DNS name 00a8982b-3a6d-4f87-b603-5f25090a2c01.foo.bar.net. is not permitted because a conflicting RRSet of type CNAME with the same DNS name already exists in zone bar.net

janz0390 avatar Jul 20 '22 20:07 janz0390

My temporary workaround for this issue is to ignore the ingress. Ref: https://github.com/kubernetes-sigs/external-dns/issues/1910#issuecomment-976371247

bzon avatar Aug 07 '22 12:08 bzon

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 05 '22 12:11 k8s-triage-robot

/remove-lifecycle stale

seh avatar Nov 06 '22 01:11 seh

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 04 '23 02:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 06 '23 02:03 k8s-triage-robot

/remove-lifecycle rotten

seh avatar Mar 06 '23 13:03 seh

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 04 '23 14:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 04 '23 15:07 k8s-triage-robot

/remove-lifecycle rotten

justinlarose avatar Sep 23 '23 01:09 justinlarose

We're seeing almost the same behaviour with version 0.13.5, but slightly worse: instead of just logging an error, external-dns crashes, which leaves it dysfunctional in CrashLoopBackOff.

Here's a log snippet with the domain and hosted zone ID redacted:

{"level":"info","msg":"Desired change: CREATE cname-subdomain.domain.tld TXT [Id: /hostedzone/ABCDEFGHIJ123]","time":"2023-10-10T06:37:01Z"}
{"level":"info","msg":"Desired change: CREATE subdomain.domain.tld A [Id: /hostedzone/ABCDEFGHIJ123]","time":"2023-10-10T06:37:01Z"}
{"level":"info","msg":"Desired change: CREATE subdomain.domain.tld TXT [Id: /hostedzone/ABCDEFGHIJ123]","time":"2023-10-10T06:37:01Z"}
{"level":"error","msg":"Failure in zone domain.tld. [Id: /hostedzone/ABCDEFGHIJ123] when submitting change batch: InvalidChangeBatch: [RRSet of type A with DNS name subdomain.domain.tld. is not permitted because a conflicting RRSet of type CNAME with the same DNS name already exists in zone domain.tld., RRSet of type TXT with DNS name subdomain.domain.tld. is not permitted because a conflicting RRSet of type CNAME with the same DNS name already exists in zone domain.tld.]\n\tstatus code: 400, request id: abaa0a29-f4a7-426f-adf6-02f19bbe5657","time":"2023-10-10T06:37:01Z"}
{"level":"fatal","msg":"failed to submit all changes for the following zones: [/hostedzone/ABCDEFGHIJ123]","time":"2023-10-10T06:37:02Z"}

martinohmann avatar Oct 10 '23 06:10 martinohmann

@martinohmann Downgrading to 0.12.2 fixed this for me

kevupton avatar Oct 26 '23 23:10 kevupton

Hi, I had the exact same issue, and for me the fix was to add the --txt-prefix= parameter. Not sure why.
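
Presumably the prefix moves external-dns's ownership TXT record to a different name, so it no longer collides with the preexisting CNAME (the A-record conflict in the earlier logs may be a separate matter). A minimal sketch of the naming effect, using a hypothetical helper rather than the actual registry code:

```go
package main

import "fmt"

// txtName mirrors the effect of --txt-prefix: the ownership TXT record is
// written under prefix+name instead of sharing the managed record's name.
// Hypothetical helper for illustration only.
func txtName(prefix, endpointName string) string {
	return prefix + endpointName
}

func main() {
	name := "subdomain.domain.tld"
	fmt.Println(txtName("", name))        // subdomain.domain.tld -> same name as the preexisting CNAME, so the TXT change conflicts
	fmt.Println(txtName("extdns-", name)) // extdns-subdomain.domain.tld -> TXT no longer collides with the CNAME name
}
```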

allamand avatar Nov 16 '23 18:11 allamand

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 14 '24 19:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 15 '24 20:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Apr 14 '24 20:04 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 14 '24 20:04 k8s-ci-robot