
Some new TXT records are not being cleaned up, causing an "InvalidChangeBatch" error

Open born4new opened this issue 1 year ago • 22 comments

What happened:

After deleting some ingress resources, it seems that the new TXT record is not being cleaned up, while the other two DNS entries (the A record and the legacy TXT record) are. When searching for the DNS records in AWS Route 53, this is what we see:

Searching for <our-dns-name>.

[]

Searching for a-<our-dns-name>.

    {
        "Name": "a-<our-dns-name>.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [
            {
                "Value": "\"heritage=external-dns,external-dns/owner=<our-owner-string>,external-dns/resource=ingress/<our-ingress>\""
            }
        ]
    }
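
For context, the lookups above can be reproduced with the AWS CLI; the zone ID and record names below are placeholders:

    # List any record sets matching the ownership TXT name
    aws route53 list-resource-record-sets \
      --hosted-zone-id "$ZONE_ID" \
      --query "ResourceRecordSets[?Name=='a-<our-dns-name>.']"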

This later causes an issue when we redeploy the application, as external-dns tries to create all three DNS entries (the A record, the legacy TXT and the new TXT):

time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE a-<our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> A [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="Failure in zone <our-dns-zone>. [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='a-<our-dns-name>.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: d65bc8e2-4055-4d9f-8412-4653debd76ff"

What you expected to happen:

The new TXT record should be cleaned up in the first place; alternatively, external-dns could replace the TXT record if it already exists, or offer an option to do so.
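
For illustration, the "replace if it already exists" behaviour we are asking for maps to Route 53's UPSERT action; a hedged sketch of what such a change batch would look like (all names and values are placeholders):

    # UPSERT creates the record if missing and overwrites it if present,
    # so the batch cannot fail with "but it already exists"
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$ZONE_ID" \
      --change-batch '{
        "Changes": [{
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "a-<our-dns-name>.",
            "Type": "TXT",
            "TTL": 300,
            "ResourceRecords": [
              {"Value": "\"heritage=external-dns,external-dns/owner=<our-owner-string>,external-dns/resource=ingress/<our-ingress>\""}
            ]
          }
        }]
      }'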

How to reproduce it (as minimally and precisely as possible):

I do not know how to reproduce this issue easily, but I'm more than happy to provide as much debugging info as needed.

Anything else we need to know?:

N/A

Environment:

  • External-DNS version (use external-dns --version): 0.13.1
  • DNS provider: AWS
  • Others:

born4new avatar Nov 23 '22 15:11 born4new

This definitely looks similar to https://github.com/kubernetes-sigs/external-dns/issues/3007, https://github.com/kubernetes-sigs/external-dns/issues/2421, and https://github.com/kubernetes-sigs/external-dns/issues/2793.

rymai avatar Nov 23 '22 19:11 rymai

@born4new does setting --aws-batch-change-size=1 resolve your problem? (i.e., is it purely the batching that is broken?)

benjimin avatar Nov 23 '22 21:11 benjimin

does setting --aws-batch-change-size=1 resolve your problem?

We haven't specifically tried a size of 1, but we have tried a few values (e.g. 20, 200, 1000); none of them helped.

The fix for us was to go back to an external-dns version below 0.12.0, so that external-dns wouldn't be aware of the newly introduced TXT record. This seems to indicate a problem in the way the new TXT records are cleaned up...
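
In case it helps anyone, this is roughly how we pinned the older image (the namespace, container name, and tag are assumptions; adjust to your deployment):

    # Pin external-dns below 0.12.0 so it ignores the new-style TXT records
    kubectl -n external-dns set image deployment/external-dns \
      external-dns=registry.k8s.io/external-dns/external-dns:v0.11.0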

born4new avatar Dec 06 '22 13:12 born4new

We are facing the exact same issue.

JonathanLachapelle avatar Dec 09 '22 17:12 JonathanLachapelle

Does it happen on all records or just sometimes?

JonathanLachapelle avatar Dec 12 '22 13:12 JonathanLachapelle

Does it happen on all records or just sometimes?

@JonathanLachapelle It was happening on some records only.

born4new avatar Dec 14 '22 09:12 born4new

We faced the same issue today: we are using AWS Route53, and our external-dns version is 0.12.2.

{"level":"error","msg":"InvalidChangeBatch: [Tried to create resource record set [name='cname-runtime-api-dev-amy.development.voiceflow.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 4db33c47-f34f-4a36-8d60-b2cb0750578d","time":"2022-12-14T11:11:20Z"}

xavidop avatar Dec 14 '22 11:12 xavidop

I have faced the same issue after updating the external-dns version from 0.12.0 to 0.13.1. Instead of syncing with the previously created TXT record graylog.<domain>, it tries to create cname-graylog.<domain>, which fails with the output below:

time="2022-12-14T11:49:54Z" level=error msg="InvalidChangeBatch: [The request contains an invalid set of changes for a resource record set 'TXT cname-graylog.<domain>.', The request contains an invalid set of changes for a resource record set 'TXT cname-mongodb.<domain>.', The request contains an invalid set of changes for a resource record set 'TXT cname-tcp.graylog.<domain>.']\n\tstatus code: 400, request id: <Id>"
time="2022-12-14T11:49:54Z" level=info msg="Desired change: CREATE cname-graylog.<domain> TXT [Id: /hostedzone/<hostedzone>]"
...

ArturChe avatar Dec 14 '22 13:12 ArturChe

I have faced the same issue. I got a new cluster up with external-dns chart version 6.12.1, which uses image 0.13.1, but it errors out with InvalidChangeBatch when trying to create the cname-<domain> entry.

Also, when I switch back to version 0.11.0, it keeps deleting and re-creating the Route 53 records instead of updating them. Here, I am using --upsert-policy.

Desired change: CREATE 123.dev.cloud A
Desired change: CREATE 123.dev.cloud TXT
Applying provider record filter for domains
Desired change: CREATE 123.dev.cloud A
Desired change: CREATE 123.dev.cloud TXT

It's a huge blocker.

IKohli09 avatar Jan 26 '23 03:01 IKohli09

We are experiencing the same issue with version 0.13.1 and Kubernetes 1.21 or higher. In our case, when the issue happens, external-dns stops processing requests until we go to AWS and manually remove the leftovers.

logs:

time="2023-02-01T12:55:01Z" level=error msg="Failure in zone qa.controlup.com. [Id: /hostedzone/XXXXXXXXXX]"
time="2023-02-01T12:55:01Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cname-x.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 8b8e55e1-efe0-452d-96da-af65ff122fca"
time="2023-02-01T12:55:01Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/XXXXXXXXXX]"

liad5h avatar Feb 01 '23 13:02 liad5h

I'm having the same problem with version 0.13.1 and --aws-batch-change-size=100. I tried --aws-batch-change-size=1 and started to get warnings like

time="2023-02-08T15:32:34Z" level=warning msg="Total changes for xxx.yyy.zzz exceeds max batch size of 1, total changes: 2"

and the errors as described above kept coming.

So I tried --aws-batch-change-size=2, and that actually resolved the problem for me.
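
In case it helps, here's roughly where that flag goes, assuming the Bitnami chart, which exposes it as aws.batchChangeSize (verify against your chart version's values):

    # Sketch only: sets --aws-batch-change-size=2 via the chart value
    helm upgrade external-dns bitnami/external-dns \
      --reuse-values \
      --set aws.batchChangeSize=2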

msvticket avatar Feb 08 '23 15:02 msvticket

Same problem as well. I wish there were a "force-overwrite" option where we could just tell external-dns to overwrite records; we have multiple clusters that hit this error and are seemingly stuck. The worst part is that good, new ingresses never get their DNS records created, since they are batched up with these bogus retries.

jbilliau-rcd avatar Feb 13 '23 16:02 jbilliau-rcd

We're facing the same issue with v0.13.2 and the suggested batch size changes do not work:

  • With --aws-batch-change-size=1: it tries to create the already existing TXT record, which fails. It does not even attempt to create the A record, presumably because the first batch change within the sync interval failed. This never resolves itself and continues like this in every sync interval.
  • With --aws-batch-change-size=2: it tries to create the A record and the already existing TXT record in one batch, and this fails. Same behaviour as above: it's stuck.

The only option we have is to either manually create the A record, or to delete the existing TXT records so that external-dns can properly recreate everything.
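
For anyone needing that manual cleanup, a minimal sketch with the AWS CLI (the zone ID, names, and values are placeholders; note that a Route 53 DELETE must match the existing record exactly, including value and TTL):

    # Delete the leftover ownership TXT record so external-dns can recreate everything
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$ZONE_ID" \
      --change-batch '{
        "Changes": [{
          "Action": "DELETE",
          "ResourceRecordSet": {
            "Name": "cname-host.example.com.",
            "Type": "TXT",
            "TTL": 300,
            "ResourceRecords": [
              {"Value": "\"heritage=external-dns,external-dns/owner=<owner>,external-dns/resource=ingress/<ingress>\""}
            ]
          }
        }]
      }'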

The expected behaviour would be to not attempt to create the TXT records again (if anything, it should upsert existing records).

Update: from what I can see, there's already a change in master which might partially fix this (https://github.com/kubernetes-sigs/external-dns/commit/7dd84a589d4725ccf25d94f8d71b0146fee4bfcc), but it's still unreleased.

martinohmann avatar Feb 16 '23 12:02 martinohmann

Same problem here...

cyril94440 avatar May 05 '23 17:05 cyril94440

The same problem after updating external-dns from 0.10.2 to 0.13.4

Some details about the environment:

  1. Provider: aws
  2. EKS: 1.24.0

Details about the issue:

At the start we have 3 records:

  • (A) - alias for the LB, host.example.com
  • (TXT) - old-style TXT for backward compatibility, host.example.com
  • (TXT) - new-style TXT, cname-host.example.com

  1. Test - Removing the new-style TXT cname-host.example.com
     Result: Looks OK, the record was restored.
     time="2023-06-08T13:05:04Z" level=info msg="Desired change: CREATE cname-host.example.com. TXT [Id: /hostedzone/xxx]"

  2. Test - Removing the old-style TXT host.example.com
     Result: Looks OK, the record was restored.
     time="2023-06-08T13:07:05Z" level=debug msg="Adding host.example.com. [Id: /hostedzone/xxx]"

  3. Test - Removing both the old-style TXT and the new-style TXT
     Result: The records were not restored, and there were no errors or creation attempts in the logs.

  4. Test - Removing the alias host.example.com and both TXT records
     Result: OK, all 3 records were restored.
     time="2023-06-08T13:18:18Z" level=debug msg="Adding host.example.com. to zone xxx. [Id: /hostedzone/xxx]"
     time="2023-06-08T13:18:18Z" level=debug msg="Adding host.example.com. to zone xxx. [Id: /hostedzone/xxx]"
     time="2023-06-08T13:18:18Z" level=debug msg="Adding cname-host.example.com. to zone xxx. [Id: /hostedzone/Z010946512D3RO332W8MB]"
     time="2023-06-08T13:18:19Z" level=info msg="Desired change: CREATE host.example.com TXT [Id: /hostedzone/xxx]"
     time="2023-06-08T13:18:19Z" level=info msg="Desired change: CREATE host.example.com A [Id: /hostedzone/xxx]"
     time="2023-06-08T13:18:19Z" level=info msg="Desired change: CREATE cname-host.example.com TXT [Id: /hostedzone/xxx]"

  5. Test - Removing the alias host.example.com only
     Result: Failure, the alias was not restored.
     time="2023-06-08T13:22:23Z" level=error msg="Failure in zone xxx. [Id: /hostedzone/Z010946512D3RO332W8MB] when submitting change batch: InvalidChangeBatch: [Tried to create resource record set [name='cname-host.example.com.', type='TXT'] but it already exists, Tried to create resource record set [name='host.example.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: xxx"
     time="2023-06-08T13:22:24Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/xxx]"
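
To double-check which ownership records are left behind after each test, a quick query (the zone ID is a placeholder):

    # List any remaining ownership TXT records for the host
    for name in host.example.com. cname-host.example.com.; do
      aws route53 list-resource-record-sets \
        --hosted-zone-id "$ZONE_ID" \
        --query "ResourceRecordSets[?Name=='${name}' && Type=='TXT']"
    done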

I suspect a force override wouldn't cause "Rate exceeded" issues with the AWS API, because losing an alias record is very rare, at least for us. Still, the current behavior is uncomfortable and unexpected; I want to be 100% sure that all our records will be restored automatically if anything goes wrong.

Additionally, it's odd that I don't see any logs at all in test 3.

Kulagin-G avatar Jun 08 '23 13:06 Kulagin-G

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 22 '24 04:01 k8s-triage-robot

/remove-lifecycle stale

ddieulivol avatar Jan 22 '24 07:01 ddieulivol

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 21 '24 08:04 k8s-triage-robot

/remove-lifecycle stale

CameronMackenzie99 avatar May 02 '24 12:05 CameronMackenzie99

I'm seeing this issue when installing v0.14.1 on a brand new EKS 1.25.

rookiehelm avatar May 10 '24 07:05 rookiehelm

The same issue happened in our EKS cluster on version 1.26.

sileyang-sf avatar May 13 '24 20:05 sileyang-sf

Hi guys, I was able to resolve my errors. A couple of pointers that helped:

  • First, the external-dns repo has various branches tagged with release versions, but the release versions don't correspond directly to the image versions hosted on GCR.
  • My issue got resolved after I used the following image: registry.k8s.io/external-dns/external-dns:v0.14.1. Please also follow the instructions from the branch tagged v0.14.1 (and not master or some other branch).
  • In my case the cluster was set up using Terraform scripts, as I needed to deploy Kubeflow. I had accidentally configured IRSA using the 'eksctl' command, which was incorrect; the docs suggest creating the service account directly via 'kubectl' (see the sketch after this list). Please be careful here: I had to manually delete the previous SA and re-create it using the right commands. After that, everything worked fine.
  • I also needed to configure the 'ingress-nginx' controller first (and not later), as 'external-dns' needs to work with the load balancer when creating the records (correct me if I'm wrong here).
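
A hedged sketch of that direct service account creation (the namespace and role ARN are placeholders for your own IRSA setup):

    # Create the ServiceAccount directly, annotated with the IRSA role,
    # instead of letting eksctl generate it
    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: external-dns
      namespace: external-dns
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/external-dns-irsa
    EOF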

rookiehelm avatar May 14 '24 08:05 rookiehelm