external-dns
One invalid record in ChangeBatch stops all others from updating
What happened:
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo.example.io A [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo-us-west-1.example.io A [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo-host-il.example.io A [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo.example.io TXT [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo-us-west-1.example.io TXT [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo-host-il.example.io TXT [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=error msg="Failure in zone example.io. [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=error msg="InvalidChangeBatch: [RRSet of type A with DNS name demo.example.io. is not permitted because a conflicting RRSet of type CNAME with the same DNS name already exists in zone example.io., RRSet of type TXT with DNS name demo.example.io. is not permitted because a conflicting RRSet of type CNAME with the same DNS name already exists in zone example.io.]\n\tstatus code: 400, request id: ca31ed28-2fef-4429-b769-ae04d297da51"
time="2020-04-13T22:23:40Z" level=error msg="Failed to submit all changes for the following zones: [/hostedzone/ZVEABCZXYZ123]"
What you expected to happen: Ignore the invalid record and process the others.
The use case here is that the record demo.example.io is created outside of the K8s cluster, but the K8s ingress still needs to be able to handle traffic for this host, since the CNAME is set up to fail over between two K8s clusters.
In previous versions of external-dns (<= v0.5.17) everything worked, since it simply ignored any records that already existed. Now it batches changes and fails everything, even when only one of the records is "invalid".
Perhaps we need an "ignore" configuration option that would tell external-dns to continue when N records fail instead of trying to do bulk, atomic submissions?
Environment: external-dns v0.7.1, K8s v1.16.2
A similar problem: https://github.com/kubernetes-sigs/external-dns/issues/731
I think as a workaround you could try batch size 1 or 2.
Thanks @szuecs - adding --aws-batch-change-size=2 as an arg works. Am I missing something, or is there no documentation for this?
It's not really meant to fix your issue. It's meant to reduce the bytes sent, because of AWS API limits.
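For reference, a minimal sketch of how this flag can be passed to the external-dns container (the image tag and the other args are illustrative placeholders; only --aws-batch-change-size comes from this thread):

```yaml
# Illustrative Deployment fragment - only --aws-batch-change-size is taken
# from this thread; the image tag and the remaining args are placeholders.
containers:
  - name: external-dns
    image: registry.k8s.io/external-dns/external-dns:v0.7.1  # placeholder tag
    args:
      - --source=ingress
      - --provider=aws
      - --aws-batch-change-size=2  # smaller batches: a workaround, not a fix
```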
Starting to work on this - https://kubernetes.slack.com/archives/C771MKDKQ/p1592295222475600
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Can someone confirm this is still an issue with the latest release (v0.7.3)?
/remove-lifecycle stale
This still seems to be pending - at least that is what the changelogs show.
I can confirm this is still an issue on 0.7.6:
time="2021-03-30T09:40:38Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=info msg="Desired change: CREATE lorem.elpenguino.net CNAME [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=info msg="Desired change: CREATE loremipsumdolorsitametconsecteturadipiscingelitloremipsumdolorsitametconsecteturadipiscingelit.elpenguino.net CNAME [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=info msg="Desired change: CREATE txtlorem.elpenguino.net TXT [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=info msg="Desired change: CREATE txtloremipsumdolorsitametconsecteturadipiscingelitloremipsumdolorsitametconsecteturadipiscingelit.elpenguino.net TXT [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=error msg="Failure in zone elpenguino.net. [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=error msg="InvalidChangeBatch: [FATAL problem: DomainLabelTooLong (Domain label is too long) encountered with 'loremipsumdolorsitametconsecteturadipiscingelitloremipsumdolorsi', FATAL problem: DomainLabelTooLong (Domain label is too long) encountered with 'txtloremipsumdolorsitametconsecteturadipiscingelitloremipsumdolo']\n\tstatus code: 400, request id: 065e58ea-8383-4987-8b99-ff2f76634d8c"
time="2021-03-30T09:41:39Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/Z00-redacted--VC]"
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Do we have any updates on this issue?
I am planning to work on this when I have time to fix it. 😀 In general I would be happy to review a PR here, if it uses binary search to figure out the bad records: always split the batch in two and submit both halves; one will hopefully work and the other not, then split the failing half and try again. We need to be careful with AWS API limits.
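A minimal sketch of that split-and-retry idea (the Change type and submit function below are stand-ins, not the real external-dns Route 53 provider code):

```go
package main

import (
	"fmt"
	"strings"
)

// Change stands in for the real Route 53 change type used by external-dns.
type Change struct{ Name string }

// submitFunc stands in for one ChangeResourceRecordSets call; it returns an
// error when the provider rejects the whole batch (e.g. InvalidChangeBatch).
type submitFunc func(batch []Change) error

// submitWithSplit tries the whole batch first. If that fails and the batch
// holds more than one change, it splits the batch in half and retries each
// half, recursively narrowing down to the individual bad records. Single
// changes that still fail are returned so the caller can log and skip them.
func submitWithSplit(batch []Change, submit submitFunc) (failed []Change) {
	if len(batch) == 0 {
		return nil
	}
	if submit(batch) == nil {
		return nil
	}
	if len(batch) == 1 {
		return batch // one of the bad records: skip it instead of blocking the rest
	}
	mid := len(batch) / 2
	failed = append(failed, submitWithSplit(batch[:mid], submit)...)
	return append(failed, submitWithSplit(batch[mid:], submit)...)
}

func main() {
	// Fake submitter that rejects any batch containing a "conflicting" record,
	// mimicking the InvalidChangeBatch errors in the logs above.
	submit := func(batch []Change) error {
		for _, c := range batch {
			if strings.HasPrefix(c.Name, "demo.example.io") {
				return fmt.Errorf("InvalidChangeBatch: conflicting RRSet %s", c.Name)
			}
		}
		return nil
	}
	batch := []Change{
		{"demo.example.io"},
		{"demo-us-west-1.example.io"},
		{"demo-host-il.example.io"},
	}
	fmt.Println("skipped bad records:", submitWithSplit(batch, submit))
}
```

If many records are bad this degrades to roughly 2n-1 API calls for n changes, and a real implementation would presumably also have to keep each record together with its TXT ownership record, so the AWS API limits mentioned above still need care.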
This should have been resolved with the contribution @knackaron and myself (then knackjeff) made back in 2021 with https://github.com/kubernetes-sigs/external-dns/pull/2127, which was included in 0.9.0. Just changing the batch size back to 1 reverts to the old-style behavior and simply submits the DNS change requests one at a time, rather than limiting the "number of bytes sent" as @szuecs stated above. We have tested this in production environments and it solved this issue for us.
@jegeland I don't think it's a great solution, but it works for us as well. I meant that reducing the batch size is not an optimal solution because the user has to calculate the max bytes herself and judge how big the average DNS record might be. So it's not great for the user, and it does not reduce API calls to cloud providers. Having had too many issues with the number of API calls to AWS in the past, with several incidents, I want to fix this properly when I have a bit more time to invest into coding and testing it.
@jegeland how is that different from the workaround suggested in https://github.com/kubernetes-sigs/external-dns/issues/1517#issuecomment-613497305?
I must stress that this is a workaround and does not resolve the issue.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten