
external-dns stops updating route53 on error

Open lemaral opened this issue 7 years ago • 48 comments

Issue: an (obviously broken) service hostname annotation produces the following error on Route53: "InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.sudbdomain.domain.com' status code: 400, request id: 115b8642-e047-11e7-83a0-a9e81aaef2a3", and apparently this prevents external-dns from making any further modification to the Route53 zone. If this is the expected behaviour (any error in the batch means no further modifications can be processed), it does not seem very robust as it is, and it is also difficult to diagnose. I'm interested in confirmation and workarounds :) Thanks

lemaral avatar Dec 13 '17 21:12 lemaral

I see it the opposite way: allowing only partial updates is a way to enter a scenario where the state of the records is unknown and hence unmanageable. External-DNS has to guarantee the safety of updates, e.g. not overwriting records belonging to other parties or created via other means. If we do not do batch/atomic updates, this is no longer guaranteed. We could introduce an insecure mode, but with that kind of mode you are on your own, and with a misbehaving system diagnosing would become really difficult (as opposed to checking the logs and deleting the resource with the apparently invalid hostname).

ideahitme avatar Dec 14 '17 11:12 ideahitme

Yeah, I am seeing only the initially created records being updated, not the existing ones.

srflaxu40 avatar Dec 18 '17 18:12 srflaxu40

@lemaral That is correct and thanks for bringing it up.

I kind of disagree with @ideahitme here: a single broken Endpoint should not lead to the overall malfunction of ExternalDNS for literally all other modifications. There's no inherent reason why one broken Endpoint should affect other, completely unrelated Endpoints.

However, there's a strong connection between an Endpoint's A/CNAME/ALIAS record and its corresponding TXT ownership record, so these two must be modified in a transaction.

linki avatar Jan 03 '18 12:01 linki
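For illustration, here is a minimal sketch (using the aws-sdk-go Route53 client, with a hypothetical hosted zone ID, record name, and owner ID — not external-dns code) of how an endpoint's record and its TXT ownership record travel together in a single ChangeBatch. Route53 applies the batch atomically, which is what keeps the pair consistent, but it is also why one invalid record rejects everything else submitted in the same batch:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

func main() {
	svc := route53.New(session.Must(session.NewSession()))

	// Hypothetical endpoint owned by "my-cluster": the A record and the TXT
	// ownership record are paired in one changeset.
	changes := []*route53.Change{
		upsert("app.example.com.", route53.RRTypeA, "10.0.0.1"),
		upsert("app.example.com.", route53.RRTypeTxt, `"heritage=external-dns,external-dns/owner=my-cluster"`),
	}

	// Route53 applies the whole ChangeBatch atomically: either both records
	// change or neither does.
	_, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String("ZEXAMPLE123"), // hypothetical hosted zone
		ChangeBatch:  &route53.ChangeBatch{Changes: changes},
	})
	if err != nil {
		// A single invalid record fails the entire batch, which is exactly
		// the behaviour discussed in this issue.
		fmt.Println("batch rejected:", err)
	}
}

func upsert(name, rrType, value string) *route53.Change {
	return &route53.Change{
		Action: aws.String(route53.ChangeActionUpsert),
		ResourceRecordSet: &route53.ResourceRecordSet{
			Name:            aws.String(name),
			Type:            aws.String(rrType),
			TTL:             aws.Int64(300),
			ResourceRecords: []*route53.ResourceRecord{{Value: aws.String(value)}},
		},
	}
}
```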

@linki, thanks for the follow-up. Indeed, the A/CNAME/ALIAS record and its corresponding TXT record must stay in the same batch to keep operations atomic. My concern (as you highlighted) is definitely about one failing endpoint stopping all further modifications. I understand the purpose of stacking multiple endpoints in the same batch for optimisation; however, in case of an error, the valid endpoints should still be processed correctly IMHO. If there is agreement on this and no available bandwidth among the current contributors, I'm willing to take this on and see how it can be done.

lemaral avatar Jan 03 '18 21:01 lemaral

@szuecs @Raffo @ideahitme @mikkeloscar Have a look at @lemaral's comment and let's see if we can agree on this.

linki avatar Jan 08 '18 10:01 linki

@lemaral @linki the problem is more complicated. What if you want to switch a record from an old, working application to a new one that is somehow broken? Is that change part of one transaction?

Then I would also like to discuss what the implementation should look like, because we have to be efficient with AWS calls (the fewer AWS calls the better, because of rate limiting); I guess the same is true for other cloud providers. I thought about splitting the changeset in two and applying both halves (binary search), then iterating on the half that errored, splitting it again until the broken record is found and marking the resource in Kubernetes as an "external-dns broken record".

@lemaral I would be happy to get PRs from you for this change, but please try to describe here how this would work, so that we can agree on it and spot any design issue early.

szuecs avatar Jan 08 '18 14:01 szuecs
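A rough sketch of the binary-search splitting described above, written against a hypothetical Change type and applyBatch provider call rather than the real external-dns plan types: the full batch is tried first, and only on failure is it halved and retried until the broken record is isolated.

```go
package plan

import "fmt"

// Change stands in for one DNS change (a record plus its ownership TXT record).
type Change struct{ Name string }

// applyBatch stands in for one provider API call, e.g. a Route53 ChangeBatch.
func applyBatch(changes []Change) error { return nil }

// submitChanges applies a changeset and returns the changes that could not be applied.
func submitChanges(changes []Change) []Change {
	if len(changes) == 0 {
		return nil
	}
	if err := applyBatch(changes); err == nil {
		return nil // happy path: one API call for the whole batch
	} else if len(changes) == 1 {
		// The offender is isolated; report it instead of blocking everything else.
		fmt.Printf("skipping broken record %q: %v\n", changes[0].Name, err)
		return changes
	}
	// Split the failing batch in half and retry each half.
	mid := len(changes) / 2
	return append(submitChanges(changes[:mid]), submitChanges(changes[mid:])...)
}
```

Because each level halves the slice, the recursion terminates, and the extra API calls are bounded by roughly O(log n) per broken record; the happy path stays at a single call.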

What if you want to switch a record from an old, working application to a new one that is somehow broken? Is that change part of one transaction?

I might be misunderstanding the point you are making, but switching a record from one application to another would just mean changing the hostname of two resources (e.g. two Ingresses); external-dns would pick this up and eventually arrive at the new desired state, which in this case might be a broken state if the hostname defined by the user is invalid.

mikkeloscar avatar Jan 08 '18 19:01 mikkeloscar

@szuecs, indeed, binary search seems the most efficient approach with respect to API rate limiting; we only need to ensure we're not looping forever. As for the question you brought up, can you elaborate? I don't see how adding this error handling could make things worse.

lemaral avatar Jan 29 '18 21:01 lemaral

@lemaral Think about a changeset which moves a record: 1 delete, 1 create. If we apply the delete, then get rate limited, and only then do the create, we will have created downtime for our users.

szuecs avatar Jan 30 '18 10:01 szuecs

This issue has bitten us in an unexpected way. While a binary search would indeed be optimal with respect to API call usage, a simpler solution would be to implement it as a fallback strategy: try a batch update first, and switch to individual pairs of updates if the batch one fails.

gtie avatar Mar 15 '18 21:03 gtie
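A minimal sketch of this fallback, assuming the same hypothetical Change type and applyBatch provider call as in the earlier sketch: one batched call in the normal case, and one call per record pair only when the batch is rejected.

```go
package fallback

import "fmt"

// Change stands in for one DNS change (a record plus its ownership TXT record).
type Change struct{ Name string }

// applyBatch stands in for one provider API call, e.g. a Route53 ChangeBatch.
func applyBatch(changes []Change) error { return nil }

func submitWithFallback(changes []Change) {
	if err := applyBatch(changes); err == nil {
		return // normal case: a single API call, no extra rate-limit pressure
	}
	// The batch failed: retry each record pair on its own, so one broken
	// record no longer blocks every other update.
	for _, c := range changes {
		if err := applyBatch([]Change{c}); err != nil {
			fmt.Printf("skipping broken record %q: %v\n", c.Name, err)
		}
	}
}
```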

Hi guys, I have a similar problem: time="2018-08-02T13:50:08Z" level=error msg="InvalidChangeBatch: The request contains an invalid set of changes for a resource record set 'A test.example..com.' status code: 400, request id: f7a6fff3-965a-11e8-8239-47abe2865262". I do not see any real problem with the changeset, though, and no other update can be done. Any update here?

zioalex avatar Aug 02 '18 14:08 zioalex

@zioalex which version are you using, and can you post some logs from when external-dns starts?

szuecs avatar Aug 03 '18 09:08 szuecs

Hi @szuecs, here are some logs:

time="2018-08-02T13:15:19Z" level=info msg="config: &{Master: KubeConfig: Sources:[service ingress] Namespace: FQDNTemplate: Compatibility: Provider:aws GoogleProject: DomainFilter:[] AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: Policy:upsert-only Registry: txt TXTOwnerID:my_domain.com TXTPrefix: Interval:1m0s Once:false DryRun:false LogFormat:text MetricsAddress::7979 Debug:false}"
time="2018-08-02T13:15:19Z" level=info msg="Connected to cluster at https://100.64.0.1:443"
time="2018-08-02T13:15:21Z" level=info msg="Changing records: CREATE { Action: "CREATE",

The version currently in use is v0.4.3.

zioalex avatar Aug 06 '18 14:08 zioalex

Since the error message returned by the AWS provider contains the offending DNS record, I think we could just parse it, drop that record from the batch, and run another upsert API call with the rest of the records from the initial batch.

cristim avatar Aug 27 '18 16:08 cristim
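A sketch of that parsing step: the offending name is pulled out of the error text with a regular expression that matches the message quoted at the top of this issue. Error strings are not a stable AWS API, so this could only ever be best-effort.

```go
package main

import (
	"fmt"
	"regexp"
)

// offenderRe matches the record name quoted in the InvalidChangeBatch message,
// e.g. "... encountered with '.sudbdomain.domain.com' ...".
var offenderRe = regexp.MustCompile(`encountered with '([^']+)'`)

// offendingRecord extracts the broken record name from the error text, if present.
func offendingRecord(errMsg string) (string, bool) {
	m := offenderRe.FindStringSubmatch(errMsg)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	msg := "InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) " +
		"encountered with '.sudbdomain.domain.com'"
	if name, ok := offendingRecord(msg); ok {
		// Drop this record from the batch and retry the upsert with the rest.
		fmt.Println("offending record:", name)
	}
}
```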

We are hitting this issue and currently have 57 ingresses that use the FQDNs we are monitoring with external-dns. On smaller clusters we typically have 10-15 monitored ingresses and do not hit this issue.

cnuber avatar Jan 02 '19 14:01 cnuber

@cnuber thanks for adding more information here!

@linki @njuettner we should work on a fix; these numbers are quite low, so I bet many people hit this issue.

szuecs avatar Jan 02 '19 15:01 szuecs

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Apr 27 '19 23:04 fejta-bot

/remove-lifecycle stale

szuecs avatar Apr 28 '19 08:04 szuecs

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jul 27 '19 09:07 fejta-bot

/remove-lifecycle stale

szuecs avatar Jul 27 '19 12:07 szuecs

Hey all. See my pull request #1209 for a potential fix (I also included example logs of the error case described in this issue).

alfredkrohmer avatar Sep 25 '19 10:09 alfredkrohmer

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 24 '19 11:12 fejta-bot

/remove-lifecycle stale

szuecs avatar Dec 24 '19 21:12 szuecs

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Mar 23 '20 22:03 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Apr 22 '20 23:04 fejta-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

fejta-bot avatar May 23 '20 00:05 fejta-bot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar May 23 '20 00:05 k8s-ci-robot

/remove-lifecycle rotten

alfredkrohmer avatar Apr 05 '21 20:04 alfredkrohmer

/reopen

alfredkrohmer avatar Apr 05 '21 20:04 alfredkrohmer

@devkid: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 05 '21 20:04 k8s-ci-robot