external-dns
external-dns stops updating route53 on error
Issue: an (obviously broken) service hostname annotation is creating the following error on route53: "InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.sudbdomain.domain.com' status code: 400, request id: 115b8642-e047-11e7-83a0-a9e81aaef2a3", and apparently this is preventing any further modification of the route53 zone by external-dns. If this is the expected behaviour (any error in the batch means no further modifications can be processed), it does not seem very robust as it stands, and it is also difficult to diagnose. I'm interested in confirmation and workarounds :) Thanks
I see it the opposite way: allowing only partial updates is a way to end up in a scenario where the state of records is unknown and hence unmanageable. External-DNS has to guarantee the safety of updates, e.g. not overwriting records belonging to other parties and created via other means. If we do not do batch/atomic updates, this is no longer guaranteed. We could introduce an insecure mode, but with that kind of mode you are on your own, and in the case of a misbehaving system, diagnosing it would become really difficult (as opposed to checking the logs and deleting the resource with the apparently invalid hostname).
Yeah, I am seeing only initially created records being updated, not existing ones.
@lemaral That is correct and thanks for bringing it up.
I kind of disagree with @ideahitme here: a single broken Endpoint should not lead to the overall malfunction of ExternalDNS for literally all other modifications. There's no inherent reason why one broken Endpoint should affect other, completely unrelated Endpoints.
However, there's a strong connection between an Endpoint's A/CNAME/ALIAS record and its corresponding TXT ownership record, so these two must be modified in a transaction.
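To illustrate (with hypothetical types, not the actual external-dns plan code), the changes could be grouped per DNS name so a record and its TXT ownership record always end up in the same submission unit:

```go
package main

import "fmt"

// change is a simplified stand-in for a provider change
// (hypothetical type, not the real external-dns plan change).
type change struct {
	Action string // CREATE, UPSERT, DELETE
	Name   string // DNS name
	Type   string // A, CNAME, TXT, ...
}

// groupByName bundles all changes for the same DNS name, so an
// A/CNAME/ALIAS record and its TXT ownership record are always submitted
// (and fail) together, independently of unrelated names.
func groupByName(changes []change) map[string][]change {
	groups := make(map[string][]change)
	for _, c := range changes {
		groups[c.Name] = append(groups[c.Name], c)
	}
	return groups
}

func main() {
	changes := []change{
		{Action: "CREATE", Name: "app.domain.com", Type: "A"},
		{Action: "CREATE", Name: "app.domain.com", Type: "TXT"},
		{Action: "CREATE", Name: ".subdomain.domain.com", Type: "A"}, // broken name stays in its own unit
	}
	for name, unit := range groupByName(changes) {
		fmt.Printf("unit %q: %d change(s)\n", name, len(unit))
	}
}
```

Grouping like this keeps each record/ownership pair atomic while leaving unrelated names free to succeed or fail on their own.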
@linki, thanks for the follow-up. Indeed, A/CNAME/ALIAS and the corresponding TXT record must stay in the same batch to keep operations atomic. My concern (as you highlighted) is definitely about one failing endpoint stopping all further modifications. I understand the rationale for stacking multiple endpoints in the same batch as an optimisation; however, in case of an error, the valid endpoints should still be processed correctly IMHO. If there is agreement on this and no available bandwidth among the current contributors, I'm willing to take this on and see how it can be done.
@szuecs @Raffo @ideahitme @mikkeloscar Have a look at @lemaral's comment and let's see if we can agree on this.
@lemaral @linki the problem is more complicated. What if you want to switch a record from an old, working application to a new one which is somehow broken? Is that change done within one transaction?
Then I would also like to discuss how the implementation should look, because we have to be efficient with AWS calls (the fewer AWS calls the better, because of rate limiting). I guess the same is true for other cloud providers. I thought about splitting the changeset in two and applying both halves (binary search), then iterating on the half that errored and splitting it again until the broken record is found, so the resource in Kubernetes can be marked as "external-dns broken record".
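A rough sketch of that splitting idea (hypothetical helpers only, not the actual external-dns code; submit stands in for the Route53 call and markBroken for the Kubernetes-side marking):

```go
package main

import (
	"fmt"
	"strings"
)

// unit is one endpoint's changes (the record plus its TXT ownership record),
// kept together so every unit is still applied atomically.
type unit []string

// applyOrIsolate submits the whole set of units; on error it splits the set
// in half and recurses, so all valid units still get applied and the broken
// unit is isolated with a few extra API calls instead of one call per unit.
func applyOrIsolate(units []unit, submit func([]unit) error, markBroken func(unit, error)) {
	if len(units) == 0 {
		return
	}
	err := submit(units)
	if err == nil {
		return
	}
	if len(units) == 1 {
		// Cannot split any further: this is the broken record set.
		markBroken(units[0], err)
		return
	}
	mid := len(units) / 2
	applyOrIsolate(units[:mid], submit, markBroken)
	applyOrIsolate(units[mid:], submit, markBroken)
}

func main() {
	units := []unit{
		{"UPSERT A app.domain.com", "UPSERT TXT app.domain.com"},
		{"UPSERT A .subdomain.domain.com", "UPSERT TXT .subdomain.domain.com"}, // empty first label
		{"UPSERT A api.domain.com", "UPSERT TXT api.domain.com"},
	}
	// Fake submit standing in for the real Route53 call: it rejects the whole
	// batch as soon as any change contains an empty label, like Route53 does.
	submit := func(us []unit) error {
		for _, u := range us {
			for _, c := range u {
				if strings.Contains(c, " .") {
					return fmt.Errorf("InvalidChangeBatch: DomainLabelEmpty encountered with %q", c)
				}
			}
		}
		fmt.Printf("applied %d unit(s)\n", len(us))
		return nil
	}
	markBroken := func(u unit, err error) {
		fmt.Printf("marking broken record set %v: %v\n", u, err)
	}
	applyOrIsolate(units, submit, markBroken)
}
```

Because a single unit either succeeds or is marked broken, the recursion always terminates, and with one broken record among n units only a handful of extra API calls are needed.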
@lemaral I would be happy to get PRs from you for this change, but please try to describe how this would work here, such that we can agree on this and might spot a design issue early.
What if you want to switch a record from an old, working application to a new one which is somehow broken? Is that change done within one transaction?
I might misunderstand the point you are making, but switching a record from one application to another would just mean changing the hostname of two resources (e.g. two ingresses) and external-dns would just pick this up and eventually arrive at the new desired state, which in this case might be a broken state if the hostname defined by the user is invalid.
@szuecs, indeed, binary search seems the most efficient approach wrt API rate limiting; we only need to ensure we're not looping forever. Regarding the question you brought up, can you elaborate? I don't see how adding this error handling could make things worse.
@lemaral Think about a changeset which moves a record: 1 delete, 1 create. If we apply the delete, then get rate-limited, and only afterwards do the create, we will have created downtime for our users.
This issue has bitten us in an unexpected way. While a binary search would indeed be optimal wrt API call usage, a simpler solution would be to implement it as a fallback strategy: try a batch update first, and switch to individual pairs of updates if the batch one fails.
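A minimal sketch of that fallback, assuming the changes are already grouped into record/TXT pairs (hypothetical names, not the actual external-dns provider code):

```go
package main

import (
	"fmt"
	"strings"
)

// pair is one record together with its TXT ownership record, so ownership
// stays consistent even when we fall back to smaller submissions.
type pair []string

// submitWithFallback tries the whole batch once; if the provider rejects it,
// every pair is retried on its own, so one invalid hostname no longer blocks
// all other records.
func submitWithFallback(pairs []pair, submit func([]pair) error) {
	if err := submit(pairs); err == nil {
		return
	}
	for _, p := range pairs {
		if err := submit([]pair{p}); err != nil {
			fmt.Printf("skipping invalid record set %v: %v\n", p, err)
		}
	}
}

func main() {
	pairs := []pair{
		{"UPSERT A app.domain.com", "UPSERT TXT app.domain.com"},
		{"UPSERT A .subdomain.domain.com", "UPSERT TXT .subdomain.domain.com"}, // broken
		{"UPSERT A api.domain.com", "UPSERT TXT api.domain.com"},
	}
	// Fake submit standing in for the real Route53 call.
	submit := func(ps []pair) error {
		for _, p := range ps {
			for _, c := range p {
				if strings.Contains(c, " .") {
					return fmt.Errorf("InvalidChangeBatch: DomainLabelEmpty")
				}
			}
		}
		fmt.Printf("applied %d pair(s)\n", len(ps))
		return nil
	}
	submitWithFallback(pairs, submit)
}
```

This costs a single API call when everything is valid and at most n+1 calls when the batch is rejected, which is simpler than the binary search at the price of more calls in the failure case.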
Hi guys, I have a similar problem:
time="2018-08-02T13:50:08Z" level=error msg="InvalidChangeBatch: The request contains an invalid set of changes for a resource record set 'A test.example..com.' status code: 400, request id: f7a6fff3-965a-11e8-8239-47abe2865262"
Though I do not see any real problem with the changeset.
And no other updates can be done. Any update on this?
@zioalex which version are you using, and can you post some logs from when external-dns starts?
Hi @szuecs, here are some logs:
time="2018-08-02T13:15:19Z" level=info msg="config: &{Master: KubeConfig: Sources:[service ingress] Namespace: FQDNTemplate: Compatibility: Provider:aws GoogleProject: DomainFilter:[] AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: Policy:upsert-only Registry: txt TXTOwnerID:my_domain.com TXTPrefix: Interval:1m0s Once:false DryRun:false LogFormat:text MetricsAddress::7979 Debug:false}" time="2018-08-02T13:15:19Z" level=info msg="Connected to cluster at https://100.64.0.1:443" time="2018-08-02T13:15:21Z" level=info msg="Changing records: CREATE { Action: "CREATE",
The version we are running is v0.4.3.
Since the error message given by the AWS provider contains the offending DNS record, I think we could just parse it, skip that record from the batch, and run another Upsert API call with the rest of the records from the initial batch.
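As a sketch of that idea (the wording of the error is not a stable API, so this is best-effort parsing, and all names here are hypothetical):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// offenderRE matches the record name quoted in a Route53 InvalidChangeBatch
// message such as:
//   InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty)
//   encountered with '.subdomain.domain.com' status code: 400
var offenderRE = regexp.MustCompile("encountered with '([^']+)'")

// offenderFromError extracts the offending record name, if present.
func offenderFromError(msg string) (string, bool) {
	m := offenderRE.FindStringSubmatch(msg)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// withoutName drops every change touching the offending name so the rest of
// the original batch can be retried with a second Upsert call.
// (Substring matching is crude but enough for a sketch.)
func withoutName(changes []string, name string) []string {
	var kept []string
	for _, c := range changes {
		if !strings.Contains(c, name) {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	msg := "InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.subdomain.domain.com' status code: 400"
	changes := []string{
		"UPSERT A app.domain.com",
		"UPSERT A .subdomain.domain.com",
		"UPSERT TXT .subdomain.domain.com",
	}
	if name, ok := offenderFromError(msg); ok {
		fmt.Println("offending record:", name)
		fmt.Println("retrying with:", withoutName(changes, name))
	}
}
```

One caveat: the error may mention several offending record sets, so a real implementation would probably need to repeat this until the retry succeeds, or fall back to smaller batches.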
We are hitting this issue and currently have 57 ingresses that use the FQDNs we are monitoring with external-dns. On smaller clusters we typically have 10-15 monitored ingresses and do not hit this issue.
@cnuber thanks for adding more information here!
@linki @njuettner we should work on a fix; those numbers are quite low, so I bet many people hit this issue.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
Hey all. See my pull request #1209 for a potential fix (I also included example logs of the error case described in this issue).
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten
/reopen
@devkid: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.