external-dns
Interval parameter doesn't work as expected with Route53
What happened: The interval parameter doesn't work as expected. The interval is set to 5 minutes, but the record is only updated in AWS after about 20 minutes.
I recreated my service at around 2023-04-26T08:08. After 7 minutes external-dns picked up the latest IP, and another 20 minutes later it pushed the update to AWS.
Logs:
time="2023-04-26T07:55:43Z" level=debug msg="Generating matching endpoint my-domainxxxxx with EndpointAddress IP 10.18.227.38"
...
time="2023-04-26T08:15:18Z" level=debug msg="Generating matching endpoint my-domainxxxxx with EndpointAddress IP 10.18.226.80"
...
time="2023-04-26T08:34:53Z" level=debug msg="Adding my-domainxxxxx. to zone my-zonexxxx. [Id: /hostedzone/xxxxxx]"
time="2023-04-26T08:34:53Z" level=debug msg="Adding my-domainxxxxx. to zone my-zonexxxx. [Id: /hostedzone/xxxxxx]"
time="2023-04-26T08:34:53Z" level=info msg="Desired change: UPSERT my-domainxxxxx A [Id: /hostedzone/xxxxxx]"
time="2023-04-26T08:34:53Z" level=info msg="Desired change: UPSERT my-domainxxxxx TXT [Id: /hostedzone/xxxxxx]"
time="2023-04-26T08:34:53Z" level=info msg="6 record(s) in zone my-zonexxxx. [Id: /hostedzone/xxxxxx] were successfully updated"
Environment:
- External-DNS version (use external-dns --version): v0.13.2
- DNS provider: route53
- Others: --log-level=debug --log-format=text --interval=5m --source=ingress --source=service --source=istio-gateway --source=istio-virtualservice --policy=sync --registry=txt --txt-owner-id=xxxxxx --domain-filter=xxxxxx --provider=aws --aws-api-retries=3 --aws-batch-change-size=1000 --aws-batch-change-interval=10s --zone-id-filter=xxxxxx
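In case it helps reproduce this, one way to confirm which image and flags the running pod actually uses (the namespace and deployment name below are assumptions; adjust them to your cluster):

# Show the image tag and the args of the deployed external-dns container
# (namespace "kube-system" and deployment name "external-dns" are assumptions)
kubectl -n kube-system get deployment external-dns \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}{.spec.template.spec.containers[0].args}{"\n"}'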
Please also show that the resource was in the expected state at the time you expected the change to happen. That information is missing and is critical for investigating this.
What kind of resources? Both my service and external-dns were up and running. What's more, this happens every time I restart/recreate my services.
The AWS Route53 resources.
Right now I think it works as intended and I would close the issue.
The AWS Route53 resources are fine; we have other records in that zone. Can you explain why this happens if you think it works as intended?
Please provide the information I asked for. We don't have the time to dig into every issue.
Do you mean this? If not, please provide the command or an example:
aws route53 list-resource-record-sets --hosted-zone-id my-zone --query "ResourceRecordSets[?Name == 'my-domainxxxxx.']"
[
    {
        "Name": "my-domainxxxxx.",
        "Type": "A",
        "TTL": 30,
        "ResourceRecords": [
            {
                "Value": "10.18.227.38"
            }
        ]
    },
    {
        "Name": "my-domainxxxxx.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [
            {
                "Value": "\"heritage=external-dns,external-dns/owner=my-external-dns,external-dns/resource=service/my-ns/my-servicexxxx\""
            }
        ]
    }
]
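If it is useful, a simple way to see exactly when the A record value changes is to poll the same query in a loop (the zone id and record name are the same placeholders as in the command above):

# Print a UTC timestamp and the current A record value every 60 seconds
while true; do
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  aws route53 list-resource-record-sets \
    --hosted-zone-id my-zone \
    --query "ResourceRecordSets[?Name == 'my-domainxxxxx.' && Type == 'A'].ResourceRecords[0].Value" \
    --output text
  sleep 60
done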
Do I understand correctly that the A record was only changed in the provider to 10.18.226.80 at 2023-04-26T08:34:53Z? That looks really bad. Do you see any errors, such as rate limits or batches being retried because of errors?
We run with a batch size of 120 and you run with 1000, which could lead to this problem; I think we reduced ours because of similar issues. Basically, the AWS API will not accept batch calls bigger than a certain size, so in these problem cases you likely have a larger change set and external-dns falls back to single changes, which can slow down propagation the way you are seeing.
Can you test if the same issue exists in v0.13.4 (it has a change that tries to fix the batch issue)?
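A minimal sketch of the flag change being suggested here; only the AWS batching flags are shown, and 120 is simply the value mentioned above, not an official recommendation:

# Lower the batch size so a single ChangeResourceRecordSets call stays small.
# Keep all other flags exactly as in the environment section above.
external-dns \
  --provider=aws \
  --aws-batch-change-size=120 \
  --aws-batch-change-interval=10s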
You are right, the A record remained unchanged until the UPSERT action happened. There are no rate-limit errors in the external-dns log.
I tried v0.13.4 and got the same behavior, and I also tried a batch size of 120; still no luck :(
Maybe try reducing the batch size to 1? The fact that 120 works for us doesn't mean it works for you. Just try it and let's see if it helps.
Tried with v0.13.5 and got the same result. I don't see any rate-limit logs anywhere, but something here seems to be really slow.
When it iterates over all virtual services it spits out hundreds of "No endpoints could be generated from VirtualService xxx" messages within a second, but every "Endpoints generated from VirtualService: xxx" seems to take up to a second, and that appears to be where all the time goes. We run with --aws-batch-change-size=100.
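A rough way to see this in the debug logs is to count how many of those per-VirtualService lines land on each timestamp (the namespace and deployment name are assumptions; the grep patterns are the messages quoted above):

# Count per-second occurrences of the VirtualService debug messages
# to see whether the successful ones really take ~1s each.
kubectl -n kube-system logs deploy/external-dns --since=30m \
  | grep -E 'Endpoints generated from VirtualService|No endpoints could be generated from VirtualService' \
  | awk '{print $1}' \
  | sort | uniq -c | tail -n 20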
@gerasym you are right. I tried removing --source=istio-virtualservice and everything works as expected.
@BlueBlueSummer thanks, but we can't run it like that - generating records in Route 53 based on virtual services is its sole purpose for us :)
@gerasym ok, I only removed the istio-virtualservice source to confirm that it is what causes the delay :)
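For reference, a sketch of the source flags used for that confirmation test (everything else from the environment section unchanged):

# Same configuration as above, with the istio-virtualservice source removed
# purely to confirm that this source is what introduces the delay.
external-dns \
  --source=ingress \
  --source=service \
  --source=istio-gateway \
  --provider=aws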
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.