external-dns
Interval parameter doesn't work as expected with Route53
What happened: The interval parameter doesn't work as expected. The interval is set to 5 minutes, but the record is only updated in AWS after about 20 minutes.
I recreated my service at around 2023-04-26T08:08. After 7 minutes external-dns picked up the latest IP, and another 20 minutes later it pushed the update to AWS.
Logs:
time="2023-04-26T07:55:43Z" level=debug msg="Generating matching endpoint my-domainxxxxx with EndpointAddress IP 10.18.227.38"
...
time="2023-04-26T08:15:18Z" level=debug msg="Generating matching endpoint my-domainxxxxx with EndpointAddress IP 10.18.226.80"
...
time="2023-04-26T08:34:53Z" level=debug msg="Adding my-domainxxxxx. to zone my-zonexxxx. [Id: /hostedzone/xxxxxx]"
time="2023-04-26T08:34:53Z" level=debug msg="Adding my-domainxxxxx. to zone my-zonexxxx. [Id: /hostedzone/xxxxxx]"
time="2023-04-26T08:34:53Z" level=info msg="Desired change: UPSERT my-domainxxxxx A [Id: /hostedzone/xxxxxx]"
time="2023-04-26T08:34:53Z" level=info msg="Desired change: UPSERT my-domainxxxxx TXT [Id: /hostedzone/xxxxxx]"
time="2023-04-26T08:34:53Z" level=info msg="6 record(s) in zone my-zonexxxx. [Id: /hostedzone/xxxxxx] were successfully updated"
Environment:
- External-DNS version (use external-dns --version): v0.13.2
- DNS provider: route53
- Others: --log-level=debug --log-format=text --interval=5m --source=ingress --source=service --source=istio-gateway --source=istio-virtualservice --policy=sync --registry=txt --txt-owner-id=xxxxxx --domain-filter=xxxxxx --provider=aws --aws-api-retries=3 --aws-batch-change-size=1000 --aws-batch-change-interval=10s --zone-id-filter=xxxxxx
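In case it helps reproduce this, one way to confirm which image and flags the running pod actually uses (the namespace and deployment name below are assumptions; adjust them to your cluster):

# Show the image tag and the args of the deployed external-dns container
# (namespace "kube-system" and deployment name "external-dns" are assumptions)
kubectl -n kube-system get deployment external-dns \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}{.spec.template.spec.containers[0].args}{"\n"}'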
Please also show that the resource was in the expected state at the time you expected the change to happen. That information is missing and is critical for investigating this.
What kind of resources? Both my service and external-dns were up and running. What's more, this happens every time I restart/recreate my services.
The AWS Route53 resources.
Right now I think it works as intended and I would close the issue.
The AWS Route53 resources are fine; we have other records in that zone. Can you explain why this happens if you think it works as intended?
Please provide the information I asked for. We don't have the time to dig into every issue.
Do you mean this? If not, please provide the command or an example:
aws route53 list-resource-record-sets --hosted-zone-id my-zone --query "ResourceRecordSets[?Name == 'my-domainxxxxx.']"
[
    {
        "Name": "my-domainxxxxx.",
        "Type": "A",
        "TTL": 30,
        "ResourceRecords": [
            {
                "Value": "10.18.227.38"
            }
        ]
    },
    {
        "Name": "my-domainxxxxx.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [
            {
                "Value": "\"heritage=external-dns,external-dns/owner=my-external-dns,external-dns/resource=service/my-ns/my-servicexxxx\""
            }
        ]
    }
]
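If it is useful, a simple way to see exactly when the A record value changes is to poll the same query in a loop (the zone id and record name are the same placeholders as in the command above):

# Print a UTC timestamp and the current A record value every 60 seconds
while true; do
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  aws route53 list-resource-record-sets \
    --hosted-zone-id my-zone \
    --query "ResourceRecordSets[?Name == 'my-domainxxxxx.' && Type == 'A'].ResourceRecords[0].Value" \
    --output text
  sleep 60
done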
Do I understand correctly that the A record was only changed in the provider to 10.18.226.80 at 2023-04-26T08:34:53Z? That looks really bad. Do you see any errors, such as rate limits or batches being retried because of errors?
We run with a batch size of 120 and you run with 1000, which could lead to this problem; I think we reduced ours because of similar issues. Basically, the AWS API will not accept batch calls bigger than a certain size, so in these problem cases you likely have a larger change set and external-dns falls back to single changes, which can slow down propagation the way you are seeing.
Can you test if the same issue exists in v0.13.4 (it has a change that tries to fix the batch issue)?
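A minimal sketch of the flag change being suggested here; only the AWS batching flags are shown, and 120 is simply the value mentioned above, not an official recommendation:

# Lower the batch size so a single ChangeResourceRecordSets call stays small.
# Keep all other flags exactly as in the environment section above.
external-dns \
  --provider=aws \
  --aws-batch-change-size=120 \
  --aws-batch-change-interval=10s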
You are right, the A record remained unchanged until the UPSERT action happened. There are no rate-limit errors in the external-dns log.
I tried v0.13.4 and got the same behavior, and I also tried a batch size of 120; still no luck :(
Maybe try reducing the batch size to 1? The fact that 120 works for us doesn't mean it works for you. Just try it and let's see if it helps.
Tried with v0.13.5 and got the same result. I don't see any rate-limit logs anywhere, but something here seems to be really slow.
When it iterates over all virtual services it spits out hundreds of "No endpoints could be generated from VirtualService xxx" messages within a second, but every "Endpoints generated from VirtualService: xxx" seems to take up to a second, and that appears to be where all the time goes. We run with --aws-batch-change-size=100.
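A rough way to see this in the debug logs is to count how many of those per-VirtualService lines land on each timestamp (the namespace and deployment name are assumptions; the grep patterns are the messages quoted above):

# Count per-second occurrences of the VirtualService debug messages
# to see whether the successful ones really take ~1s each.
kubectl -n kube-system logs deploy/external-dns --since=30m \
  | grep -E 'Endpoints generated from VirtualService|No endpoints could be generated from VirtualService' \
  | awk '{print $1}' \
  | sort | uniq -c | tail -n 20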
@gerasym you are right. I tried removing --source=istio-virtualservice and everything works as expected.
@BlueBlueSummer thanks, but we can't run it like that - generating records in Route 53 based on virtual services is its sole purpose for us :)
@gerasym ok, I only removed the istio-virtualservice source to confirm that it is what causes the delay :)
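For reference, a sketch of the source flags used for that confirmation test (everything else from the environment section unchanged):

# Same configuration as above, with the istio-virtualservice source removed
# purely to confirm that this source is what introduces the delay.
external-dns \
  --source=ingress \
  --source=service \
  --source=istio-gateway \
  --provider=aws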
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.