
external-dns continuously deleting and adding the same A and TXT records

Open paul-at-cybr opened this issue 1 year ago • 18 comments

What happened: After upgrading the external-dns image from 0.13.4 => 0.14.0, external-dns seems to have gotten stuck trying to continuously delete and recreate a subset of the records it manages.

A common thread among the domains external-dns misbehaves on, is that they all have NS records that are not managed by external-dns. In our case, these records are manually configured.

What you expected to happen: A low rate of log output, allowing me to breathe easy in the knowledge that we are not hammering the Google Cloud DNS API.

How to reproduce it (as minimally and precisely as possible):

  1. Tell external-dns to manage a domain that has existing NS records.
  2. If that isn't enough on its own, try adding other factors:
  • Cloud dns provider: Google
  • external-dns args (attached)
  • K8S platform: GKE
  • Ingress controller: traefik
  • Ingress apiVersion: networking.k8s.io/v1
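
As a purely illustrative stand-in for the attached args (hypothetical values, not the actual attachment), a minimal flag set for this kind of setup might look like:

    args:
      - --provider=google
      - --source=ingress
      - --registry=txt
      - --txt-owner-id=my-cluster        # hypothetical owner id
      - --policy=sync
      - --domain-filter=example.org      # a zone that already has manually created NS records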

Anything else we need to know?: I've collected a series of debug-level log entries, washed them for public consumption, and attached them. Hope they prove useful.

Environment:

  • External-DNS version: v0.14.0
  • DNS provider: Google
  • Ingress controller: Traefik

paul-at-cybr avatar Nov 20 '23 19:11 paul-at-cybr

Do I understand correctly that you have 1 zone that has some subdomains that delegate the subdomain to other NS servers? Can you add which subdomain you delegate and which record should have been created?

szuecs avatar Nov 21 '23 10:11 szuecs

Hi, thanks for following up!

There are no records missing in our case, and our availability has (so far) not been impacted. The primary symptom is that we see a lot of delete + add activity on A and TXT records associated with zones that have externally managed NS records.

Edit: The rest of this comment was based on an erroneous read of the logs on a particular environment. See my latest comment for an update.

~~I've had a closer look at the logs across environments, and while the problem seemed to reliably impact all zones with externally managed NS records, the problem appears to have resolved itself on zones with more simple setups.~~

~~What remains is environments where the set up is more complex, with several layers of zones across multiple google cloud projects. In the environment from which the example log was gathered, there are three nested zones, where each nested zone also exists as a set of A/TXT records in the parent zone.~~

~~The external-dns instance from which the example log was gathered manages a single zone, stage.apps.cybr.ai, but this zone has a parent zone (apps.cybr.ai) which resides in a separate project and is managed by a separate instance of external-dns.~~ ~~My new suspicion is that this somehow causes a conflict, but the logs on the external-dns instance managing the parent zone are very quiet.~~

paul-at-cybr avatar Nov 21 '23 12:11 paul-at-cybr

We also see external-dns delete/recreate zone apex records on every reconciliation. I suspect that this is related to the new TXT registry format where the record name contains the record type:

By default, the new TXT registry name is <record-type>-<dns name>, which when creating an A record for example.com becomes a-example.com. That hostname is outside the example.com zone, so the TXT record creation fails, apparently causing a delete/recreate on the next reconciliation. This will happen in any delegated zone, such as stage.apps.cybr.ai in your example.

This happens even if you have a prefix defined for registry TXT records, since the record type is prepended to the DNS name of the A record, not the prefixed name of the TXT record.
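
To make the failure mode concrete, here is an illustrative sketch of the names involved for a zone apex record in a delegated zone (names taken from this thread, record contents abbreviated):

    # Delegated zone managed by external-dns:  stage.apps.cybr.ai
    #
    # Desired record (zone apex):
    #   stage.apps.cybr.ai     A    <ingress IP>
    #
    # New-format registry record (record type prepended to the DNS name):
    #   a-stage.apps.cybr.ai   TXT  "heritage=external-dns,..."
    #
    # a-stage.apps.cybr.ai is not inside the stage.apps.cybr.ai zone, so the
    # TXT creation fails there, which apparently triggers the delete/recreate
    # on the next reconciliation loop.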

I think https://github.com/kubernetes-sigs/external-dns/pull/3774 would fix this problem.

eyvind avatar Nov 21 '23 13:11 eyvind

Follow-up: Turns out the issue is still present on the simpler setups, I was just using the wrong log filters. This should simplify the process of reproducing the problem.

We have several domains where there's only one zone and no nesting, and the root zone is affected by the TXT / A issue. Each of these domains has an externally managed NS record. I've attached a log snippet with output filtered for one of these domains, litly.io.

paul-at-cybr avatar Nov 22 '23 15:11 paul-at-cybr

Thanks! So from @paul-at-cybr logs:

2023-11-22T14:43:36Z/info: Add records: litly.io. TXT ["heritage=external-dns,external-dns/owner=dns-frontend-prod-9347b6ff,external-dns/resource=ingress/litly-api/litly-app"] 300
2023-11-22T14:43:36Z/info: Add records: litly.io. A [34.147.116.243] 300
2023-11-22T14:43:36Z/info: Del records: litly.io. TXT ["heritage=external-dns,external-dns/owner=dns-frontend-prod-9347b6ff,external-dns/resource=ingress/litly-api/litly-app"] 300
2023-11-22T14:43:36Z/info: Del records: litly.io. A [34.147.116.243] 300
2023-11-22T14:43:36Z/info: Change zone: litly-io-root batch #0

@paul-at-cybr Can you confirm that's only on APEX records as @eyvind wrote?

Maybe a workaround to try is to use a subdomain like tags.litly.io to store the ownership TXT records, e.g. --txt-suffix="-%{record_type}.tags", so that the ownership records get created correctly. This would avoid having apex records for the ownership.
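
For anyone wanting to try that, a minimal sketch of where the suggested flag would sit in the container args (the other flags are illustrative):

    args:
      - --provider=google
      - --registry=txt
      - --txt-owner-id=my-owner-id           # illustrative
      - --txt-suffix=-%{record_type}.tags    # the suffix value suggested above

Note the registry documentation caveat quoted later in this thread: changing the prefix/suffix on an existing deployment orphans the previously created ownership records.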

szuecs avatar Nov 29 '23 14:11 szuecs

@szuecs We've observed the same thing (just the Google provider) affecting all records, not just apex domains.

E.g. Our zone is testing.k8.tld and records such as app.grafana.testing.k8.tld were being recreated constantly. Worth noting we also use a subdomain --txt-suffix, i.e. meta.

(.tld is in Cloudflare, .testing.k8.tld is delegated to Google Cloud DNS)

Evesy avatar Nov 30 '23 15:11 Evesy

Looks like the regression was introduced between 0.13.5 and 0.13.6

In our case, it is affecting any records that have the following provider configuration:

Ingress:

metadata:
  annotations:
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"

CRD:

    providerSpecific:
    - name: external-dns.alpha.kubernetes.io/cloudflare-proxied
      value: "false"

Arguably this config shouldn't be present on the CRD endpoint when dnsName is one that will be managed by the Google provider, but in an Ingress resource with mixed domains it can't be helped
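
A hedged sketch of the mixed-domain situation meant here (host names partly hypothetical): the annotation is needed for the Cloudflare-managed host, but because annotations apply to the whole Ingress, the provider-specific config is also attached to the host that ends up handled by Google Cloud DNS.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: mixed-domains                    # hypothetical
      annotations:
        external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
    spec:
      rules:
        - host: app.k8.tld                   # under the Cloudflare-hosted zone (hypothetical host)
        - host: app.grafana.testing.k8.tld   # under the zone delegated to Google Cloud DNS
      # (paths/backends omitted for brevity)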

Evesy avatar Dec 01 '23 10:12 Evesy

We have the same issue:

  • Cloud dns provider: Azure
  • external-dns args: traefik-proxy
  • K8S platform: AKS
  • Ingress controller: Traefik
  • IngressRoute: traefik.io/v1alpha1

It seems that the same two records, a CNAME and a TXT, are being updated by external-dns:

  • the CNAME is the desired record, but it shouldn't be updated again after it has been created,
  • and the TXT record should not need to be recreated in this case.

hungran avatar Dec 15 '23 09:12 hungran

Terribly sorry, I've struggled to find the time to properly follow up on this.

Can you confirm that's only on APEX records as @eyvind wrote?

We're seeing the issue on subdomains as well, such as stage.apps.cybr.ai, though those domains are in the nested zone situation that I initially suspected to be the determining factor for this bug: stage.apps.cybr.ai is a subdomain of apps.cybr.ai, which has its own google_dns_managed_zone while also being a subdomain of another google_dns_managed_zone (cybr.ai).

We have not observed the bug in any other subdomains. Only on apex records and on subdomains with nested managed zones.
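
For readers trying to picture the layout, a rough sketch of the hierarchy described above (as comments; no details beyond what is stated in this thread):

    # cybr.ai                   google_dns_managed_zone
    # └─ apps.cybr.ai           own google_dns_managed_zone in a separate project,
    #                           managed by a separate external-dns instance; also
    #                           present as A/TXT records inside the cybr.ai zone
    #    └─ stage.apps.cybr.ai  own google_dns_managed_zone, managed by the instance
    #                           that produced the example log; also present as
    #                           A/TXT records inside apps.cybr.ai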

I've considered introducing a txt suffix to see if this fixes things, but this section from the txt registry readme indicates I might not want to do that:

The prefix or suffix may not be changed after initial deployment, lest the registry records be orphaned and the metadata be lost.

paul-at-cybr avatar Dec 15 '23 11:12 paul-at-cybr

Quick note that I found this bug while beginning to look into what seems to be a similar syndrome observed in our dev setup. Any pointers on how to debug would be appreciated. So far all I've seen is that in DO (where the zone is hosted) records appear and vanish at random. There's nothing obviously related in the external-dns container logs (it isn't saying "I'm creating this record..." or "I'm deleting this record...", although clearly it is). So presumably the default log level is not very informative?

dboreham avatar Dec 18 '23 22:12 dboreham

Looks like the regression was introduced between 0.13.5 and 0.13.6

In our case, it is affecting any records that have the following provider configuration:

Ingress:

metadata:
  annotations:
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"

CRD:

    providerSpecific:
    - name: external-dns.alpha.kubernetes.io/cloudflare-proxied
      value: "false"

Arguably this config shouldn't be present on the CRD endpoint when dnsName is one that will be managed by the Google provider, but in an Ingress resource with mixed domains it can't be helped

I've tracked the regression in this instance down to 5339c0c72c2cc6cf04b985223009cc03865db3b7

@johngmyers Hoping you might be able to advise on the changes in that commit. I see the key difference is that previously the provider-specific logic:

  • Updated if the current and desired values mismatched
  • Updated if a property is present in the current (unless its value is "") but missing from the desired

Whereas now there is additional logic:

  • Updated if there is a desired provider-specific config that is not in the current

This logic change does make sense on the surface, but feels like a breaking change from the old behaviour. Having multiple DNS entries that use different providers in the same resource (e.g. Ingress, CRD etc.) will cause constant deletes/recreates if that resource uses any provider-specific annotations.

cc @szuecs for your thoughts too

Feels like with this change there needs to be a way to tie provider-specific config to providers, and not have it be considered for records that are ultimately handled by a different provider.

Evesy avatar Jan 02 '24 10:01 Evesy

@Evesy thanks for your comment! In our clusters we run v0.13.6 without provider-specific values and are fine. Thanks for bisecting to the commit. I tried to create a test case in https://github.com/kubernetes-sigs/external-dns/pull/4189, but I can't reproduce it, even if I check out the v0.13.6 tag and add the test case.

szuecs avatar Jan 17 '24 16:01 szuecs

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 16 '24 16:04 k8s-triage-robot

This issue does not seem to be resolved for us as of 0.14.1. Will consider looking into it myself if time permits, though DNS, networking, and k8s internals are not among my strong suits.

/remove-lifecycle stale

paul-at-cybr avatar Apr 16 '24 19:04 paul-at-cybr

@paul-at-cybr Which providers are you using and which are affected? Are the resources responsible for the records affected using any provider specific config?

Evesy avatar Apr 19 '24 13:04 Evesy

Seeing the same thing here too with either 0.13.6 or 0.14.1 on Azure Kubernetes using Azure Private DNS.

  • Cloud dns provider: Azure
  • external-dns args:

args:
  - '--log-level=info'
  - '--log-format=text'
  - '--interval=1m'
  - '--source=ingress'
  - '--source=pod'
  - '--policy=sync'
  - '--registry=txt'
  - '--txt-owner-id=external-dns'
  - '--domain-filter=domain1.example.com'
  - '--domain-filter=domain2.example.com'
  - '--provider=azure-private-dns'

  • K8S platform: AKS
  • Ingress controller: ingress
  • ingressRoute: networking.k8s.io/v1

thecmdradama avatar May 02 '24 03:05 thecmdradama

Same issue happening for me too.

  • Kubernetes version: 1.29.3 with k3s on bare metal
  • external-dns version: v0.14.2
  • external-dns helm chart: v1.14.5

installed with the following flags:

export CLUSTER_NAME="vu-ams-02"
helm repo add external-dns https://kubernetes-sigs.github.io/external-dns/

export GOOGLE_PROJECT="MY_GOOGLE_PROJECT_REDACTED"
export CLUSTER_DOMAIN="${CLUSTER_NAME}.switchboard-oracles.xyz"

helm upgrade --install                                                 \
  external-dns external-dns/external-dns                               \
  -n external-dns --create-namespace                                   \
  --version 1.14.5                                                     \
  --set provider=google                                                \
  --set policy=sync                                                    \
  --set sources[0]="ingress"                                           \
  --set domainFilters[0]="${CLUSTER_DOMAIN}"                           \
  --set txtOwnerId="${CLUSTER_NAME}"                                   \
  --set extraArgs[0]='--google-project='"${GOOGLE_PROJECT}"            \
  --set extraVolumes[0].name="google-service-account"                  \
  --set extraVolumes[0].secret.secretName="external-dns"               \
  --set extraVolumeMounts[0].name="google-service-account"             \
  --set extraVolumeMounts[0].mountPath="/etc/secrets/service-account/" \
  --set env[0].name="GOOGLE_APPLICATION_CREDENTIALS"                   \
  --set env[0].value="/etc/secrets/service-account/credentials.json"

you can check the DNS record here:

$ dig +short A vu-ams-02.switchboard-oracles.xyz
136.244.110.43
$ dig +short TXT vu-ams-02.switchboard-oracles.xyz
"heritage=external-dns,external-dns/owner=vu-ams-02,external-dns/resource=ingress/switchboard-oracle-devnet/switchboard-ingress"

Let me know if you need any other hints or logs, or want me to dig in any direction in the code... this is kinda annoying-ish 😬

eldios avatar Jul 08 '24 22:07 eldios

This is still an issue as of the latest version (helm chart v8.3.7, app version v0.15.0). In my case it is actually removing all records despite the policy being set to upsert-only. We have 50+ records and this doesn't seem to be stable at all. Our dev environment is constantly deleting and recreating the records.

This issue shouldn't be closed.

Below are my helm chart values:

logLevel: debug
provider: google

serviceAccount:
  annotations:
    iam.gke.io/gcp-service-account: [email protected]

region: europe-west1

nodeSelector:
  kubernetes.io/os: linux
  iam.gke.io/gke-metadata-server-enabled: "true"

google:
  project: company-management
  zoneVisibility: public
ingressClassFilters:
- nginx

domainFilters:
- dev.company.com
- dev.company.com
policy: upsert-only
txtOwnerId: external-dns-public


sources:
- ingress

This is happening on one of our GCP clusters on dev. We are scared to use it on prod now.

Update: Found the culprit to be domain filters. For some reason the domainFilter also takes subdomains into consideration. We have company.com and dev.company.com managed by two different external-dns instances, but without a domain filter the external-dns for company.com also seems to update dev.company.com and delete the records continuously.
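
If the two instances really are fighting over the overlap, one possible mitigation (not verified against this setup) is to make the parent instance explicitly exclude the delegated subdomain; external-dns has an --exclude-domains flag for this. A sketch of the relevant args for the company.com instance (how this maps onto a particular Helm chart's values will differ):

    args:
      - --provider=google
      - --policy=upsert-only
      - --txt-owner-id=external-dns-public
      - --domain-filter=company.com          # parent instance keeps the parent zone
      - --exclude-domains=dev.company.com    # leave the subdomain to the other instance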

defyjoy avatar Sep 10 '24 01:09 defyjoy