external-dns external-dns quietly stops working

What happened:

external-dns quietly stops executing, does not error and does not recover until pod is manually deleted

What you expected to happen:

Either for external-dns to continue executing as normal, or for it to error and register the pod as unhealthy, prompting a replacement.

How to reproduce it (as minimally and precisely as possible):

Unable to reproduce consistently, the issue is intermittent.

Anything else we need to know?:

Originally we thought we may have been hitting an api limit with AWS so we added --aws-zones-cache-duration=24h as this does not change in our environment, this has made no difference however.

Environment:

External-DNS version (use external-dns --version): v20230327-v0.13.4
DNS provider: AWS Route53
Helm Chart Version: 1.12.2
Helm Chart Values:

env:
  - name: AWS_DEFAULT_REGION
    value: eu-west-1
  - name: AWS_STS_REGIONAL_ENDPOINTS
    value: regional
  - name: http_proxy
    value: exampe.proxy
  - name: https_proxy
    value: example.proxy
  - name: no_proxy
    value: 169.254.169.254,s3.eu-west-1.amazonaws.com,172.20.0.1,sts.eu-west-1.amazonaws.com

txtPrefix: "registry-"
policy: sync

extraArgs: [
  "--aws-zones-cache-duration=24h"
]

logLevel: debug

resources:
  limits:
    cpu: 100m
    memory: 100Mi
  requests:
    cpu: 100m
    memory: 50Mi

podSecurityContext:
  fsGroup: 65534

securityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534

image:
  repository: XXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/k8s.gcr.io/external-dns/external-dns

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXXX:role/external-dns

domainFilters: [ "example.zone" ]

Logs: Last logs before it stops working, note last log time 00:26 where pod is still "healthy" at 09:54

time="2023-04-27T00:25:19Z" level=debug msg="Using cached zones list"
time="2023-04-27T00:25:19Z" level=debug msg="Adding external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-cname-external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-cname-external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE external-dns-test-gbmhqlmrmvtxmgc.example.zone CNAME [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE registry-cname-external-dns-test-gbmhqlmrmvtxmgc.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE registry-external-dns-test-gbmhqlmrmvtxmgc.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE external-dns-test-rzosxxpsexpluhg.example.zone CNAME [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE registry-cname-external-dns-test-rzosxxpsexpluhg.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE registry-external-dns-test-rzosxxpsexpluhg.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="6 record(s) in zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX] were successfully updated"
time="2023-04-27T00:26:20Z" level=debug msg="Using cached zones list"

Apr 27 '23 09:04 davejab

I'm having this same issue. For context, I am running with these args:

"--source=ingress", "--provider=aws", "--aws-zone-type=public", "--aws-prefer-cname", "--registry=txt", "--txt-owner-id=external-dns-${var.name}", "--txt-prefix=external-dns"

Jun 03 '23 21:06 JackFlukinger

So you expect any kind of health log line?

Right now I don't see that's a bug but maybe you can explain it to us. Did an ingress change and external-dns didn't update the records?

Jun 05 '23 08:06 szuecs

@szuecs the issue is that the application stops processing with no indication of why and requires manual intervention (deleting of the pod) before it can start processing again. I would expect at the very least here that the pod would become aware of this and intervene before it became a problem.

Jun 05 '23 09:06 davejab

Also, when this happens, the livenessprobe and the readinessprobe never get tripped. /healthz on port 80 still merrily reports that everything is fine.

Jun 06 '23 12:06 bocan

I think we run it close to the same (no helm) and don't really see any issue like that in 200 clusters, that is why I wonder. I need more Information to understand what happens.

Jun 08 '23 07:06 szuecs

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jan 22 '24 03:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Feb 21 '24 03:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Mar 22 '24 04:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Mar 22 '24 04:03 k8s-ci-robot

/reopen

I see this behavior also, these are the logs from the single pod:

external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="config: {APIServerURL: KubeConfig: RequestTimeout:30s DefaultTargets:[] GlooNamespaces:[gloo-system] SkipperRouteGroupVersion:zalando.org
/v1 Sources:[service ingress] Namespace: AnnotationFilter: LabelFilter: IngressClassNames:[] FQDNTemplate: CombineFQDNAndAnnotation:false IgnoreHostnameAnnotation:false IgnoreIngressTLSSpec:false IgnoreIngressRu
lesSpec:false GatewayNamespace: GatewayLabelFilter: Compatibility: PublishInternal:false PublishHostIP:false AlwaysPublishNotReadyAddresses:false ConnectorSourceServer:localhost:8080 Provider:aws GoogleProject: 
GoogleBatchChangeSize:1000 GoogleBatchChangeInterval:1s GoogleZoneVisibility: DomainFilter:[] ExcludeDomains:[] RegexDomainFilter: RegexDomainExclusion: ZoneNameFilter:[] ZoneIDFilter:[] TargetNetFilter:[] Exclu
deTargetNets:[] AlibabaCloudConfigFile:/etc/kubernetes/alibaba-cloud.json AlibabaCloudZoneType: AWSZoneType: AWSZoneTagFilter:[] AWSAssumeRole: AWSAssumeRoleExternalID: AWSBatchChangeSize:1000 AWSBatchChangeInte
rval:1s AWSEvaluateTargetHealth:true AWSAPIRetries:3 AWSPreferCNAME:false AWSZoneCacheDuration:0s AWSSDServiceCleanup:false AWSDynamoDBRegion: AWSDynamoDBTable:external-dns AzureConfigFile:/etc/kubernetes/azure.
json AzureResourceGroup: AzureSubscriptionID: AzureUserAssignedIdentityClientID: BluecatDNSConfiguration: BluecatConfigFile:/etc/kubernetes/bluecat.json BluecatDNSView: BluecatGatewayHost: BluecatRootZone: Bluec
atDNSServerName: BluecatDNSDeployType:no-deploy BluecatSkipTLSVerify:false CloudflareProxied:false CloudflareDNSRecordsPerPage:100 CoreDNSPrefix:/skydns/ RcodezeroTXTEncrypt:false AkamaiServiceConsumerDomain: Ak
amaiClientToken: AkamaiClientSecret: AkamaiAccessToken: AkamaiEdgercPath: AkamaiEdgercSection: InfobloxGridHost: InfobloxWapiPort:443 InfobloxWapiUsername:admin InfobloxWapiPassword: InfobloxWapiVersion:2.3.1 In
fobloxSSLVerify:true InfobloxView: InfobloxMaxResults:0 InfobloxFQDNRegEx: InfobloxNameRegEx: InfobloxCreatePTR:false InfobloxCacheDuration:0 DynCustomerName: DynUsername: DynPassword: DynMinTTLSeconds:0 OCIConf
igFile:/etc/kubernetes/oci.yaml OCICompartmentOCID: OCIAuthInstancePrincipal:false InMemoryZones:[] OVHEndpoint:ovh-eu OVHApiRateLimit:20 PDNSServer:http://localhost:8081 PDNSAPIKey: PDNSSkipTLSVerify:false TLSC
A: TLSClientCert: TLSClientCertKey: Policy:sync Registry:txt TXTOwnerID:external-dns TXTPrefix: TXTSuffix: TXTEncryptEnabled:false TXTEncryptAESKey: Interval:1m0s MinEventSyncInterval:5s Once:false DryRun:false 
UpdateEvents:false LogFormat:text MetricsAddress::7979 LogLevel:info TXTCacheInterval:0s TXTWildcardReplacement: ExoscaleEndpoint: ExoscaleAPIKey: ExoscaleAPISecret: ExoscaleAPIEnvironment:api ExoscaleAPIZone:ch
-gva-2 CRDSourceAPIVersion:externaldns.k8s.io/v1alpha1 CRDSourceKind:DNSEndpoint ServiceTypeFilter:[] CFAPIEndpoint: CFUsername: CFPassword: ResolveServiceLoadBalancerHostname:false RFC2136Host: RFC2136Port:0 RF
C2136Zone: RFC2136Insecure:false RFC2136GSSTSIG:false RFC2136KerberosRealm: RFC2136KerberosUsername: RFC2136KerberosPassword: RFC2136TSIGKeyName: RFC2136TSIGSecret: RFC2136TSIGSecretAlg: RFC2136TAXFR:false RFC21
36MinTTL:0s RFC2136BatchChangeSize:50 NS1Endpoint: NS1IgnoreSSL:false NS1MinTTLSeconds:0 TransIPAccountName: TransIPPrivateKeyFile: DigitalOceanAPIPageSize:50 ManagedDNSRecordTypes:[A AAAA CNAME] ExcludeDNSRecor
dTypes:[] GoDaddyAPIKey: GoDaddySecretKey: GoDaddyTTL:0 GoDaddyOTE:false OCPRouterName: IBMCloudProxied:false IBMCloudConfigFile:/etc/kubernetes/ibmcloud.json TencentCloudConfigFile:/etc/kubernetes/tencent-cloud
.json TencentCloudZoneType: PiholeServer: PiholePassword: PiholeTLSInsecureSkipVerify:false PluralCluster: PluralProvider: WebhookProviderURL:http://localhost:8888 WebhookProviderReadTimeout:5s WebhookProviderWr
iteTimeout:10s WebhookServer:false}"                                                                                                                                                                               
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Instantiating new Kubernetes client"                                                                                                     
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Using inCluster-config based on serviceaccount-token"                                                                                    
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Created Kubernetes client https://172.20.0.1:443"

Killing the pod or deleting the deployment and recreating, doesn't solve the problem.

Mar 22 '24 06:03 TLmaK0

@TLmaK0: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

I see this behavior also, these are the logs from the single pod:

external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="config: {APIServerURL: KubeConfig: RequestTimeout:30s DefaultTargets:[] GlooNamespaces:[gloo-system] SkipperRouteGroupVersion:zalando.org
/v1 Sources:[service ingress] Namespace: AnnotationFilter: LabelFilter: IngressClassNames:[] FQDNTemplate: CombineFQDNAndAnnotation:false IgnoreHostnameAnnotation:false IgnoreIngressTLSSpec:false IgnoreIngressRu
lesSpec:false GatewayNamespace: GatewayLabelFilter: Compatibility: PublishInternal:false PublishHostIP:false AlwaysPublishNotReadyAddresses:false ConnectorSourceServer:localhost:8080 Provider:aws GoogleProject: 
GoogleBatchChangeSize:1000 GoogleBatchChangeInterval:1s GoogleZoneVisibility: DomainFilter:[] ExcludeDomains:[] RegexDomainFilter: RegexDomainExclusion: ZoneNameFilter:[] ZoneIDFilter:[] TargetNetFilter:[] Exclu
deTargetNets:[] AlibabaCloudConfigFile:/etc/kubernetes/alibaba-cloud.json AlibabaCloudZoneType: AWSZoneType: AWSZoneTagFilter:[] AWSAssumeRole: AWSAssumeRoleExternalID: AWSBatchChangeSize:1000 AWSBatchChangeInte
rval:1s AWSEvaluateTargetHealth:true AWSAPIRetries:3 AWSPreferCNAME:false AWSZoneCacheDuration:0s AWSSDServiceCleanup:false AWSDynamoDBRegion: AWSDynamoDBTable:external-dns AzureConfigFile:/etc/kubernetes/azure.
json AzureResourceGroup: AzureSubscriptionID: AzureUserAssignedIdentityClientID: BluecatDNSConfiguration: BluecatConfigFile:/etc/kubernetes/bluecat.json BluecatDNSView: BluecatGatewayHost: BluecatRootZone: Bluec
atDNSServerName: BluecatDNSDeployType:no-deploy BluecatSkipTLSVerify:false CloudflareProxied:false CloudflareDNSRecordsPerPage:100 CoreDNSPrefix:/skydns/ RcodezeroTXTEncrypt:false AkamaiServiceConsumerDomain: Ak
amaiClientToken: AkamaiClientSecret: AkamaiAccessToken: AkamaiEdgercPath: AkamaiEdgercSection: InfobloxGridHost: InfobloxWapiPort:443 InfobloxWapiUsername:admin InfobloxWapiPassword: InfobloxWapiVersion:2.3.1 In
fobloxSSLVerify:true InfobloxView: InfobloxMaxResults:0 InfobloxFQDNRegEx: InfobloxNameRegEx: InfobloxCreatePTR:false InfobloxCacheDuration:0 DynCustomerName: DynUsername: DynPassword: DynMinTTLSeconds:0 OCIConf
igFile:/etc/kubernetes/oci.yaml OCICompartmentOCID: OCIAuthInstancePrincipal:false InMemoryZones:[] OVHEndpoint:ovh-eu OVHApiRateLimit:20 PDNSServer:http://localhost:8081 PDNSAPIKey: PDNSSkipTLSVerify:false TLSC
A: TLSClientCert: TLSClientCertKey: Policy:sync Registry:txt TXTOwnerID:external-dns TXTPrefix: TXTSuffix: TXTEncryptEnabled:false TXTEncryptAESKey: Interval:1m0s MinEventSyncInterval:5s Once:false DryRun:false 
UpdateEvents:false LogFormat:text MetricsAddress::7979 LogLevel:info TXTCacheInterval:0s TXTWildcardReplacement: ExoscaleEndpoint: ExoscaleAPIKey: ExoscaleAPISecret: ExoscaleAPIEnvironment:api ExoscaleAPIZone:ch
-gva-2 CRDSourceAPIVersion:externaldns.k8s.io/v1alpha1 CRDSourceKind:DNSEndpoint ServiceTypeFilter:[] CFAPIEndpoint: CFUsername: CFPassword: ResolveServiceLoadBalancerHostname:false RFC2136Host: RFC2136Port:0 RF
C2136Zone: RFC2136Insecure:false RFC2136GSSTSIG:false RFC2136KerberosRealm: RFC2136KerberosUsername: RFC2136KerberosPassword: RFC2136TSIGKeyName: RFC2136TSIGSecret: RFC2136TSIGSecretAlg: RFC2136TAXFR:false RFC21
36MinTTL:0s RFC2136BatchChangeSize:50 NS1Endpoint: NS1IgnoreSSL:false NS1MinTTLSeconds:0 TransIPAccountName: TransIPPrivateKeyFile: DigitalOceanAPIPageSize:50 ManagedDNSRecordTypes:[A AAAA CNAME] ExcludeDNSRecor
dTypes:[] GoDaddyAPIKey: GoDaddySecretKey: GoDaddyTTL:0 GoDaddyOTE:false OCPRouterName: IBMCloudProxied:false IBMCloudConfigFile:/etc/kubernetes/ibmcloud.json TencentCloudConfigFile:/etc/kubernetes/tencent-cloud
.json TencentCloudZoneType: PiholeServer: PiholePassword: PiholeTLSInsecureSkipVerify:false PluralCluster: PluralProvider: WebhookProviderURL:http://localhost:8888 WebhookProviderReadTimeout:5s WebhookProviderWr
iteTimeout:10s WebhookServer:false}"                                                                                                                                                                               
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Instantiating new Kubernetes client"                                                                                                     
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Using inCluster-config based on serviceaccount-token"                                                                                    
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Created Kubernetes client https://172.20.0.1:443"

Killing the pod or deleting the deployment and recreating, doesn't solve the problem.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Mar 22 '24 06:03 k8s-ci-robot

external-dns external-dns copied to clipboard

external-dns quietly stops working

external-dns
external-dns copied to clipboard