external-dns
external-dns quietly stops working
What happened:
external-dns quietly stops executing, does not log an error, and does not recover until the pod is manually deleted.
What you expected to happen:
Either for external-dns to continue executing as normal, or for it to error and register the pod as unhealthy, prompting a replacement.
How to reproduce it (as minimally and precisely as possible):
Unable to reproduce consistently; the issue is intermittent.
Anything else we need to know?:
Originally we thought we might be hitting an AWS API limit, so we added --aws-zones-cache-duration=24h, since the zone list does not change in our environment. This made no difference, however.
Environment:
- External-DNS version (use external-dns --version): v20230327-v0.13.4
- DNS provider: AWS Route53
- Helm Chart Version: 1.12.2
- Helm Chart Values:
env:
  - name: AWS_DEFAULT_REGION
    value: eu-west-1
  - name: AWS_STS_REGIONAL_ENDPOINTS
    value: regional
  - name: http_proxy
    value: example.proxy
  - name: https_proxy
    value: example.proxy
  - name: no_proxy
    value: 169.254.169.254,s3.eu-west-1.amazonaws.com,172.20.0.1,sts.eu-west-1.amazonaws.com
txtPrefix: "registry-"
policy: sync
extraArgs: [
  "--aws-zones-cache-duration=24h"
]
logLevel: debug
resources:
  limits:
    cpu: 100m
    memory: 100Mi
  requests:
    cpu: 100m
    memory: 50Mi
podSecurityContext:
  fsGroup: 65534
securityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534
image:
  repository: XXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/k8s.gcr.io/external-dns/external-dns
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXXX:role/external-dns
domainFilters: [ "example.zone" ]
- Logs: Last logs before it stops working. Note that the last log line is at 00:26, while the pod is still reported as "healthy" at 09:54.
time="2023-04-27T00:25:19Z" level=debug msg="Using cached zones list"
time="2023-04-27T00:25:19Z" level=debug msg="Adding external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-cname-external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-cname-external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE external-dns-test-gbmhqlmrmvtxmgc.example.zone CNAME [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE registry-cname-external-dns-test-gbmhqlmrmvtxmgc.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE registry-external-dns-test-gbmhqlmrmvtxmgc.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE external-dns-test-rzosxxpsexpluhg.example.zone CNAME [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE registry-cname-external-dns-test-rzosxxpsexpluhg.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE registry-external-dns-test-rzosxxpsexpluhg.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="6 record(s) in zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX] were successfully updated"
time="2023-04-27T00:26:20Z" level=debug msg="Using cached zones list"
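The gap described above (last log line at 00:26, pod still "healthy" at 09:54) can be detected from outside the process. The following is a minimal sketch, assuming logrus-style text lines with a `time="..."` field like the ones in this issue; the helper names and the 5-minute threshold are hypothetical, chosen only because the logs show roughly one sync per minute.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the time="..." field used in the log lines quoted in this issue.
TIME_RE = re.compile(r'time="([^"]+)"')


def last_log_time(lines):
    """Return the latest timestamp found in the given log lines, or None."""
    latest = None
    for line in lines:
        match = TIME_RE.search(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
        ts = ts.replace(tzinfo=timezone.utc)
        if latest is None or ts > latest:
            latest = ts
    return latest


def is_stalled(lines, now, max_silence=timedelta(minutes=5)):
    """True if no log line was seen, or the newest one is older than max_silence."""
    latest = last_log_time(lines)
    return latest is None or now - latest > max_silence


# The last two log lines from the report above.
sample = [
    'time="2023-04-27T00:25:19Z" level=info msg="6 record(s) in zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX] were successfully updated"',
    'time="2023-04-27T00:26:20Z" level=debug msg="Using cached zones list"',
]
# The time at which the pod was still reported as healthy.
checked_at = datetime(2023, 4, 27, 9, 54, tzinfo=timezone.utc)

print(is_stalled(sample, checked_at))  # True: more than nine hours of silence
```

A check like this only tells you that the process went quiet. It does not explain why, so it is a workaround for alerting rather than a fix.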
I'm having this same issue. For context, I am running with these args:
"--source=ingress", "--provider=aws", "--aws-zone-type=public", "--aws-prefer-cname", "--registry=txt", "--txt-owner-id=external-dns-${var.name}", "--txt-prefix=external-dns"
So you expect some kind of health log line?
Right now I don't see that this is a bug, but maybe you can explain it to us. Did an ingress change without external-dns updating the records?
@szuecs the issue is that the application stops processing with no indication of why, and requires manual intervention (deleting the pod) before it can start processing again. At the very least, I would expect the pod to become aware of this and intervene before it becomes a problem.
Also, when this happens, neither the liveness probe nor the readiness probe is ever tripped. /healthz on port 80 still merrily reports that everything is fine.
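One common way to make this kind of stall visible to a liveness probe is a heartbeat: the sync loop records each successful run, and the health endpoint fails once the last success is too old. The sketch below illustrates the idea only. It is not how external-dns implements /healthz, and the class name, method names, and 180-second limit are all hypothetical.

```python
import time


class SyncHeartbeat:
    """Tracks the last successful sync and derives a health status from its age."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._last_ok = clock()

    def record_success(self):
        # Called by the sync loop after each successful reconciliation.
        self._last_ok = self._clock()

    def healthz(self):
        # Returns an (HTTP status, message) pair for a health handler to serve.
        age = self._clock() - self._last_ok
        if age > self._max_age:
            return 503, f"last successful sync {age:.0f}s ago"
        return 200, "ok"


# Demonstration with a fake clock so no real waiting is needed.
fake_now = [0.0]
heartbeat = SyncHeartbeat(max_age_seconds=180, clock=lambda: fake_now[0])

fake_now[0] = 60.0
heartbeat.record_success()
status_after_sync = heartbeat.healthz()

fake_now[0] = 60.0 + 9 * 3600  # the loop silently stops for nine hours
status_after_stall = heartbeat.healthz()

print(status_after_sync)
print(status_after_stall)
```

With a check like this behind the probe endpoint, the kubelet would restart the container once the loop goes quiet, which is the "error and register the pod as unhealthy" behaviour requested in the report.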
I think we run it in close to the same way (without Helm) and don't see any issue like that across 200 clusters, which is why I wonder. I need more information to understand what is happening.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
I see this behavior too. These are the logs from the single pod:
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="config: {APIServerURL: KubeConfig: RequestTimeout:30s DefaultTargets:[] GlooNamespaces:[gloo-system] SkipperRouteGroupVersion:zalando.org/v1 Sources:[service ingress] Namespace: AnnotationFilter: LabelFilter: IngressClassNames:[] FQDNTemplate: CombineFQDNAndAnnotation:false IgnoreHostnameAnnotation:false IgnoreIngressTLSSpec:false IgnoreIngressRulesSpec:false GatewayNamespace: GatewayLabelFilter: Compatibility: PublishInternal:false PublishHostIP:false AlwaysPublishNotReadyAddresses:false ConnectorSourceServer:localhost:8080 Provider:aws GoogleProject: GoogleBatchChangeSize:1000 GoogleBatchChangeInterval:1s GoogleZoneVisibility: DomainFilter:[] ExcludeDomains:[] RegexDomainFilter: RegexDomainExclusion: ZoneNameFilter:[] ZoneIDFilter:[] TargetNetFilter:[] ExcludeTargetNets:[] AlibabaCloudConfigFile:/etc/kubernetes/alibaba-cloud.json AlibabaCloudZoneType: AWSZoneType: AWSZoneTagFilter:[] AWSAssumeRole: AWSAssumeRoleExternalID: AWSBatchChangeSize:1000 AWSBatchChangeInterval:1s AWSEvaluateTargetHealth:true AWSAPIRetries:3 AWSPreferCNAME:false AWSZoneCacheDuration:0s AWSSDServiceCleanup:false AWSDynamoDBRegion: AWSDynamoDBTable:external-dns AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: AzureSubscriptionID: AzureUserAssignedIdentityClientID: BluecatDNSConfiguration: BluecatConfigFile:/etc/kubernetes/bluecat.json BluecatDNSView: BluecatGatewayHost: BluecatRootZone: BluecatDNSServerName: BluecatDNSDeployType:no-deploy BluecatSkipTLSVerify:false CloudflareProxied:false CloudflareDNSRecordsPerPage:100 CoreDNSPrefix:/skydns/ RcodezeroTXTEncrypt:false AkamaiServiceConsumerDomain: AkamaiClientToken: AkamaiClientSecret: AkamaiAccessToken: AkamaiEdgercPath: AkamaiEdgercSection: InfobloxGridHost: InfobloxWapiPort:443 InfobloxWapiUsername:admin InfobloxWapiPassword: InfobloxWapiVersion:2.3.1 InfobloxSSLVerify:true InfobloxView: InfobloxMaxResults:0 InfobloxFQDNRegEx: InfobloxNameRegEx: InfobloxCreatePTR:false InfobloxCacheDuration:0 DynCustomerName: DynUsername: DynPassword: DynMinTTLSeconds:0 OCIConfigFile:/etc/kubernetes/oci.yaml OCICompartmentOCID: OCIAuthInstancePrincipal:false InMemoryZones:[] OVHEndpoint:ovh-eu OVHApiRateLimit:20 PDNSServer:http://localhost:8081 PDNSAPIKey: PDNSSkipTLSVerify:false TLSCA: TLSClientCert: TLSClientCertKey: Policy:sync Registry:txt TXTOwnerID:external-dns TXTPrefix: TXTSuffix: TXTEncryptEnabled:false TXTEncryptAESKey: Interval:1m0s MinEventSyncInterval:5s Once:false DryRun:false UpdateEvents:false LogFormat:text MetricsAddress::7979 LogLevel:info TXTCacheInterval:0s TXTWildcardReplacement: ExoscaleEndpoint: ExoscaleAPIKey: ExoscaleAPISecret: ExoscaleAPIEnvironment:api ExoscaleAPIZone:ch-gva-2 CRDSourceAPIVersion:externaldns.k8s.io/v1alpha1 CRDSourceKind:DNSEndpoint ServiceTypeFilter:[] CFAPIEndpoint: CFUsername: CFPassword: ResolveServiceLoadBalancerHostname:false RFC2136Host: RFC2136Port:0 RFC2136Zone: RFC2136Insecure:false RFC2136GSSTSIG:false RFC2136KerberosRealm: RFC2136KerberosUsername: RFC2136KerberosPassword: RFC2136TSIGKeyName: RFC2136TSIGSecret: RFC2136TSIGSecretAlg: RFC2136TAXFR:false RFC2136MinTTL:0s RFC2136BatchChangeSize:50 NS1Endpoint: NS1IgnoreSSL:false NS1MinTTLSeconds:0 TransIPAccountName: TransIPPrivateKeyFile: DigitalOceanAPIPageSize:50 ManagedDNSRecordTypes:[A AAAA CNAME] ExcludeDNSRecordTypes:[] GoDaddyAPIKey: GoDaddySecretKey: GoDaddyTTL:0 GoDaddyOTE:false OCPRouterName: IBMCloudProxied:false IBMCloudConfigFile:/etc/kubernetes/ibmcloud.json TencentCloudConfigFile:/etc/kubernetes/tencent-cloud.json TencentCloudZoneType: PiholeServer: PiholePassword: PiholeTLSInsecureSkipVerify:false PluralCluster: PluralProvider: WebhookProviderURL:http://localhost:8888 WebhookProviderReadTimeout:5s WebhookProviderWriteTimeout:10s WebhookServer:false}"
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Instantiating new Kubernetes client"
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Using inCluster-config based on serviceaccount-token"
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Created Kubernetes client https://172.20.0.1:443"
Killing the pod, or deleting and recreating the deployment, doesn't solve the problem.
@TLmaK0: You can't reopen an issue/PR unless you authored it or you are a collaborator.