Failed to list hosted zones after updating to v0.15.0

Open jamescjchan opened this issue 1 year ago • 12 comments

I recently updated to external-dns v0.15.0 from v0.14.2, but I'm seeing this error: [image]

This is my manifest

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: external-dns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
  - apiGroups: [""]
    resources: ["services","endpoints","pods"]
    verbs: ["get","watch","list"]
  - apiGroups: ["extensions","networking.k8s.io","getambassador.io"]
    resources: ["ingresses","hosts"]
    verbs: ["get","watch","list"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["list","watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
  - kind: ServiceAccount
    name: external-dns
    namespace: external-dns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: external-dns
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: "registry.k8s.io/external-dns/external-dns:v0.15.0"
        args:
        - --source=service
        - --source=ingress
        - --domain-filter=my.hostedzone.net
        - --provider=aws
        - --aws-zone-type=private # only look at private hosted zones (valid values are public, private or no value for both)
        - --registry=txt
        - --txt-owner-id=my-eks-cluster
      securityContext:
        fsGroup: 65534 # For ExternalDNS to be able to read Kubernetes and AWS token files

Could you please advise?

jamescjchan avatar Oct 31 '24 21:10 jamescjchan

I also tried a Helm deployment through Terraform but still hit the same issue.

resource "helm_release" "external_dns" {
  name = "external-dns"

  repository = "https://kubernetes-sigs.github.io/external-dns"
  chart      = "external-dns"
  create_namespace = true
  namespace  = "external-dns"
  version    = "1.15.0"

  set {
    name  = "serviceAccount.name"
    value = "external-dns"
  }

  set {
    # set values must be strings, so index into the chart's list value
    name  = "domainFilters[0]"
    value = var.route53_zone
  }

  set {
    name  = "txtOwnerId"
    value = data.aws_eks_cluster.cluster.name
  }
}

jamescjchan avatar Nov 13 '24 22:11 jamescjchan

It looks like your external-dns deployment is failing to reach https://route53.amazonaws.com to list hosted zones. I would double-check your networking configuration. Perhaps a security group isn't allowing this traffic, or something similar?

mjlshen avatar Dec 06 '24 22:12 mjlshen

Is this something new in v0.15.0? I have no issues with v0.14.2.

jamescjchan avatar Dec 06 '24 22:12 jamescjchan

No, it should be unrelated. I would strongly suspect an AWS networking misconfiguration. Are you able to exec into the external-dns pod? Do you get results similar to this?

❯ curl -vkL https://route53.amazonaws.com/2013-04-01/hostedzone
* Host route53.amazonaws.com:443 was resolved.
* IPv6: (none)
* IPv4: 54.239.31.187
*   Trying 54.239.31.187:443...
* Connected to route53.amazonaws.com (54.239.31.187) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-AES128-GCM-SHA256 / [blank] / UNDEF
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=route53.amazonaws.com
*  start date: Aug 31 00:00:00 2024 GMT
*  expire date: Aug 13 23:59:59 2025 GMT
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
* using HTTP/1.x
> GET /2013-04-01/hostedzone HTTP/1.1
> Host: route53.amazonaws.com
> User-Agent: curl/8.7.1
> Accept: */*
> 
* Request completely sent off
< HTTP/1.1 403 Forbidden
< x-amzn-RequestId: 586c0b6d-d166-4996-951c-a3cea16c0629
< Content-Type: text/xml
< Content-Length: 297
< Date: Fri, 06 Dec 2024 23:00:22 GMT
< 
<?xml version="1.0"?>
* Connection #0 to host route53.amazonaws.com left intact
<ErrorResponse xmlns="https://route53.amazonaws.com/doc/2013-04-01/"><Error><Type>Sender</Type><Code>MissingAuthenticationToken</Code><Message>Request is missing Authentication Token</Message></Error><RequestId>586c0b6d-d166-4996-951c-a3cea16c0629</RequestId></ErrorResponse>

mjlshen avatar Dec 06 '24 23:12 mjlshen

Unfortunately, the container doesn't seem to have a shell or curl installed.

jamescjchan avatar Dec 07 '24 00:12 jamescjchan
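
Since the external-dns image has no shell or curl, the same check could be run from a separate debug pod instead. The following is a minimal sketch, assuming a throwaway pod built from any image that ships Python and scheduled in the same namespace and subnets; `classify` and `probe` are hypothetical helper names, not part of external-dns. It only tests the network path (DNS, TCP, TLS), not credentials.

```python
import sys
import urllib.error
import urllib.request


def classify(status, body):
    """Interpret an unauthenticated response from the Route 53 API."""
    if status == 403 and "MissingAuthenticationToken" in body:
        # Same result as the curl output above: the request reached Route 53
        # and was rejected only because it was unsigned.
        return "reachable"
    return f"unexpected response (HTTP {status})"


def probe(url, timeout=5.0):
    """Send one unsigned GET and report whether the endpoint is reachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # 4xx/5xx still prove that DNS, TCP and TLS all worked.
        return classify(err.code, err.read().decode("utf-8", "replace"))
    except OSError as err:
        # URLError, timeouts and TLS failures are all OSError subclasses.
        return f"unreachable: {err}"


# Only runs when a URL is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(probe(sys.argv[1]))
```

Saved under a hypothetical name such as probe.py, it would be invoked as `python3 probe.py https://route53.amazonaws.com/2013-04-01/hostedzone`. A result of "reachable" would point away from networking; "unreachable: ..." would include the underlying error.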

I have some updates. It looks like this is only happening in the us-east-1 region. I have similar deployments in ap-southeast-1, eu-central-1, and ca-central-1, but they all came up normally.

jamescjchan avatar Jan 02 '25 15:01 jamescjchan

/help /area provider/aws

Does this mean Helm chart version 0.14.2 works with no issues in us-east-1? What about AWS IAM permissions? Are there any restrictions? It is most likely safe to share the IAM policy JSON.

It might be worth writing a simple service, deploying it to us-east-1, and using it to debug the networking issues.

What about environment variables? I'm not sure what your setup is, but for STS, what do the AWS_ env vars look like?

    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: eu-west-1
    - name: AWS_REGION
      value: eu-west-1
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::xsdfasdfasdfasdf

ivankatliarchuk avatar Feb 01 '25 18:02 ivankatliarchuk
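
One way to answer the environment variable question without a shell in the container is to read the pod spec instead. The sketch below assumes the pod JSON was saved with `kubectl get pod ... -o json`; `aws_env_by_container` is a hypothetical helper name. It only sees variables declared in the pod spec, including any injected by admission webhooks, and not anything set inside the image itself.

```python
import json
import sys


def aws_env_by_container(pod):
    """Map each container name to its AWS_* env entries from a pod manifest."""
    result = {}
    for container in pod.get("spec", {}).get("containers", []):
        entries = []
        for env in container.get("env", []):
            name = env.get("name", "")
            if not name.startswith("AWS_"):
                continue
            if "valueFrom" in env:
                # Value comes from a Secret, ConfigMap or field reference.
                value = "<set via valueFrom>"
            else:
                value = env.get("value", "")
            entries.append((name, value))
        result[container["name"]] = entries
    return result


# Only runs when a path to the saved pod JSON is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        pod = json.load(handle)
    for container, entries in aws_env_by_container(pod).items():
        print(container)
        for name, value in entries:
            print(f"  {name}={value}")
```

For example, with hypothetical file names: `kubectl -n external-dns get pod <pod-name> -o json > pod.json`, then `python3 aws_env.py pod.json`.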

@ivankatliarchuk: This request has been marked as needing help from a contributor.

k8s-ci-robot avatar Feb 01 '25 18:02 k8s-ci-robot

AWS_DEFAULT_REGION is the only env variable set when I deployed both 0.14.2 and 0.15.0 through the Helm chart. Please see the attached role policy. I used Pod Identity to associate the role with the service account assigned to the pod.

ExternalDNS.json

jamescjchan avatar Feb 03 '25 18:02 jamescjchan

Another finding: if I incorrectly set AWS_DEFAULT_REGION to us-east-1 in an AWS account servicing a different region (e.g. ap-southeast-2), external-dns runs without the "failed to list hosted zones" error.

jamescjchan avatar Mar 17 '25 16:03 jamescjchan

We are still seeing the error with the v0.18.0 release. Since this is a private VPC, does a regional VPC endpoint need to be set up to allow access? Please advise.

jamescjchan avatar Jul 25 '25 22:07 jamescjchan
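
As a first data point for the private-VPC question, one could check whether the resolver used by the pod returns a public or a private address for route53.amazonaws.com. This is a minimal sketch, assuming it is run from a debug pod in the same VPC; `describe` and `resolve` are hypothetical helper names. It does not tell you whether a VPC endpoint is required, only where the name currently points.

```python
import ipaddress
import socket
import sys


def describe(address):
    """Label an IP address as private (e.g. RFC 1918 ranges) or public."""
    kind = "private" if ipaddress.ip_address(address).is_private else "public"
    return f"{address} ({kind})"


def resolve(host, port=443):
    """Return the sorted, de-duplicated addresses the resolver hands back."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Only runs when a hostname is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for address in resolve(sys.argv[1]):
        print(describe(address))
```

Comparing the output between us-east-1 and one of the working regions might show whether name resolution differs between them.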

AWS is one of the supported providers. Please either raise the question in the AWS SDK repository (https://github.com/aws/aws-sdk-go-v2/) or, if you’re an AWS paid customer, open a support case directly with AWS or hire an SRE to debug. They should be able to advise based on your infrastructure requirements.

From the ExternalDNS side, there is no visibility into your exact environment constraints (VPC layout, subnets, resolvers, security groups, ACLs, and so on), and the behavior ultimately depends on how the AWS SDK and your environment are configured.

ivankatliarchuk avatar Dec 07 '25 12:12 ivankatliarchuk