
Pod IPs leak into the cluster API DNS record via dns-controller


/kind bug

1. What kops version are you running? The command kops version will display this information. Version 1.23.3 (git-70da5cc4b0aa56088952f7792e53ff4ee486a275)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag. Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.12", GitCommit:"b058e1760c79f46a834ba59bd7a3486ecf28237d", GitTreeState:"clean", BuildDate:"2022-07-13T14:53:39Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using? aws

4. What commands did you run? What is the simplest way to reproduce this issue? See below

5. What happened after the commands executed? See below

6. What did you expect to happen? See below

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information. I will provide a test one if needed.

8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or into a gist and provide the gist link here.

9. Anything else we need to know?

Hi, after we started rolling out kops 1.23.3 to our working environments, we began receiving alerts about etcd endpoints not being available. Upon investigation, we discovered that the dns-controller component adds random pod IPs to the api.internal DNS record. The issue is present with dns-controller versions 1.23.x and 1.24.x; the last version that works properly for us is 1.22.6. Logs:

kubectl logs dns-controller-58c7db7548-4gv24 -n kube-system | grep api.internal.domain.net
I0804 09:57:42.325660       1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.104.196 10.12.107.55 10.12.3.70 10.12.74.5 10.12.75.5]
I0804 09:57:47.704012       1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.107.55 10.12.3.70]
I0804 10:03:38.499342       1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.104.196 10.12.107.55 10.12.3.70 10.12.74.5 10.12.75.5]
I0804 10:03:43.932347       1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.107.55 10.12.3.70]
I0804 10:04:14.658230       1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.105.120 10.12.105.64 10.12.107.55 10.12.124.110 10.12.3.70 10.12.34.120 10.12.46.143]
kubectl get pods -o wide -A | grep '10.12.105.120\|10.12.105.64\|10.12.124.110\|10.12.34.120\|10.12.46.143'
kube-system                      aws-load-balancer-aws-load-balancer-controller-794948f67c-2fjqm      1/1     Running             7 (24h ago)     24h     10.12.105.120   ip-10-12-3-70.eu-west-2.compute.internal      <none>           <none>
kube-system                      ebs-csi-controller-5c85c9898c-jq88w                                  6/6     Running             12 (24h ago)    24h     10.12.124.110   ip-10-12-3-70.eu-west-2.compute.internal      <none>           <none>
kube-system                      ebs-csi-node-fh99b                                                   3/3     Running             7 (24h ago)     25h     10.12.46.143    ip-10-12-3-70.eu-west-2.compute.internal      <none>           <none>
kubernetes-dashboard             dashboard-metrics-scraper-76585494d8-2c5lq                           1/1     Running             6 (24h ago)     24h     10.12.34.120    ip-10-12-3-70.eu-west-2.compute.internal      <none>           <none>
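
For reference, what the record actually resolves to can also be checked directly (the domain here is anonymized):
dig +short api.internal.domain.net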

Could you please check and fix this issue? If the provided information is not enough, please let me know and I will try to reproduce this on a clean cluster. The DNS parameters we set in the cluster template are:

  dnsZone: {{$.kubernetes_cluster_name.value}}
  topology:
    dns:
      type: Private
  kubeDNS:
    provider: CoreDNS

ValeriiVozniuk · Aug 12 '22

Do you happen to run a self-managed CNI? And are you running the AWS VPC CNI?

olemarkus · Aug 15 '22

Calico is deployed on the nodes from a Helm chart, and the AWS VPC CNI params are:

    amazonvpc:
      env:
      - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
        value: "true"
      - name: WARM_IP_TARGET
        value: "1"

ValeriiVozniuk · Aug 15 '22

I think something in your custom configuration is confusing dns-controller into thinking there is more than one IP per node, and it is therefore adding all of them. So the IPs you see above are probably all the IPs set in the status of the node objects.
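
You can check which addresses the node objects report with something like:
kubectl get nodes -o custom-columns='NAME:.metadata.name,ADDRESSES:.status.addresses[*].address'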

olemarkus · Aug 15 '22

But why was it not confused for many versions before? :)

ValeriiVozniuk · Aug 15 '22

Updated Calico to the latest version; nothing changed. Tested intermediate dns-controller builds, and the issue first appeared in 1.23.0-alpha.2. Suspected reason: https://github.com/kubernetes/kops/pull/12640
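
An intermediate build can be swapped in for testing roughly like this (the image registry/path and container name are assumptions and may differ per kops release):
kubectl -n kube-system set image deployment/dns-controller dns-controller=registry.k8s.io/kops/dns-controller:1.23.0-alpha.2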

ValeriiVozniuk · Aug 17 '22

/cc @hakman @johngmyers

olemarkus · Aug 17 '22

To diagnose further, we would need the YAML for the kube-apiserver pods and the control plane nodes they run on.
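
Something like the following should capture that (label selectors may vary by version and setup; on older clusters the control-plane role label may be node-role.kubernetes.io/master):
kubectl -n kube-system get pods -l k8s-app=kube-apiserver -o yaml
kubectl get nodes -l node-role.kubernetes.io/control-plane -o yaml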

johngmyers · Aug 19 '22

If it's adding "random" pods' IPs, we would also want the YAML of the pods whose IPs were incorrectly added.

johngmyers · Aug 19 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Nov 17 '22

/lifecycle stale

ValeriiVozniuk · Nov 18 '22

/remove-lifecycle stale

ValeriiVozniuk · Nov 18 '22

We are trying to get rid of Calico to see if it helps, but there has not been much progress, and there is no free time to create a proper reproduction scenario without too many of our cluster specifics.

ValeriiVozniuk · Nov 18 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Feb 16 '23

/remove-lifecycle stale

ValeriiVozniuk · Feb 16 '23

Finally removed Calico from the master nodes today; unfortunately, that didn't help:

I0216 17:06:31.023393       1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.224.200 10.12.224.43 10.12.225.154 10.12.225.30 10.12.225.72 10.12.225.76 10.12.226.23 10.12.227.243]
I0216 17:06:36.763119       1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.225.154 10.12.225.30 10.12.227.243]

kubectl get pods -o wide -A | grep '10.12.224.200\|10.12.224.43\|10.12.225.72\|10.12.225.76\|10.12.226.23'
devex-infra          aws-node-termination-handler-vs2d2                                      1/1     Running   0               5h42m   10.12.225.72    ip-10-12-227-243.eu-west-2.compute.internal   <none>           <none>
kube-system          aws-load-balancer-6845f45db6-m8bws                                      1/1     Running   0               5h47m   10.12.224.200   ip-10-12-227-243.eu-west-2.compute.internal   <none>           <none>
kube-system          ebs-csi-controller-6d7446dcb5-5g8gx                                     6/6     Running   0               5h9m    10.12.225.76    ip-10-12-227-243.eu-west-2.compute.internal   <none>           <none>
kube-system          ebs-csi-node-spzdz                                                      3/3     Running   0               5h42m   10.12.226.23    ip-10-12-227-243.eu-west-2.compute.internal   <none>           <none>

Will try to set up a clean cluster to reproduce the issue, but that won't be fast.

ValeriiVozniuk · Feb 16 '23

And the issue is reproducible up to 1.27.0-alpha.1

ValeriiVozniuk · Feb 16 '23

Without the info requested above, there isn't anything we can do here.

olemarkus · Feb 17 '23

Caused by https://github.com/kubernetes/cloud-provider-aws/issues/349; resolved by moving to the external CCM 1.24.
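
For anyone else hitting this: enabling the external AWS cloud controller manager in the kops cluster spec is roughly the following, placed under spec (exact fields depend on the kops version, so treat this as a sketch):

  cloudControllerManager: {}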

ValeriiVozniuk · May 17 '23