Pod IPs leak into cluster API DNS record via dns-controller
/kind bug
1. What kops version are you running? The command kops version will display this information.
Version 1.23.3 (git-70da5cc4b0aa56088952f7792e53ff4ee486a275)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.12", GitCommit:"b058e1760c79f46a834ba59bd7a3486ecf28237d", GitTreeState:"clean", BuildDate:"2022-07-13T14:53:39Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using? aws
4. What commands did you run? What is the simplest way to reproduce this issue? See below
5. What happened after the commands executed? See below
6. What did you expect to happen? See below
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
Will provide a test one if needed.
8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or into a gist and provide the gist link here.
9. Anything else we need to know?
Hi, after we started rolling out kops 1.23.3 to our working environments, we started receiving alerts about etcd endpoints not being available. Upon checking, we discovered that the dns-controller component adds random pod IPs to the api.internal DNS record. The issue is present with dns-controller 1.23.x and 1.24.x; the last version working properly for us is 1.22.6. Logs:
kubectl logs dns-controller-58c7db7548-4gv24 -n kube-system | grep api.internal.domain.net
I0804 09:57:42.325660 1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.104.196 10.12.107.55 10.12.3.70 10.12.74.5 10.12.75.5]
I0804 09:57:47.704012 1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.107.55 10.12.3.70]
I0804 10:03:38.499342 1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.104.196 10.12.107.55 10.12.3.70 10.12.74.5 10.12.75.5]
I0804 10:03:43.932347 1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.107.55 10.12.3.70]
I0804 10:04:14.658230 1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.1.42 10.12.105.120 10.12.105.64 10.12.107.55 10.12.124.110 10.12.3.70 10.12.34.120 10.12.46.143]
kubectl get pods -o wide -A | grep '10.12.105.120\|10.12.105.64\|10.12.124.110\|10.12.34.120\|10.12.46.143'
kube-system aws-load-balancer-aws-load-balancer-controller-794948f67c-2fjqm 1/1 Running 7 (24h ago) 24h 10.12.105.120 ip-10-12-3-70.eu-west-2.compute.internal <none> <none>
kube-system ebs-csi-controller-5c85c9898c-jq88w 6/6 Running 12 (24h ago) 24h 10.12.124.110 ip-10-12-3-70.eu-west-2.compute.internal <none> <none>
kube-system ebs-csi-node-fh99b 3/3 Running 7 (24h ago) 25h 10.12.46.143 ip-10-12-3-70.eu-west-2.compute.internal <none> <none>
kubernetes-dashboard dashboard-metrics-scraper-76585494d8-2c5lq 1/1 Running 6 (24h ago) 24h 10.12.34.120 ip-10-12-3-70.eu-west-2.compute.internal <none> <none>
Could you please check and fix this issue? If the provided information is not enough, please let me know and I will try to reproduce this on a clean cluster. The params we set for DNS in the cluster template are:
dnsZone: {{$.kubernetes_cluster_name.value}}
topology:
  dns:
    type: Private
kubeDNS:
  provider: CoreDNS
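As a sanity check on what actually ends up in the private zone, the record can be read back from Route 53 directly; a sketch, where the hosted zone ID and domain below are placeholders for your own values:
aws route53 list-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --query "ResourceRecordSets[?Name=='api.internal.domain.net.']"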
Do you happen to run a self-managed CNI? And are you running the AWS VPC CNI?
Calico is deployed on the nodes from a Helm chart, and the AWS VPC CNI params are:
amazonvpc:
  env:
  - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
    value: "true"
  - name: WARM_IP_TARGET
    value: "1"
I think something in your custom configuration is confusing dns-controller into thinking there is more than one IP per node, and it therefore adds all of them. So the IPs you see above are probably all the IPs set in the status of the node objects.
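One way to check that hypothesis, as a sketch using the control-plane node name from the output above, is to look at the addresses reported in the node object's status:
kubectl get node ip-10-12-3-70.eu-west-2.compute.internal \
  -o jsonpath='{.status.addresses}'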
But why was it not confusing for many versions before? :)
Updated Calico to the latest version; nothing changed. Tested intermediate dns-controller builds; the issue first appeared in 1.23.0-alpha.2. Suspected reason: https://github.com/kubernetes/kops/pull/12640
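For reference, intermediate builds can be tested by swapping the dns-controller image in place; a rough sketch (the image path and container name are assumptions, adjust to wherever the build is published):
kubectl -n kube-system set image deployment/dns-controller \
  dns-controller=registry.k8s.io/kops/dns-controller:1.23.0-alpha.2
kubectl -n kube-system rollout status deployment/dns-controller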
/cc @hakman @johngmyers
To diagnose further, we would need the YAML for the kube-apiserver pods and the control plane nodes they run on.
If it's adding "random" pods' IPs, we would also want the YAML of the pods whose IPs were incorrectly added.
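Something like the following should collect that, as a sketch; the k8s-app=kube-apiserver label selector and the node name are assumptions, adjust them to your cluster:
kubectl -n kube-system get pods -l k8s-app=kube-apiserver -o yaml > kube-apiserver-pods.yaml
kubectl get node ip-10-12-3-70.eu-west-2.compute.internal -o yaml > control-plane-node.yaml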
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
We are trying to get rid of Calico to see if it helps, but there is not much progress there, and no free time to create a proper reproduction scenario that doesn't depend on our cluster specifics.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Finally removed Calico from the master nodes today; unfortunately, that didn't help.
I0216 17:06:31.023393 1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.224.200 10.12.224.43 10.12.225.154 10.12.225.30 10.12.225.72 10.12.225.76 10.12.226.23 10.12.227.243]
I0216 17:06:36.763119 1 dnscontroller.go:586] Adding DNS changes to batch {A api.internal.domain.net.} [10.12.225.154 10.12.225.30 10.12.227.243]
kubectl get pods -o wide -A | grep '10.12.224.200\|10.12.224.43\|10.12.225.72\|10.12.225.76\|10.12.226.23'
devex-infra aws-node-termination-handler-vs2d2 1/1 Running 0 5h42m 10.12.225.72 ip-10-12-227-243.eu-west-2.compute.internal <none> <none>
kube-system aws-load-balancer-6845f45db6-m8bws 1/1 Running 0 5h47m 10.12.224.200 ip-10-12-227-243.eu-west-2.compute.internal <none> <none>
kube-system ebs-csi-controller-6d7446dcb5-5g8gx 6/6 Running 0 5h9m 10.12.225.76 ip-10-12-227-243.eu-west-2.compute.internal <none> <none>
kube-system ebs-csi-node-spzdz 3/3 Running 0 5h42m 10.12.226.23 ip-10-12-227-243.eu-west-2.compute.internal <none> <none>
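To show the mismatch directly, the resolved record can be compared against the addresses the node object reports; a sketch using the domain and node name from above:
dig +short api.internal.domain.net
kubectl get node ip-10-12-227-243.eu-west-2.compute.internal \
  -o jsonpath='{range .status.addresses[*]}{.type}{"\t"}{.address}{"\n"}{end}'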
Will try to set up a clean cluster to reproduce the issue, but that won't be fast.
And the issue is reproducible up to 1.27.0-alpha.1
Without providing the info requested above, there isn't anything we can do here.
Caused by https://github.com/kubernetes/cloud-provider-aws/issues/349; resolved by moving to the external CCM 1.24.
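For anyone hitting the same problem, a rough sketch of the move to the external CCM via the kops cluster spec (the cloudControllerManager field name should be verified against the docs for your kops version; my.example.com is the placeholder cluster name from above):
kops edit cluster --name my.example.com
# add to the spec (sketch):
#   cloudControllerManager:
#     cloudProvider: aws
kops update cluster --name my.example.com --yes
kops rolling-update cluster --name my.example.com --yes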