
Kubernetes 1.23.1-00 and Cloud-Provider 1.23.0 - error running controllers: failed to parse cidr value:"" with error:invalid CIDR address:

Open andrewstec opened this issue 2 years ago • 13 comments

What happened:

The aws-cloud-controller-manager keeps restarting because it is unable to find a CIDR value for the routes. As a result, we cannot use the AWS cloud controller manager to handle our AWS API communication.

ubuntu@ip-10-0-4-213:~$  kubectl get pods --namespace kube-system
NAME                                      READY   STATUS    RESTARTS      AGE
aws-cloud-controller-manager-sqzd5        1/1     Running   4 (57s ago)   3m3s

What you expected to happen:

I expected the aws-cloud-controller-manager container to operate normally so that my cluster could utilize AWS APIs.

How to reproduce it (as minimally and precisely as possible):

  1. Do a fresh install of Kubernetes 1.23.1-00 on an EC2 instance and initialize it as the control plane.
  2. Replace the KUBELET_KUBECONFIG_ARGS line in the kubelet systemd drop-in with Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --cloud-provider=external", which adds the --cloud-provider=external flag.
  3. Reload the service configuration and restart kubelet using systemctl (see the sketch after this list).
  4. Install the AWS Cloud Provider according to the instructions for release-1.23.0:
helm repo add aws-cloud-controller-manager https://kubernetes.github.io/cloud-provider-aws
helm repo update
helm upgrade --install aws-cloud-controller-manager aws-cloud-controller-manager/aws-cloud-controller-manager
  5. Run kubectl get pods --namespace kube-system to view the restarting container.
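
For reference, steps 2 and 3 might look roughly like this on a kubeadm-installed node; the drop-in path below is the usual kubeadm location on Ubuntu and may differ on other setups:

# Edit the kubelet systemd drop-in (path assumes a standard kubeadm install)
sudo vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# ...change the KUBELET_KUBECONFIG_ARGS line to:
# Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --cloud-provider=external"
sudo systemctl daemon-reload
sudo systemctl restart kubelet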

Anything else we need to know?:

You can view the logs of the API communication with AWS inside the container with the command: kubectl logs aws-cloud-controller-manager-<REPLACEWITHACTIVECONTAINERID> --namespace kube-system

I0612 16:20:17.191966       1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController
W0612 16:20:17.192377       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0612 16:20:17.193205       1 aws.go:1300] Building AWS cloudprovider
I0612 16:20:17.193269       1 aws.go:1260] Zone not specified in configuration file; querying AWS metadata service
I0612 16:20:23.674080       1 tags.go:77] AWS cloud filtering on ClusterID: gothboots
I0612 16:20:23.674150       1 aws.go:1407] The following IP families will be added to nodes: [ipv4]
I0612 16:20:23.674234       1 controllermanager.go:144] Version: v0.0.0-master+$Format:%H$
I0612 16:20:23.679590       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1655050816\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1655050816\" (2022-06-12 15:20:15 +0000 UTC to 2023-06-12 15:20:15 +0000 UTC (now=2022-06-12 16:20:23.679555456 +0000 UTC))"
I0612 16:20:23.679864       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1655050817\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1655050816\" (2022-06-12 15:20:16 +0000 UTC to 2023-06-12 15:20:16 +0000 UTC (now=2022-06-12 16:20:23.679833018 +0000 UTC))"
I0612 16:20:23.679896       1 secure_serving.go:200] Serving securely on [::]:10258
I0612 16:20:23.680156       1 leaderelection.go:248] attempting to acquire leader lease kube-system/cloud-controller-manager...
I0612 16:20:23.680470       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0612 16:20:23.680595       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0612 16:20:23.680706       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0612 16:20:23.680840       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0612 16:20:23.680860       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0612 16:20:23.680938       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0612 16:20:23.680956       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0612 16:20:23.781045       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 
I0612 16:20:23.781046       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController 
I0612 16:20:23.781475       1 tlsconfig.go:178] "Loaded client CA" index=0 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"front-proxy-ca\" [] validServingFor=[front-proxy-ca] issuer=\"<self>\" (2022-06-12 16:07:38 +0000 UTC to 2032-06-09 16:07:38 +0000 UTC (now=2022-06-12 16:20:23.781428087 +0000 UTC))"
I0612 16:20:23.781670       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1655050816\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1655050816\" (2022-06-12 15:20:15 +0000 UTC to 2023-06-12 15:20:15 +0000 UTC (now=2022-06-12 16:20:23.781649295 +0000 UTC))"
I0612 16:20:23.781846       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1655050817\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1655050816\" (2022-06-12 15:20:16 +0000 UTC to 2023-06-12 15:20:16 +0000 UTC (now=2022-06-12 16:20:23.781818898 +0000 UTC))"
I0612 16:20:23.781067       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 
I0612 16:20:23.782120       1 tlsconfig.go:178] "Loaded client CA" index=0 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2022-06-12 16:07:38 +0000 UTC to 2032-06-09 16:07:38 +0000 UTC (now=2022-06-12 16:20:23.782101859 +0000 UTC))"
I0612 16:20:23.782208       1 tlsconfig.go:178] "Loaded client CA" index=1 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"front-proxy-ca\" [] validServingFor=[front-proxy-ca] issuer=\"<self>\" (2022-06-12 16:07:38 +0000 UTC to 2032-06-09 16:07:38 +0000 UTC (now=2022-06-12 16:20:23.78219022 +0000 UTC))"
I0612 16:20:23.782375       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1655050816\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1655050816\" (2022-06-12 15:20:15 +0000 UTC to 2023-06-12 15:20:15 +0000 UTC (now=2022-06-12 16:20:23.78235841 +0000 UTC))"
I0612 16:20:23.782489       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1655050817\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1655050816\" (2022-06-12 15:20:16 +0000 UTC to 2023-06-12 15:20:16 +0000 UTC (now=2022-06-12 16:20:23.782473702 +0000 UTC))"
I0612 16:20:38.724319       1 leaderelection.go:258] successfully acquired lease kube-system/cloud-controller-manager
I0612 16:20:38.725343       1 event.go:294] "Event occurred" object="kube-system/cloud-controller-manager" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="aws-cloud-controller-manager-sqzd5_784da6b2-30ef-44e5-9048-9bbf1e1f40d6 became leader"
I0612 16:20:38.880477       1 aws.go:826] Setting up informers for Cloud
I0612 16:20:38.880520       1 controllermanager.go:279] Starting "service"
I0612 16:20:38.880949       1 controllermanager.go:298] Started "service"
I0612 16:20:38.880969       1 controllermanager.go:279] Starting "route"
E0612 16:20:38.880981       1 controllermanager.go:282] Error starting "route"
F0612 16:20:38.880988       1 controllermanager.go:189] error running controllers: failed to parse cidr value:"" with error:invalid CIDR address: 
goroutine 244 [running]:
k8s.io/klog/v2.stacks(0x1)
	k8s.io/klog/[email protected]/klog.go:1038 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x3eae440, 0x3, 0x0, 0xc0002e2bd0, 0x0, {0x30c3a2d, 0x1}, 0xc0006e29a0, 0x0)
	k8s.io/klog/[email protected]/klog.go:987 +0x5fd
k8s.io/klog/v2.(*loggingT).printf(0xc000550140, 0x299f390, 0x0, {0x0, 0x0}, {0x2548505, 0x1d}, {0xc0006e29a0, 0x1, 0x1})
	k8s.io/klog/[email protected]/klog.go:753 +0x1c5
k8s.io/klog/v2.Fatalf(...)
	k8s.io/klog/[email protected]/klog.go:1532
k8s.io/cloud-provider/app.Run.func1({0x29bd3b0, 0xc000624a40}, 0x0)
	k8s.io/[email protected]/app/controllermanager.go:189 +0x327
k8s.io/cloud-provider/app.Run.func2({0x29bd3b0, 0xc000624a40})
	k8s.io/[email protected]/app/controllermanager.go:234 +0xe4
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	k8s.io/[email protected]/tools/leaderelection/leaderelection.go:211 +0x154

goroutine 1 [select (no cases)]:

Environment:

  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS configured with Terraform
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
  • Kernel (e.g. uname -a): Linux ip-10-0-4-213 5.4.0-1078-aws #84~18.04.1-Ubuntu SMP Fri Jun 3 12:59:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: kubeadm

  • Others:

/kind bug

andrewstec avatar Jun 12 '22 16:06 andrewstec

@andrewstec: This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 12 '22 16:06 k8s-ci-robot

Looks like you aren't passing on the cluster CIDR.

olemarkus avatar Jun 12 '22 16:06 olemarkus

Looks like you aren't passing on the cluster CIDR.

That may be a good point. It looks like that based on the logs inside the aws-cloud-controller-manager container:

ubuntu@ip-10-0-4-55:/etc/kubernetes/manifests$ kubectl logs aws-cloud-controller-manager-bx6hd  --namespace kube-system
I0612 17:09:25.663303       1 flags.go:64] FLAG: --add_dir_header="false"
I0612 17:09:25.663347       1 flags.go:64] FLAG: --address="0.0.0.0"
I0612 17:09:25.663354       1 flags.go:64] FLAG: --allocate-node-cidrs="false"
I0612 17:09:25.663360       1 flags.go:64] FLAG: --allow-untagged-cloud="false"
I0612 17:09:25.663375       1 flags.go:64] FLAG: --alsologtostderr="false"
I0612 17:09:25.663387       1 flags.go:64] FLAG: --authentication-kubeconfig=""
I0612 17:09:25.663392       1 flags.go:64] FLAG: --authentication-skip-lookup="false"
I0612 17:09:25.663396       1 flags.go:64] FLAG: --authentication-token-webhook-cache-ttl="10s"
I0612 17:09:25.663402       1 flags.go:64] FLAG: --authentication-tolerate-lookup-failure="false"
I0612 17:09:25.663406       1 flags.go:64] FLAG: --authorization-always-allow-paths="[/healthz,/readyz,/livez]"
I0612 17:09:25.663416       1 flags.go:64] FLAG: --authorization-kubeconfig=""
I0612 17:09:25.663420       1 flags.go:64] FLAG: --authorization-webhook-cache-authorized-ttl="10s"
I0612 17:09:25.663425       1 flags.go:64] FLAG: --authorization-webhook-cache-unauthorized-ttl="10s"
I0612 17:09:25.663429       1 flags.go:64] FLAG: --bind-address="0.0.0.0"
I0612 17:09:25.663433       1 flags.go:64] FLAG: --cert-dir=""
I0612 17:09:25.663437       1 flags.go:64] FLAG: --cidr-allocator-type="RangeAllocator"
I0612 17:09:25.663442       1 flags.go:64] FLAG: --client-ca-file=""
I0612 17:09:25.663445       1 flags.go:64] FLAG: --cloud-config=""
I0612 17:09:25.663449       1 flags.go:64] FLAG: --cloud-provider="aws"
I0612 17:09:25.663453       1 flags.go:64] FLAG: --cluster-cidr=""
I0612 17:09:25.663458       1 flags.go:64] FLAG: --cluster-name="kubernetes"

However, the cluster CIDR is set to 10.0.0.0/16 in kube-controller-manager.yaml:

ubuntu@ip-10-0-4-55:/etc/kubernetes/manifests$ sudo cat kube-controller-manager.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=false
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=127.0.0.1
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-cidr=10.0.0.0/16

andrewstec avatar Jun 12 '22 17:06 andrewstec

The purpose of Cloud Controller Manager is to separate the cloud specifics out of Kube Controller Manager. Hence they run much of the same code. You need to configure CCM with many of the same settings as KCM.
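
For instance, one way to see which of these settings the running kube-controller-manager already has on a kubeadm control plane (assuming the standard static pod manifest location) is:

grep -E -e '--(cluster-cidr|allocate-node-cidrs|cluster-name)' /etc/kubernetes/manifests/kube-controller-manager.yaml

Those flags can then be mirrored in the CCM arguments, e.g. via the Helm chart's values.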

olemarkus avatar Jun 12 '22 17:06 olemarkus

The purpose of Cloud Controller Manager is to separate the cloud specifics out of Kube Controller Manager. Hence they run much of the same code. You need to configure CCM with many of the same settings as KCM.

This was great information. It wasn't totally obvious to me where the configuration is set (i.e., does it come from AWS based on the resource kubernetes.io/cluster/ tags, from the kube-controller-manager YAML file, or from however it was injected into this CCM container). Your feedback was super helpful, @olemarkus.

Instead of following the instructions verbatim, I ran these commands:

helm repo add aws-cloud-controller-manager https://kubernetes.github.io/cloud-provider-aws
helm repo update
helm fetch aws-cloud-controller-manager/aws-cloud-controller-manager --untar

I then added this to the values.yaml file in the untarred contents:

  - --allocate-node-cidrs=false
  - --cluster-cidr=172.20.0.0/16 
  - --cluster-name=<clustername>
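
For context, a minimal sketch of where those flags end up in values.yaml, assuming the chart exposes the container arguments under an args key (key names may differ between chart versions):

args:
  # ...keep the chart's existing default arguments...
  - --allocate-node-cidrs=false
  - --cluster-cidr=172.20.0.0/16
  - --cluster-name=<clustername>  # replace with your actual cluster name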

One can then install the cloud controller manager from the local filesystem with Helm:

helm install aws-cloud-controller-manager .

It works. I was able to create an ingress-nginx load balancer with an external IP. There are still issues that I am figuring out, such as the hostname naming conventions that AWS requires, but I think those should be fixed by this feature https://github.com/kubernetes/cloud-provider-aws/pull/286:

E0612 21:07:38.441437       1 node_controller.go:242] Error getting instance metadata for node addresses: error fetching node by provider ID: Invalid format for AWS instance (), and error by node name: getInstanceByNodeName failed for "ip-172-31-15-237" with "instance not found"

Thanks again!

andrewstec avatar Jun 12 '22 21:06 andrewstec

/remove-kind bug
/kind support

olemarkus avatar Jun 13 '22 05:06 olemarkus

Hello @andrewstec, I have a similar issue. Could you answer a few questions, please? Did I understand correctly that the --cluster-cidr key should be set to the CIDR of the VPC network where the EC2 instances run? Also, could you show the policies that you use for CCM, please? My policies differ from the documentation example due to security requirements.

qx10 avatar Jun 15 '22 13:06 qx10

--cluster-cidr is the CIDR used for Pod IPs. What to set it to depends on the CNI and how its IPAM works. Some CNIs do not use the node's podCIDR list at all; e.g., Calico does not by default.
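
If it helps, one way to check what (if any) pod CIDR has been allocated to each node, keeping in mind that some CNIs ignore it, is:

kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR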

olemarkus avatar Jun 15 '22 18:06 olemarkus

Hello @andrewstec, I have a similar issue. Could you answer a few questions, please? Did I understand correctly that the --cluster-cidr key should be set to the CIDR of the VPC network where the EC2 instances run? Also, could you show the policies that you use for CCM, please? My policies differ from the documentation example due to security requirements.

@qx10 My policies were pretty much the example from the documentation, minus the first three autoscaling lines, because I was getting an error about the policy file being too big. @olemarkus is likely correct about the --cluster-cidr input and Calico; it seems like he knows this module and Kubernetes in depth. I made an educated guess because that's the CIDR range that was carved out for my VPC in Terraform. I still have more testing to do when I get a chance. I may have to make that CIDR range the private subnet based on what olemarkus said, as pod IPs are different from the cluster IPs. I hope this helps! If you figure it out, please let us know :-)

andrewstec avatar Jun 16 '22 14:06 andrewstec

@andrewstec @olemarkus thank you so much for your answers! Yes, I use Calico in my cluster, and the autoscaling actions are present in my policy. Unfortunately, I'm unable to devote much time to this task. I'll let you know about the results when I resolve the problem.

qx10 avatar Jun 17 '22 07:06 qx10

There is a ./hack/kops-example.sh script that provisions a working kops cluster using CCM. It may be worth looking at that configuration and copying it for your own setup.

olemarkus avatar Jun 17 '22 18:06 olemarkus

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 15 '22 18:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 15 '22 19:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 14 '22 19:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 14 '22 19:11 k8s-ci-robot

Why do you provide a helm chart that can't be used without manually editing the output? Or am I missing something here?

discordianfish avatar May 03 '23 13:05 discordianfish