kubespray 2.18.0 calico fails without local loadbalancer

Talangor opened this issue 3 years ago · 9 comments

Hi guys, and thank you for your hard work. Previously I had installed a Kubernetes cluster with kubespray and the Weave CNI without any problem (kubespray 2.18.0), but since we need BGP functionality we decided to move to the Calico CNI. For a week I have tried the default configuration and the config you see today, tested with Kubernetes 1.23.6 down to 1.22.2, with no success. I have been searching and found out that if I run the localhost load balancer everything works as expected, but I don't want to use a local (nginx, haproxy) load balancer. Is it mandatory to have use_localhost_as_kubeapi_loadbalancer: true?
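
For reference, the relevant parts of my inventory look roughly like this (a sketch; the file paths follow the kubespray sample layout):

  # group_vars/k8s_cluster/k8s-cluster.yml
  kube_network_plugin: calico        # switched here from weave

  # group_vars/all/all.yml
  # use_localhost_as_kubeapi_loadbalancer: true   # the setting in question, currently left unset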

Environment:

  • Cloud provider or hardware configuration: bare-metal installation

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 5.4.0-113-generic x86_64
    NAME="Ubuntu"
    VERSION="20.04.4 LTS (Focal Fossa)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 20.04.4 LTS"
    VERSION_ID="20.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=focal
    UBUNTU_CODENAME=focal

  • Version of Ansible (ansible --version):
    ansible [core 2.12.5]
    config file = /home/ubuntu/kubespray-v2.18.1/ansible.cfg
    configured module search path = ['/home/ubuntu/kubespray-v2.18.1/library']
    ansible python module location = /usr/local/lib/python3.8/dist-packages/ansible
    ansible collection location = /home/ubuntu/.ansible/collections:/usr/share/ansible/collections
    executable location = /usr/local/bin/ansible
    python version = 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
    jinja version = 2.11.3
    libyaml = True

  • Version of Python (python --version): Python 3.8.10

Kubespray version (commit) (git rev-parse --short HEAD): 85bd1eea (2.18.1)

Network plugin used: calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

Command used to invoke ansible: ansible-playbook -i inventory/pre-production/hosts.yaml --become -u sadmin -K cluster.yml

Output of ansible run:

calico kube controller log:

All pods that need Calico to create a network for them fail with the log below:

Thanks in advance for taking the time.

Talangor avatar May 24 '22 15:05 Talangor

use_localhost_as_kubeapi_loadbalancer: true is only needed when using Calico with eBPF; if you don't set it, kubespray defaults to false. The sample specifically states that this setting is there for Cilium, but it is needed for Calico in eBPF mode as well. Otherwise it's not needed.
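
Put differently, the two settings travel together, roughly like this (a sketch; file names follow the sample inventory):

  # group_vars/k8s_cluster/k8s-net-calico.yml
  calico_bpf_enabled: true                        # eBPF dataplane

  # group_vars/all/all.yml
  use_localhost_as_kubeapi_loadbalancer: true     # required only with the eBPF dataplane

With calico_bpf_enabled: false (the default), use_localhost_as_kubeapi_loadbalancer can stay at its default of false.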

Could you be more precise with regard to the error you are seeing when setting this to False?

cristicalin avatar May 24 '22 17:05 cristicalin

@cristicalin thanks for the fast response. I reinstalled with calico_bpf_enabled: false and the use_localhost_as_kubeapi_loadbalancer line commented out (I thought you meant disabling BPF). Should I reinstall with use_localhost_as_kubeapi_loadbalancer: false? As far as I can tell it's disabled by default. Result: no change, except the API URL changed to the api-service IP address.
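
Concretely, my changes amount to this (sketch):

  # group_vars/k8s_cluster/k8s-net-calico.yml
  calico_bpf_enabled: false

  # group_vars/all/all.yml
  # use_localhost_as_kubeapi_loadbalancer: true   # still commented out, so the kubespray default (false) applies

The pod events still show: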

  Warning  FailedCreatePodSandBox  18m                  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "10a355bb378aa245368c5c9ac05f3f4045e6aeadc8af25167c8a1a808b70782d": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable
  Warning  FailedCreatePodSandBox  15m                  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b14793930f295689742ca2112f49174c7634ec121283924955e83925fc2d5898": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
  Warning  FailedCreatePodSandBox  13m                  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "20e2ec4dd4d96b68f45d37dfe587e76bf3e19341b353aff57bd28545b2b467c4": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
  Warning  FailedCreatePodSandBox  10m                  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5ff44b8fd7abd7ef5c9aedf2f3aad5407b4c1b1b82e8e171dde1319b4620695a": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
  Warning  FailedCreatePodSandBox  8m16s                kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f581e717e804dc43e5f6c0c814efe5270227846edc317a546f2d0e847f0d0dcf": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable
  Warning  FailedCreatePodSandBox  30s (x3 over 5m30s)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b5bf26f8e2ac33cf2ce5c01105745670d6e19ea652bb18e27aaab32b7f8cae70": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable

Talangor avatar May 24 '22 18:05 Talangor

The thing is that the default configuration available in kubespray does this too. My past experience with kubespray was that I could deploy a Kubernetes cluster with the default yaml files and get it to work, but this time I can't get it to work. I was suspecting RBAC or a compatibility issue between the Calico version and the kube version, but since it works with use_localhost_as_kubeapi_loadbalancer I discarded this thought. Excuse me for my lack of knowledge. Just a thought: isn't this issue due to kube-proxy and kubeadm refusing to serve the API due to policy?

For more info (clarification):

  • calico kube controller log (lines truncated in the capture):
    W0525 07:23:09.449630 1 reflector.go:436] pkg/mod/github.com/projectcalico/[email protected]/tools/cache/reflector
    2022-05-25 07:23:09.449 [INFO][1] watchercache.go 97: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/n
    2022-05-25 07:23:09.450 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/ipam/v2/assignment/" error=G
    2022-05-25 07:23:09.450 [INFO][1] watchercache.go 245: Failed to create watcher ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=Get "https:/
    2022-05-25 07:23:09.450 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
    2022-05-25 07:23:09.450 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.o
    2022-05-25 07:23:10.446 [WARNING][1] runconfig.go 161: unable to get KubeControllersConfiguration(default) error=Get "https://10.233.0.1:443/apis/crd.proj
    2022-05-25 07:23:10.450 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
    2022-05-25 07:23:10.450 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
    2022-05-25 07:23:10.451 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/ipam/v2/assignment/" error=G
    2022-05-25 07:23:10.451 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.o
    E0525 07:23:10.801148 1 reflector.go:138] pkg/mod/github.com/projectcalico/[email protected]/tools/cache/reflector
    2022-05-25 07:23:11.448 [WARNING][1] runconfig.go 161: unable to get KubeControllersConfiguration(default) error=Get "https://10.233.0.1:443/apis/crd.proj
    2022-05-25 07:23:11.454 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
    2022-05-25 07:23:11.454 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
    2022-05-25 07:23:11.454 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/ipam/v2/assignment/" error=G
    2022-05-25 07:23:11.454 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.o
    2022-05-25 07:23:11.708 [ERROR][1] client.go 272: Error getting cluster information config ClusterInformation="default" error=Get "https://10.233.0.1:443/
    2022-05-25 07:23:11.708 [ERROR][1] main.go 226: Failed to verify datastore error=Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformat
    2022-05-25 07:23:11.708 [ERROR][1] main.go 257: Failed to reach apiserver error=Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformati
    2022-05-25 07:23:12.449 [WARNING][1] runconfig.go 161: unable to get KubeControllersConfiguration(default) error=Get "https://10.233.0.1:443/apis/crd.proj
    2022-05-25 07:23:12.455 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"

  • calico pods:
    Warning  Unhealthy  33m (x2 over 33m)  kubelet  Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp 127.0.0.1:9099: connect: connection refused
    Warning  Unhealthy  33m                kubelet  Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503

  • other pods:
    Warning  FailedCreatePodSandBox  14m                   kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "812cde7dc2338ecec5a205dd437943bba4a6cb21e761ae53d4be1a693acd6814": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable
    Warning  FailedCreatePodSandBox  11m                   kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d771f1ba4b37cb3d5be2ab5b2441be59242facbbda25c5976a271c4359c4e53e": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
    Warning  FailedCreatePodSandBox  9m31s                 kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f0c1377a528f09735df62a331928ef62305182398f1468fc1be8fb2a4bc1a781": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
    Warning  FailedCreatePodSandBox  106s (x3 over 6m46s)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "594499d0afbb43fcdcc18cfb00d3a9eae92572857322bf91f55c2114e1562911": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable

Talangor avatar May 25 '22 08:05 Talangor

@cristicalin I tried to install with weave and this happened: https://github.com/kubernetes-sigs/kubespray/issues/8881. What's happening here, am I so far off? I'm sure you guys have tested the code, but it's really strange.

Talangor avatar May 27 '22 15:05 Talangor

It seems that these proxy settings are propagated down to some process calling the https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default API.

Finally I added NO_PROXY for all the private subnets (e.g. 10.233.0.0/16, 10.233.64.0/16) and that fixed the issue.

I suggest putting the cluster domain (.cluster.local) and the network CIDRs in the default no_proxy configuration.

Talangor avatar May 29 '22 10:05 Talangor

It seems that these proxy settings are propagated down to some process calling the https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default API.

Finally I added NO_PROXY for all the private subnets (e.g. 10.233.0.0/16, 10.233.64.0/16) and that fixed the issue.

I suggest putting the cluster domain (.cluster.local) and the network CIDRs in the default no_proxy configuration.

So this comes from setting http_proxy in your environment? Unfortunately we don't have a CI test case for this scenario, so it's difficult to catch when it breaks. Personally, my environments don't require a proxy, so it's not a part of the code I see often.

If you want to push a PR with the code you changed we are happy to review and include it.

cristicalin avatar May 29 '22 19:05 cristicalin

I added it like this in group_vars/all/all.yml:

  no_proxy: "node01,node02,node03,node04,node05,localhost,127.0.0.0,127.0.1.1,127.0.1.1,10.233.0.0/18,10.233.64.0/18,.cluster.local,local.home"

but if we want it to use variables, maybe it should be like this:

in roles/kubespray-defaults/defaults/main.yaml (role defaults have the lowest precedence, so an inventory-level no_proxy still overrides these):

  no_proxy: "{{ kube_service_addresses }},{{ kube_pods_subnet }},.{{ cluster_name }}"
  NO_PROXY: "{{ no_proxy }}"

in inventory/sample/group_vars/all/all.yml:

  # Refer to roles/kubespray-defaults/defaults/main.yml before modifying no_proxy
  # Make sure you add kube_service_addresses, kube_pods_subnet and cluster_name
  no_proxy: "{{ kube_service_addresses }},{{ kube_pods_subnet }},.{{ cluster_name }}"

Unfortunately, I won't have my test lab for some time now, so it's best if you could review it; if not, I'll add this to my todo list and test it later on. I'm truly sorry, I should test before giving suggestions, but I'm helpless right now. Maybe it helps a bit, though.
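
As a quick sanity check (mirroring the debug invocation from the issue template above), the rendered value can be printed per host before rerunning the playbook:

  ansible -i inventory/pre-production/hosts.yaml all -m debug -a "var=no_proxy"

Every host should report a no_proxy containing the service CIDR, the pod CIDR, and .cluster.local.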

Talangor avatar May 30 '22 15:05 Talangor

@cristicalin update: fortunately, I had the opportunity to test this code and it's working as expected.

Talangor avatar May 31 '22 13:05 Talangor

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 29 '22 13:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 28 '22 13:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Oct 28 '22 14:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 28 '22 14:10 k8s-ci-robot

/reopen

vyom-soft avatar Mar 16 '23 21:03 vyom-soft

@vyom-soft: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 16 '23 21:03 k8s-ci-robot

Hello, I am seeing the following error:

Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               72s   default-scheduler  Successfully assigned kube-system/kube-proxy-jhf8d to node5
  Warning  FailedCreatePodSandBox  12s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1f81d650b5e9f17d4a01973d52b53417352babd39290e1267770d6b2141f6a8b": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.233.0.1:443: i/o timeout

vyom-soft avatar Mar 16 '23 21:03 vyom-soft

hi @vyom-soft, are you using a proxy in your deployment? If so, you should either set the correct exceptions for your cluster or use an offline installation and avoid the proxy altogether. In my case, when I was using a proxy, my cluster would send its entire traffic through it, and that caused numerous problems.
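
For example, something along these lines in group_vars/all/all.yml (a sketch; the proxy address is hypothetical, and the CIDRs must match your kube_service_addresses and kube_pods_subnet):

  http_proxy: "http://proxy.example.internal:3128"    # hypothetical upstream proxy
  https_proxy: "http://proxy.example.internal:3128"
  no_proxy: "localhost,127.0.0.1,10.233.0.0/18,10.233.64.0/18,.cluster.local"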

Talangor avatar Mar 17 '23 05:03 Talangor