Enabling Calico eBPF makes a mess with kube-apiserver
Hi! kube-apiserver (and therefore kubectl) starts to freeze (respond slowly) after enabling the eBPF dataplane. Any thoughts?
Expected Behavior
kubectl works as expected
Current Behavior
After enabling eBPF mode (operator install), kubectl takes ~5-10 seconds to respond to commands.
Possible Solution
- Restart the kube-apiserver container
- Disable Calico eBPF mode
Steps to Reproduce (for bugs)
- Install kubernetes with kubeadm
- Install calico with operator mode
- Add the kubernetes-services-endpoint ConfigMap
- Enable eBPF for Calico
- Wait 1-5 minutes
- kubectl get nodes
Context
After enabling the Calico eBPF dataplane, I tried to create pods and noticed that kubectl started to work really slowly (each request takes around 5-10 s). I don't know what causes the issue, but if you manually kill the kube-apiserver container (I'm killing it with docker), it starts to work properly again.
Your Environment
- Calico version: 3.21.4/3.23.1 (operator install). Encapsulation: None (nodes in the same network)
- Orchestrator version: Kubernetes 1.22.4/1.22.9/1.22.10 (Docker 20.10.11/20.10.16)
- Operating System and version: Ubuntu 20.04/22.04 LTS
Logs from kubectl
command: kubectl get nodes
I0601 11:13:41.057935 319065 round_trippers.go:435] curl -v -XGET -H "User-Agent: kubectl/v1.22.9 (linux/amd64) kubernetes/6df4433" -H "Accept: application/json, */*" 'https://10.101.3.50:6443/apis/projectcalico.org/v3?timeout=32s'
I0601 11:13:41.057937 319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/storage.k8s.io/v1beta1/serverresources.json
I0601 11:13:41.057957 319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/policy/v1beta1/serverresources.json
I0601 11:13:41.057972 319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/node.k8s.io/v1/serverresources.json
I0601 11:13:41.057977 319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/scheduling.k8s.io/v1/serverresources.json
I0601 11:13:41.057978 319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/v1/serverresources.json
I0601 11:13:46.059599 319065 round_trippers.go:454] GET https://10.101.3.50:6443/apis/projectcalico.org/v3?timeout=32s 503 Service Unavailable in 5001 milliseconds
I0601 11:13:46.059618 319065 round_trippers.go:460] Response Headers:
I0601 11:13:46.059626 319065 round_trippers.go:463] Content-Length: 57
I0601 11:13:46.059631 319065 round_trippers.go:463] Date: Wed, 01 Jun 2022 11:13:46 GMT
I0601 11:13:46.059636 319065 round_trippers.go:463] Audit-Id: d7a9b592-3ddb-4430-a0c4-0b24cfffc8a1
I0601 11:13:46.059641 319065 round_trippers.go:463] Cache-Control: no-cache, private
I0601 11:13:46.059646 319065 round_trippers.go:463] Content-Type: text/plain; charset=utf-8
I0601 11:13:46.059650 319065 round_trippers.go:463] X-Content-Type-Options: nosniff
I0601 11:13:46.059655 319065 round_trippers.go:463] X-Kubernetes-Pf-Flowschema-Uid: 34c8a48b-8612-464e-a5d7-55d361044238
I0601 11:13:46.059660 319065 round_trippers.go:463] X-Kubernetes-Pf-Prioritylevel-Uid: d61ab1f5-fcf1-4006-b7a0-1527b6e13d93
I0601 11:13:46.060338 319065 request.go:1181] Response Body: error trying to reach service: context deadline exceeded
I0601 11:13:46.060960 319065 request.go:1372] body was not decodable (unable to check for Status): Object 'Kind' is missing in 'error trying to reach service: context deadline exceeded
'
I0601 11:13:46.060976 319065 cached_discovery.go:78] skipped caching discovery info due to the server is currently unable to handle the request
I0601 11:13:46.062637 319065 round_trippers.go:435] curl -v -XGET -H "User-Agent: kubectl/v1.22.9 (linux/amd64) kubernetes/6df4433" -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" 'https://10.101.3.50:6443/api/v1/nodes?limit=500'
I0601 11:13:46.064113 319065 round_trippers.go:454] GET https://10.101.3.50:6443/api/v1/nodes?limit=500 200 OK in 1 milliseconds
Okay. So, from the kubectl logs I can see that, for some reason:
- kubectl makes a request to https://10.101.3.50:6443/apis/projectcalico.org/v3?timeout=32s
- After 5 s the request fails (503 Service Unavailable in 5001 milliseconds)
- kubectl retries https://10.101.3.50:6443/apis/projectcalico.org/v3?timeout=32s
- kubectl then requests https://10.101.3.50:6443/api/v1/nodes?limit=500, which returns 200 OK in 1 millisecond, and gets the result from the API server
That explains the ~10-second delay before the result.
But why kubectl tries to reach /apis/projectcalico.org/v3?timeout=32s, and why that request fails, is the mystery.
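The failing path points at aggregated API discovery: installing the Calico APIServer resource registers an APIService named `v3.projectcalico.org`, and kubectl's discovery phase queries every registered API group, so an unreachable Calico API server stalls each invocation until the aggregator times out (`kubectl get apiservice v3.projectcalico.org` should show whether it is Available). Slow requests like this can be picked out of verbose kubectl output with a small filter; a sketch, assuming the round_trippers.go log format shown above:

```shell
# Flag kubectl requests slower than 1000 ms in verbose (-v=8) output.
# Assumes round_trippers.go lines ending in "... in <N> milliseconds",
# as in the log above.
kubectl get nodes -v=8 2>&1 | awk '/round_trippers.*milliseconds/ {
  for (i = 1; i < NF; i++)
    if ($(i + 1) == "milliseconds" && $i + 0 > 1000) { print; break }
}'
```

This prints only the slow calls, making the 5001 ms discovery request stand out from the 1 ms node list.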
Hi! Do you have any update?
@tomastigera @fasaxc might be able to provide some insight.
I don't believe we have seen this behavior in our own testing.
You say the API server is in a container? Can you explain exactly how you set up the API server with kubeadm? Did you create a container and then run kubeadm inside, or is creating the container something kubeadm did?
I'm wondering if you're trying to simulate a k8s cluster in Docker or something like that (in which case there are gotchas with BPF mode; it can't be fully containerised).
Hi! Yes, the API server is in a container. I didn't set up the server manually; it was part of the kubeadm init task.
By 'Restart kube-apiserver container' I meant this command:
docker kill $(docker ps | grep "k8s_kube-apiserver" | awk '{print $1}')
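On clusters where kubelet uses containerd rather than Docker, a hedged equivalent using crictl (assuming crictl is installed and configured for the node's runtime socket):

```shell
# Find the kube-apiserver container and stop it; kubelet recreates it,
# since kube-apiserver runs as a static pod.
crictl ps --name kube-apiserver -q | xargs -r crictl stop
```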
(pinging @tomastigera @fasaxc)
Kubernetes was installed with the pod network 10.233.64.0/18.
Maybe it will be useful. This is my custom-resources file:
```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    nodeAddressAutodetectionV4:
      cidrs:
        - 10.101.3.0/24
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
      - blockSize: 26
        cidr: 10.233.64.0/18
        encapsulation: None
        natOutgoing: Disabled
        nodeSelector: all()
---
# This section configures the Calico API server.
# For more information, see: https://projectcalico.docs.tigera.io/v3.23/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
```
UPD: I was able to reproduce the same issue on 2 clusters (2-node setup: 1 control-plane and 1 worker node)
Could you please describe the exact steps for setting up such a cluster? That would help us to repro and diagnose it. Thanks
Tested with a 2-node cluster.
Each host has 2 NICs (one SSH interface, one node interface on 10.101.3.0/24).
- Create a kubeadm cluster with `kubeadm init --config ./config-init.yaml`:
```yaml
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
  - groups:
      - system:bootstrappers:kubeadm:default-node-token
    token: abcdef.0123456789abcdef
    ttl: 24h0m0s
    usages:
      - signing
      - authentication
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.101.3.50
  bindPort: 6443
nodeRegistration:
  taints: null
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: 1.22.9
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.233.0.0/18
  podSubnet: 10.233.64.0/18
scheduler: {}
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
reservedSystemCPUs: 0-1
```
- Join the second node using the join command from the init output
- Install Calico via the tigera-operator: `kubectl create -f https://projectcalico.docs.tigera.io/manifests/tigera-operator.yaml`, then `kubectl create -f` the custom resources:
```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    nodeAddressAutodetectionV4:
      cidrs:
        - 10.101.3.0/24
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
      - blockSize: 26
        cidr: 10.233.64.0/18
        encapsulation: None
        natOutgoing: Disabled
        nodeSelector: all()
---
# This section configures the Calico API server.
# For more information, see: https://projectcalico.docs.tigera.io/v3.23/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
```
- `kubectl create -f tigera-operator-configMap.yaml`:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: tigera-operator
data:
  KUBERNETES_SERVICE_HOST: "10.101.3.50"
  KUBERNETES_SERVICE_PORT: "6443"
```
- Enable eBPF mode:

```shell
kubectl patch ds -n kube-system kube-proxy -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-calico": "true"}}}}}'
kubectl patch installation.operator.tigera.io default --type merge -p '{"spec":{"calicoNetwork":{"linuxDataplane":"BPF", "hostPorts":null}}}'
```
- Run some network workload for 5-15 minutes
- `kubectl get pods`/`kubectl get nodes` (output here: https://github.com/projectcalico/calico/issues/6168#issuecomment-1143500246)
I can try to reproduce the issue and send more logs from kubernetes/calico/etcd.
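For the next reproduction, a hedged sketch of what to collect first (resource and namespace names assume a default operator install; it skips cleanly where kubectl is unavailable):

```shell
#!/bin/sh
# Hedged diagnostics sketch for the slow-kubectl symptom above.
if command -v kubectl >/dev/null 2>&1; then
  # Is the aggregated Calico API registered and Available?
  kubectl get apiservice v3.projectcalico.org -o wide

  # Component health as reported by the tigera-operator
  kubectl get tigerastatus

  # Calico API server logs (the operator installs it in the
  # calico-apiserver namespace)
  kubectl logs -n calico-apiserver deployment/calico-apiserver --tail=100

  # Confirm the dataplane really switched to BPF
  kubectl get installation default \
    -o jsonpath='{.spec.calicoNetwork.linuxDataplane}{"\n"}'
fi
```

If the APIService shows `Available: False` while the nodes query is fast, that would match the 503s seen in the discovery phase.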
@PanKaker Once the kubernetes-services-endpoint ConfigMap is created, does it trigger a restart of calico-node? Also, calico-node should be restarted automatically once BPF mode is enabled. Can you please confirm this?
@PanKaker I am not able to reproduce this issue. Closing it. If you are able to reproduce this issue, please reopen.