
Enabling Calico eBPF makes a mess with kube-apiserver

Open PanKaker opened this issue 3 years ago • 13 comments

Hi! kube-apiserver (kubectl) starts to freeze (respond slowly) after enabling the eBPF dataplane. Any thoughts?

Expected Behavior

kubectl works as expected

Current Behavior

After enabling eBPF mode (operator install), kubectl takes ~5-10 seconds to respond to each command.

Possible Solution

  • Restart the kube-apiserver container
  • Disable Calico eBPF mode
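
For reference, a minimal sketch of both workarounds, assuming default kubeadm container naming (the k8s_kube-apiserver name filter is an assumption) and the operator's documented linuxDataplane field:

# Workaround 1: kill the kube-apiserver container; kubelet recreates it
docker kill $(docker ps --filter name=k8s_kube-apiserver --quiet)

# Workaround 2: switch the dataplane back from BPF to iptables via the operator
kubectl patch installation.operator.tigera.io default --type merge -p '{"spec":{"calicoNetwork":{"linuxDataplane":"Iptables"}}}'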

Steps to Reproduce (for bugs)

  1. Install Kubernetes with kubeadm
  2. Install Calico in operator mode
  3. Add the kubernetes-services-endpoint ConfigMap
  4. Enable eBPF for Calico (see the sketch after this list)
  5. Wait 1-5 minutes
  6. kubectl get nodes
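
For reference, a rough sketch of steps 3-4 as commands (the exact manifests and patch commands are given later in this thread):

# Step 3: create the kubernetes-services-endpoint ConfigMap in the tigera-operator namespace
kubectl create -f tigera-operator-configMap.yaml

# Step 4: move kube-proxy off the nodes, then switch the dataplane to BPF
kubectl patch ds -n kube-system kube-proxy -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-calico": "true"}}}}}'
kubectl patch installation.operator.tigera.io default --type merge -p '{"spec":{"calicoNetwork":{"linuxDataplane":"BPF", "hostPorts":null}}}'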

Context

After enabling the Calico eBPF dataplane, I tried to create pods and noticed that kubectl started to work really slowly (each request takes around 5-10 seconds). I don't know what causes the issue, but if you manually kill the kube-apiserver container (I'm killing it with docker) it starts to work properly.

Your Environment

  • Calico version: 3.21.4/3.23.1 (operator mode installation). Encapsulation: None (nodes in the same network)
  • Orchestrator version: Kubernetes 1.22.4/1.22.9/1.22.10 (Docker 20.10.11/20.10.16)
  • Operating System and version: Ubuntu 20.04/22.04 LTS

Logs from kubectl

command: kubectl get nodes

I0601 11:13:41.057935  319065 round_trippers.go:435] curl -v -XGET  -H "User-Agent: kubectl/v1.22.9 (linux/amd64) kubernetes/6df4433" -H "Accept: application/json, */*" 'https://10.101.3.50:6443/apis/projectcalico.org/v3?timeout=32s'
I0601 11:13:41.057937  319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/storage.k8s.io/v1beta1/serverresources.json
I0601 11:13:41.057957  319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/policy/v1beta1/serverresources.json
I0601 11:13:41.057972  319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/node.k8s.io/v1/serverresources.json
I0601 11:13:41.057977  319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/scheduling.k8s.io/v1/serverresources.json
I0601 11:13:41.057978  319065 cached_discovery.go:71] returning cached discovery info from /home/ubuntu/.kube/cache/discovery/10.101.3.50_6443/v1/serverresources.json
I0601 11:13:46.059599  319065 round_trippers.go:454] GET https://10.101.3.50:6443/apis/projectcalico.org/v3?timeout=32s 503 Service Unavailable in 5001 milliseconds
I0601 11:13:46.059618  319065 round_trippers.go:460] Response Headers:
I0601 11:13:46.059626  319065 round_trippers.go:463]     Content-Length: 57
I0601 11:13:46.059631  319065 round_trippers.go:463]     Date: Wed, 01 Jun 2022 11:13:46 GMT
I0601 11:13:46.059636  319065 round_trippers.go:463]     Audit-Id: d7a9b592-3ddb-4430-a0c4-0b24cfffc8a1
I0601 11:13:46.059641  319065 round_trippers.go:463]     Cache-Control: no-cache, private
I0601 11:13:46.059646  319065 round_trippers.go:463]     Content-Type: text/plain; charset=utf-8
I0601 11:13:46.059650  319065 round_trippers.go:463]     X-Content-Type-Options: nosniff
I0601 11:13:46.059655  319065 round_trippers.go:463]     X-Kubernetes-Pf-Flowschema-Uid: 34c8a48b-8612-464e-a5d7-55d361044238
I0601 11:13:46.059660  319065 round_trippers.go:463]     X-Kubernetes-Pf-Prioritylevel-Uid: d61ab1f5-fcf1-4006-b7a0-1527b6e13d93
I0601 11:13:46.060338  319065 request.go:1181] Response Body: error trying to reach service: context deadline exceeded
I0601 11:13:46.060960  319065 request.go:1372] body was not decodable (unable to check for Status): Object 'Kind' is missing in 'error trying to reach service: context deadline exceeded
'
I0601 11:13:46.060976  319065 cached_discovery.go:78] skipped caching discovery info due to the server is currently unable to handle the request
I0601 11:13:46.062637  319065 round_trippers.go:435] curl -v -XGET  -H "User-Agent: kubectl/v1.22.9 (linux/amd64) kubernetes/6df4433" -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" 'https://10.101.3.50:6443/api/v1/nodes?limit=500'
I0601 11:13:46.064113  319065 round_trippers.go:454] GET https://10.101.3.50:6443/api/v1/nodes?limit=500 200 OK in 1 milliseconds

PanKaker avatar Jun 01 '22 09:06 PanKaker

Okay. So, from the kubectl logs I can see that, for some reason:

  1. kubectl makes a request to https://10.101.3.50:6443/apis/projectcalico.org/v3?timeout=32s
  2. After 5 seconds the deadline is reached (Service Unavailable in 5001 milliseconds)
  3. kubectl retries the request to https://10.101.3.50:6443/apis/projectcalico.org/v3?timeout=32s
  4. kubectl then reaches GET https://10.101.3.50:6443/api/v1/nodes?limit=500 (200 OK in 1 milliseconds) and gets the result from the api-server

That explains why I see a ~10-second delay before the result.

But why kubectl tries to reach /apis/projectcalico.org/v3?timeout=32s and why that fails is the mystery.
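
For context, /apis/projectcalico.org/v3 is served by the aggregated Calico API server (installed by the APIServer resource), so kubectl's discovery phase queries it alongside the built-in API groups; the 503 with "error trying to reach service" suggests kube-apiserver could not reach the backing Service in time. A diagnostic sketch, assuming a default operator install:

kubectl get apiservice v3.projectcalico.org    # Available condition should be True
kubectl get pods,svc -n calico-apiserver       # backing pods and Service for the aggregated API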

PanKaker avatar Jun 01 '22 11:06 PanKaker

Hi! Do you have any update?

PanKaker avatar Jun 13 '22 09:06 PanKaker

@tomastigera @fasaxc might be able to provide some insight.

I don't believe that we have seen this behavior in our own testing.

caseydavenport avatar Jun 13 '22 23:06 caseydavenport

You say the API server is in a container? Can you explain exactly how you set up the API server with kubeadm? Did you create a container and then run kubeadm inside, or is creating the container something kubeadm did?

I'm wondering if you're trying to simulate a k8s cluster in Docker or something like that (in which case there are gotchas with BPF mode; it can't be fully containerised).

fasaxc avatar Jun 17 '22 09:06 fasaxc

Hi! Yes, the API server is in a container. I didn't set up the server manually; it was part of the kubeadm init task.

PanKaker avatar Jun 20 '22 07:06 PanKaker

By 'Restart kube-apiserver container' I meant this command:

docker ps | grep "k8s_kube-apiserver" | docker kill $(awk '{print $1}')
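
For reference, an equivalent and more conventional form of the same command (a sketch relying only on standard docker ps flags):

# filter by the kubeadm-style container name and print only the container ID
docker kill $(docker ps --filter name=k8s_kube-apiserver --quiet)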

PanKaker avatar Jun 20 '22 07:06 PanKaker

(pinging @tomastigera @fasaxc)

lmm avatar Jun 28 '22 16:06 lmm

Kubernetes was installed with pod network 10.233.64.0/18.

Maybe this will be useful. This is my custom-resources file:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    nodeAddressAutodetectionV4:
      cidrs:
        - 10.101.3.0/24
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 10.233.64.0/18
      encapsulation: None
      natOutgoing: Disabled
      nodeSelector: all()

---
# This section configures the Calico API server.
# For more information, see: https://projectcalico.docs.tigera.io/v3.23/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
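
After applying these resources, the rollout can be verified through the operator's status CRs (a sketch; tigerastatus is the operator's standard status resource):

kubectl get tigerastatus    # apiserver and calico should report AVAILABLE=True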

PanKaker avatar Jun 29 '22 08:06 PanKaker

UPD: I was able to reproduce the same issue on 2 clusters (2-node setup: 1 control-plane and 1 worker node)

PanKaker avatar Jun 30 '22 08:06 PanKaker

Could you please describe the exact steps for setting up such a cluster? That would help us repro and diagnose it. Thanks

tomastigera avatar Jul 26 '22 16:07 tomastigera

Tested with a 2-node cluster.

Each host has 2 NICs (1 SSH interface, 1 node interface on 10.101.3.0/24)

  1. Create a kubeadm cluster with config: kubeadm init --config ./config-init.yaml
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.101.3.50
  bindPort: 6443
nodeRegistration:
  taints: null
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: 1.22.9
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.233.0.0/18
  podSubnet: 10.233.64.0/18
scheduler: {}
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
reservedSystemCPUs: 0-1
  2. Join the second node using the output of kubeadm init

  3. Install Calico with the tigera-operator:

kubectl create -f https://projectcalico.docs.tigera.io/manifests/tigera-operator.yaml

kubectl create -f custom-resources.yaml, where custom-resources.yaml is:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    nodeAddressAutodetectionV4:
      cidrs:
        - 10.101.3.0/24
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 10.233.64.0/18
      encapsulation: None
      natOutgoing: Disabled
      nodeSelector: all()


---
# This section configures the Calico API server.
# For more information, see: https://projectcalico.docs.tigera.io/v3.23/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}

kubectl create -f tigera-operator-configMap.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: tigera-operator
data:
  KUBERNETES_SERVICE_HOST: "10.101.3.50"
  KUBERNETES_SERVICE_PORT: "6443"

  4. Enable eBPF mode (a verification sketch follows after these steps):
kubectl patch ds -n kube-system kube-proxy -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-calico": "true"}}}}}'
kubectl patch installation.operator.tigera.io default --type merge -p '{"spec":{"calicoNetwork":{"linuxDataplane":"BPF", "hostPorts":null}}}'
  5. Run some network workload for 5-15 minutes


  6. kubectl get pods/nodes (output here: https://github.com/projectcalico/calico/issues/6168#issuecomment-1143500246)
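
To confirm the dataplane switch in step 4 took effect, something like the following can be checked (a sketch; the operator is expected to set bpfEnabled in the default FelixConfiguration when linuxDataplane is BPF):

kubectl get felixconfiguration default -o yaml | grep -i bpf                # expect bpfEnabled: true
kubectl logs -n calico-system ds/calico-node -c calico-node | grep -i bpf   # felix should log the BPF dataplane starting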

PanKaker avatar Jul 27 '22 07:07 PanKaker

I can try to reproduce the issue and send more logs from kubernetes/calico/etcd

PanKaker avatar Jul 27 '22 12:07 PanKaker

@PanKaker Once the kubernetes-services-endpoint ConfigMap is created, does it trigger a reload of calico-node? Also, calico-node should be reloaded automatically once BPF mode is enabled. Can you please confirm this?
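
For reference, one way to confirm that rollout (a sketch; namespace and labels per a default operator install):

kubectl rollout status ds/calico-node -n calico-system      # waits for the restarted DaemonSet
kubectl get pods -n calico-system -l k8s-app=calico-node    # AGE resets if pods were recreated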

sridhartigera avatar Aug 16 '22 00:08 sridhartigera

@PanKaker I am not able to reproduce this issue. Closing it. If you are able to reproduce this issue, please reopen.

sridhartigera avatar Oct 05 '22 23:10 sridhartigera