
High load on the API servers on a large Kubernetes cluster using Calico CNI

Open · gavol opened this issue 2 years ago · 8 comments

We have very specific and large workloads for which the ability to bring them up and down as fast as possible is very important. We are currently measuring pod start-up throughput, and we are observing that Calico is a limiting factor in this area.

We start N pods per host in a single deployment, each just running the pause container (pre-loaded on all the hosts), with no affinities or any other rule constraining where the pods land on the worker nodes. Concretely, we run the ClusterLoader2 "density" benchmark.

We perform the tests with and without the HostNetwork flag.

So far we have scaled the cluster up to 2133 hosts; the control plane is made up of 4 masters with an external etcd instance. We do use Typha.

In order to give you an idea of the Calico impact, I report our measurements in the following table:

| Pods per node | Time to start with Calico (s) | Time to start with HostNetwork (s) |
|---------------|-------------------------------|------------------------------------|
| 1             | 45                            | 15                                 |
| 3             | 128                           | 36                                 |
| 5             | 227                           | 58                                 |

The only difference between the experiments is whether Calico or HostNetwork is used; all other conditions are the same.

Additionally, we see a huge difference in API server load when HostNetwork is not used, with the blockaffinities API call (which, I think, is a Calico call) showing up as one of the most expensive calls (without Typha that call was really killing the API servers).

In the case of 1 pod per node, the difference in start-up time is still significant, while the load on the API servers remains modest.

Let me also add that, in order to achieve those results, we increased the QPS limits of both the scheduler and the controller manager.
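For reference, one way to raise those limits with kubeadm is via extraArgs on the scheduler and controller manager, roughly along these lines (the QPS/burst values below are purely illustrative, not the exact ones we use):

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
scheduler:
  extraArgs:
    kube-api-qps: "100"    # default is 50
    kube-api-burst: "150"  # default is 100
controllerManager:
  extraArgs:
    kube-api-qps: "100"    # default is 20
    kube-api-burst: "150"  # default is 30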

Your Environment

  • Calico version: v3.22.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.21.2 (installed with kubeadm)
  • Operating System and version: Linux, CC7

gavol avatar Feb 25 '22 15:02 gavol

@gavol would you be able to provide a more detailed analysis of the API calls you're seeing? It would be really helpful to see which requests are consuming the most time, in terms of both which API endpoints and which actions (get / list / watch / create / update / delete) are most expensive.
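For example, something roughly like the following (an untested sketch) should show which Calico resources and verbs account for the most request volume on the API server:

topk(10, sum(rate(apiserver_request_total{group="crd.projectcalico.org"}[5m])) by (resource, verb))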

caseydavenport avatar Mar 02 '22 23:03 caseydavenport

@caseydavenport How can I collect that information? I usually look at the API server Grafana dashboard to see what is happening there.

Would the cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{} metric work to get the information you need?

gavol avatar Mar 03 '22 11:03 gavol

This is what I get when executing that query while the API servers are heavily loaded and Calico is in use (showing only the top calls):

(screenshot of the query results)

gavol avatar Mar 03 '22 11:03 gavol

Maybe something like this:

sort_desc(sum(rate(apiserver_request_duration_seconds_sum{group="crd.projectcalico.org"}[5m]) / rate(apiserver_request_duration_seconds_count{group="crd.projectcalico.org"}[5m])) by(resource,verb,group))

is more useful.

(screenshot of the query results)

I am sorry, but I am not very familiar with Prometheus queries.

gavol avatar Mar 03 '22 14:03 gavol

I think this is more correct (dividing the summed rates gives the average request latency per group/resource/verb, rather than summing per-series ratios):

sort_desc((sum(rate(apiserver_request_duration_seconds_sum{group="crd.projectcalico.org"}[5m])) by(group,resource,verb) / sum(rate(apiserver_request_duration_seconds_count{group="crd.projectcalico.org"}[5m])) by(group,resource,verb)))

(screenshot of the query results)

gavol avatar Mar 03 '22 14:03 gavol

We might be seeing this issue. Is there a way to change the interval at which Calico syncs from the API server, or anything similar? We're seeing logs like this in our canal pods (maybe I have the wrong component here):

2022-09-16 23:46:49.736 [INFO][74] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.3s: avg=236ms longest=2.223s (resync-nat-v4)
2022-09-16 23:46:49.770 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2022-09-16 23:46:49.988 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/felixconfigurations" error=the server has received too many requests and has asked us to try again later (get FelixConfigurations.crd.projectcalico.org)
2022-09-16 23:46:50.049 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints" error=the server has received too many requests and has asked us to try again later (get HostEndpoints.crd.projectcalico.org)
2022-09-16 23:46:50.988 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/felixconfigurations"
2022-09-16 23:46:51.049 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints"
2022-09-16 23:46:52.796 [INFO][74] felix/kubenetworkpolicy.go 101: Unable to list K8s Network Policy resources error=the server has received too many requests and has asked us to try again later (get networkpolicies.networking.k8s.io)
2022-09-16 23:46:52.796 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies" error=the server has received too many requests and has asked us to try again later (get networkpolicies.networking.k8s.io)
2022-09-16 23:46:53.151 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=the server has received too many requests and has asked us to try again later (get nodes)
2022-09-16 23:46:53.797 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies"
2022-09-16 23:46:54.151 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-09-16 23:46:54.384 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/networkpolicies" error=the server has received too many requests and has asked us to try again later (get NetworkPolicies.crd.projectcalico.org)
2022-09-16 23:46:54.511 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=the server has received too many requests and has asked us to try again later (get namespaces)

Right now our control plane is rejecting a high number of API requests because its queues are full, and we think Calico is part of that.
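For what it's worth, assuming API Priority and Fairness is enabled on the API servers, a query along these lines (a sketch, not something from our dashboards) should show where the rejections land, by priority level and reason:

sum(rate(apiserver_flowcontrol_rejected_requests_total[5m])) by (priority_level, reason)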

godber avatar Sep 17 '22 00:09 godber

@godber this sounds like it might be a slightly different issue. First questions would be: how large is your cluster, and are you using Calico's Typha component? Typha is meant to significantly reduce Calico's read load on the Kubernetes API server. From the logs you posted, I would guess that you are not.
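If you are on a manifest-based install, enabling Typha roughly means deploying the calico-typha Deployment and Service from the Calico manifests and pointing Felix at that Service via the config ConfigMap; a sketch of the relevant ConfigMap change (the ConfigMap name is calico-config for plain Calico and may differ for canal):

kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  typha_service_name: "calico-typha"   # the default of "none" disables Typha

With an operator-based (Tigera operator) install, Typha is deployed and scaled automatically.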

caseydavenport avatar Sep 20 '22 17:09 caseydavenport

We are not using Typha, thanks for pointing it out to us.

godber avatar Sep 20 '22 21:09 godber