
High load on the API servers on a large Kubernetes cluster using Calico CNI

Open · gavol opened this issue 2 years ago · 8 comments

We have very specific and large workloads for which the ability to bring them up and down as fast as possible is very important. We are currently measuring pod start-up throughput, and we are observing that Calico is a limiting factor in this area.

We start N pods per host in a single deployment, each just running the pause container (pre-loaded on all the hosts), with no affinities or any other rule constraining where the pods land on the worker nodes. Concretely, we run the ClusterLoader2 "density" benchmark.

We perform the tests with and without the HostNetwork flag.

So far we have scaled the cluster up to 2133 hosts; the control plane is made up of 4 masters with an external etcd instance. We do use Typha.

In order to give you an idea of the Calico impact, I report our measurements in the following table:

| Pods per node | Time to start with Calico (s) | Time to start with HostNetwork (s) |
|---------------|-------------------------------|------------------------------------|
| 1             | 45                            | 15                                 |
| 3             | 128                           | 36                                 |
| 5             | 227                           | 58                                 |

The only difference between the experiments is whether Calico or HostNetwork is used; all other conditions are the same.

Additionally, we see a huge difference in API server load when HostNetwork is not used, with the blockaffinities API call (which, I think, is a Calico call) showing up as one of the most expensive calls (without Typha that call was really killing the API servers).

In the case of 1 pod per node, the difference in start-up time is still significant, while the load on the API servers remains modest.

Let me also add that, in order to achieve those results, we increased the QPS limits of both the scheduler and the controller manager.
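For reference, one way to raise those limits with kubeadm is via extraArgs on the scheduler and controller manager, roughly along these lines (the QPS/burst values below are purely illustrative, not the exact ones we use):

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
scheduler:
  extraArgs:
    kube-api-qps: "100"    # default is 50
    kube-api-burst: "150"  # default is 100
controllerManager:
  extraArgs:
    kube-api-qps: "100"    # default is 20
    kube-api-burst: "150"  # default is 30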

Your Environment

  • Calico version: v3.22.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.21.2 (installed with kubeadm)
  • Operating System and version: Linux, CC7

gavol avatar Feb 25 '22 15:02 gavol

@gavol would you be able to provide a more detailed analysis of the API calls you're seeing? It would be really helpful to see which requests are consuming the most time, in terms of both which API endpoints and which actions (get / list / watch / create / update / delete) are most expensive.
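For example, something roughly like the following (an untested sketch) should show which Calico resources and verbs account for the most request volume on the API server:

topk(10, sum(rate(apiserver_request_total{group="crd.projectcalico.org"}[5m])) by (resource, verb))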

caseydavenport avatar Mar 02 '22 23:03 caseydavenport

@caseydavenport How can I collect that information? I usually look at the API server Grafana dashboard to see what is happening there.

Would the cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{} metric work to get the information you need?

gavol avatar Mar 03 '22 11:03 gavol

This is what I get when executing that query while the API servers are heavily loaded and Calico is in use (showing only the top calls):

(screenshot of the query results)

gavol avatar Mar 03 '22 11:03 gavol

Maybe something like this:

sort_desc(sum(rate(apiserver_request_duration_seconds_sum{group="crd.projectcalico.org"}[5m]) / rate(apiserver_request_duration_seconds_count{group="crd.projectcalico.org"}[5m])) by(resource,verb,group))

is more useful.

(screenshot of the query results)

I am sorry, but I am not very familiar with Prometheus queries.

gavol avatar Mar 03 '22 14:03 gavol

I think this is more correct (dividing the summed rates gives the average request latency per group/resource/verb, rather than summing per-series ratios):

sort_desc((sum(rate(apiserver_request_duration_seconds_sum{group="crd.projectcalico.org"}[5m])) by(group,resource,verb) / sum(rate(apiserver_request_duration_seconds_count{group="crd.projectcalico.org"}[5m])) by(group,resource,verb)))

(screenshot of the query results)

gavol avatar Mar 03 '22 14:03 gavol

We might be seeing this issue. Is there a way to change the interval at which Calico syncs from the API server, or anything similar? We're seeing logs like this in our canal pods (maybe I have the wrong component here):

2022-09-16 23:46:49.736 [INFO][74] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.3s: avg=236ms longest=2.223s (resync-nat-v4)
2022-09-16 23:46:49.770 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2022-09-16 23:46:49.988 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/felixconfigurations" error=the server has received too many requests and has asked us to try again later (get FelixConfigurations.crd.projectcalico.org)
2022-09-16 23:46:50.049 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints" error=the server has received too many requests and has asked us to try again later (get HostEndpoints.crd.projectcalico.org)
2022-09-16 23:46:50.988 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/felixconfigurations"
2022-09-16 23:46:51.049 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints"
2022-09-16 23:46:52.796 [INFO][74] felix/kubenetworkpolicy.go 101: Unable to list K8s Network Policy resources error=the server has received too many requests and has asked us to try again later (get networkpolicies.networking.k8s.io)
2022-09-16 23:46:52.796 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies" error=the server has received too many requests and has asked us to try again later (get networkpolicies.networking.k8s.io)
2022-09-16 23:46:53.151 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=the server has received too many requests and has asked us to try again later (get nodes)
2022-09-16 23:46:53.797 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies"
2022-09-16 23:46:54.151 [INFO][74] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-09-16 23:46:54.384 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/networkpolicies" error=the server has received too many requests and has asked us to try again later (get NetworkPolicies.crd.projectcalico.org)
2022-09-16 23:46:54.511 [INFO][74] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=the server has received too many requests and has asked us to try again later (get namespaces)

Right now our control plane is rejecting a high number of API requests because its queues are full, and we think Calico is part of that.
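For what it's worth, assuming API Priority and Fairness is enabled on the API servers, a query along these lines (a sketch, not something from our dashboards) should show where the rejections land, by priority level and reason:

sum(rate(apiserver_flowcontrol_rejected_requests_total[5m])) by (priority_level, reason)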

godber avatar Sep 17 '22 00:09 godber

@godber this sounds like it might be a slightly different issue. First questions would be: how large is your cluster, and are you using Calico's Typha component? Typha is meant to significantly reduce Calico's read load on the Kubernetes API server. From the logs you posted, I would guess that you are not.
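If you are on a manifest-based install, enabling Typha roughly means deploying the calico-typha Deployment and Service from the Calico manifests and pointing Felix at that Service via the config ConfigMap; a sketch of the relevant ConfigMap change (the ConfigMap name is calico-config for plain Calico and may differ for canal):

kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  typha_service_name: "calico-typha"   # the default of "none" disables Typha

With an operator-based (Tigera operator) install, Typha is deployed and scaled automatically.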

caseydavenport avatar Sep 20 '22 17:09 caseydavenport

We are not using Typha, thanks for pointing it out to us.

godber avatar Sep 20 '22 21:09 godber