Service proxy causing high CPU usage with ~2.7K services
Hey,
We’re running a Kubernetes cluster with ~150 nodes and ~2.7K services. Each service typically matches one pod. Service endpoints are updated quite often (e.g., due to pod restarts).
Each time a service endpoint is updated, kube-router seems to perform a full resync of services. Due to the number of services, the resync takes ~16 seconds on an m5a.large EC2 instance. While the resync is running, kube-router aggressively consumes CPU, which affects other pods running on that instance.
Is there any way to reduce the CPU usage or make service resyncs happen faster? The ultimate goal is to make sure kube-router doesn’t affect other pods on that node.
(CPU limits are not a solution, unfortunately. We don’t want to apply CPU limits to kube-router because that’d make IPVS resyncs longer, and with long resyncs, we’d start experiencing noticeable traffic issues. E.g. if a service endpoint changes, and it takes kube-router 1 minute to perform an IPVS resync, the traffic to that service will be blackholed or rejected for the full minute.)
Just for the sake of reference, our kube-router configuration is pretty standard:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    k8s-app: kube-router
    tier: node
  name: kube-router
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-router
      tier: node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kube-router
        tier: node
    spec:
      containers:
      - args:
        - --run-router=true
        - --run-firewall=true
        - --run-service-proxy=true
        - --kubeconfig=/var/lib/kube-router/kubeconfig
        - --bgp-graceful-restart
        - -v=1
        - --metrics-port=12013
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: KUBE_ROUTER_CNI_CONF_FILE
          value: /etc/cni/net.d/10-kuberouter.conflist
        image: docker.io/cloudnativelabs/kube-router:v0.4.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 20244
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1
        name: kube-router
        resources:
          requests:
            cpu: 100m
            memory: 250Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /etc/cni/net.d
          name: cni-conf-dir
        - mountPath: /var/lib/kube-router/kubeconfig
          name: kubeconfig
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /bin/sh
        - -c
        - set -e -x; if [ ! -f /etc/cni/net.d/10-kuberouter.conflist ]; then if [
          -f /etc/cni/net.d/*.conf ]; then rm -f /etc/cni/net.d/*.conf; fi; TMP=/etc/cni/net.d/.tmp-kuberouter-cfg;
          cp /etc/kube-router/cni-conf.json ${TMP}; mv ${TMP} /etc/cni/net.d/10-kuberouter.conflist;
          fi
        image: busybox
        imagePullPolicy: Always
        name: install-cni
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/cni/net.d
          name: cni-conf-dir
        - mountPath: /etc/kube-router
          name: kube-router-cfg
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-router
      serviceAccountName: kube-router
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-conf-dir
      - configMap:
          defaultMode: 420
          name: kube-router-cfg
        name: kube-router-cfg
      - hostPath:
          path: /var/lib/kube-router/kubeconfig
          type: ""
        name: kubeconfig
  templateGeneration: 25
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
I see you're running kube-router 0.4.0. Have you tried 1.0.0 or 1.0.1?
I can’t test kube-router v1 in production, but, testing locally, v1.0.1 appears to have the same behavior.
Here’s how to reproduce it locally, btw.
Steps to reproduce (with minikube)
- Start a cluster (if not running already):
  minikube start --kubernetes-version=v1.14.10
- Annotate the node with:
  kubectl annotate node minikube kube-router.io/pod-cidr=10.0.0.1/24
- Boot a pod with:
  kubectl run my-shell --image ubuntu -- bash -c 'sleep 999999999'
- Deploy ~2K services for that pod:
  kubectl apply -f https://gist.githubusercontent.com/iamakulov/695fdf0241452c77b2b58f2ecfd0ab38/raw/248ec660e361bc784d6bfe3f6472189125d530be/services-random.yml
- Delete the kube-proxy DaemonSet:
  kubectl delete ds kube-proxy -n kube-system
- Deploy kube-router:
  kubectl apply -f https://gist.githubusercontent.com/iamakulov/695fdf0241452c77b2b58f2ecfd0ab38/raw/d7d1c78e672503c3bec6bb730a3bdc2ccb4a6706/kube-router.yml
  # Note: the kube-router yaml above is modified to work with minikube.
  # It also has an 800m CPU limit applied to make it easier to repro the issue on fast devices.
- Wait for kube-router to start, stream its logs, and edit any service from the list of deployed services:
  kubectl edit svc my-9973192   # Change targetPort from 3000 to 3001
Observed behavior
The picture is similar for both kube-router v0.4.0 and v1.0.1.
IPVS resync takes ~6 seconds.
The CPU usage (tracked by docker stats) stays high during the whole duration of the IPVS update.
While I’m trying to figure out how to debug/profile kube-router (not a Go pro :), here are some high-level timestamps from inside syncIpvsServices():
The slowest functions appear to be setupClusterIPServices (2.5s out of 6.2s) and syncIpvsFirewall (3.5s out of 6.2s).
- setupClusterIPServices is slow, perhaps, simply due to algorithmic complexity? Its complexity is O(N^2), and N (in this test case) is ~2000.
- Not sure what the bottleneck in syncIpvsFirewall is yet.
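For reference, the timestamps above came from ad-hoc log points. A minimal sketch of that kind of phase timing (the helper and the phase bodies below are stand-ins, not kube-router’s actual code):

package main

import (
	"log"
	"time"
)

// timed logs how long a sync phase takes; purely illustrative.
func timed(name string, fn func()) {
	start := time.Now()
	fn()
	log.Printf("%s took %v", name, time.Since(start))
}

func main() {
	timed("setupClusterIPServices", func() { time.Sleep(50 * time.Millisecond) })
	timed("syncIpvsFirewall", func() { time.Sleep(80 * time.Millisecond) })
}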
Not sure if this helps, but here’s a 15-second CPU trace I captured through pprof which includes the IPVS sync.
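In case it helps to reproduce the profiling setup: a Go process typically exposes these endpoints via net/http/pprof. Whether a stock kube-router build serves them on its metrics port is something to double-check, so the sketch below only shows the generic pattern (the port is an arbitrary example):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Capture a 15-second execution trace with:
	//   curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=15'
	// and inspect it (including the scheduler latency profile) with:
	//   go tool trace trace.out
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}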
Looking further into this.
Here’s the scheduler latency profile generated from the above trace. Most of the time is spent in external exec() calls.
Initially, I assumed I might be misunderstanding the profile (I don’t have any Go debugging experience). However, this actually seems correct.
I’ve tried adding a bunch of log points into syncIpvsFirewall(), and the bottlenecks there are these two Refresh calls:
https://github.com/cloudnativelabs/kube-router/blob/e35dc9d61ef9507902c13f0e7d06b9ea5dc942ce/pkg/controllers/proxy/network_services_controller.go#L704-L714
Each Refresh() call invokes the ipset binary multiple times – once for each Kubernetes service. A single invocation is inexpensive, but 2K invocations add up to ~1.5s (out of ~6s total) for me. And syncIpvsFirewall() calls Refresh() twice (so 1.5s × 2 = 3s).
TBH not sure how to proceed from here. A good solution would be, perhaps, to avoid full resyncs on each change – and, instead, carefully patch existing networking rules. But I don’t know what potential drawbacks this might have, nor am I experienced enough with Go/networking to make such a change.
Wait, I might’ve found an easy win!!
ipset supports two ways of loading rules into it:
a) you can call ipset add multiple times, or
b) you can call ipset restore with a list of rules.
The list of rules looks exactly like the list of exec() commands, just without the binary name. E.g.:
create kube-router-ipvs-services- hash:ip,port family inet hashsize 1024 maxelem 65536 timeout 0
add kube-router-ipvs-services- 10.100.7.98,tcp:80 timeout 0
add kube-router-ipvs-services- 10.100.23.110,tcp:80 timeout 0
add kube-router-ipvs-services- 10.107.15.194,tcp:80 timeout 0
add kube-router-ipvs-services- 10.96.235.209,tcp:80 timeout 0
add kube-router-ipvs-services- 10.111.75.75,tcp:80 timeout 0
add kube-router-ipvs-services- 10.102.127.111,tcp:80 timeout 0
add kube-router-ipvs-services- 10.105.77.80,tcp:80 timeout 0
add kube-router-ipvs-services- 10.98.135.77,tcp:80 timeout 0
add kube-router-ipvs-services- 10.97.235.184,tcp:80 timeout 0
I measured both approaches, and with the same list of rules (but different sets, of course), ipset restore runs two orders of magnitude faster.
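A rough sketch of what batching via ipset restore could look like in Go (the set name and entries are just examples taken from the list above; this is not kube-router’s actual implementation):

package main

import (
	"bytes"
	"fmt"
	"log"
	"os/exec"
)

// buildRestoreScript renders entries in the `ipset restore` input format shown above.
func buildRestoreScript(set string, entries []string) string {
	var buf bytes.Buffer
	fmt.Fprintf(&buf, "create %s hash:ip,port family inet hashsize 1024 maxelem 65536 timeout 0\n", set)
	for _, e := range entries {
		fmt.Fprintf(&buf, "add %s %s timeout 0\n", set, e)
	}
	return buf.String()
}

func main() {
	script := buildRestoreScript("kube-router-ipvs-services-", []string{
		"10.100.7.98,tcp:80",
		"10.100.23.110,tcp:80",
	})
	// A single `ipset restore` exec loads all entries at once, instead of
	// one `ipset add` exec per service. -exist makes re-adding entries a no-op.
	cmd := exec.Command("ipset", "-exist", "restore")
	cmd.Stdin = bytes.NewBufferString(script)
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("ipset restore failed: %v: %s", err, out)
	}
}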
That seems like a reasonable approach to me. This is very similar to the approach we intend to take with iptables/nftables for the NPC in 1.2. As a matter of fact, it looks like some of the functionality already exists in pkg/utils/ipset.go: there are already Restore() and buildIPSetRestore() functions, but it doesn't look like they have been used so far, and they probably don't do quite as much as you need for your use case. Still, they give you building blocks to go off of.
Do you feel comfortable submitting a PR for this work? If so, we could probably put it in our 1.2 release, which is going to focus on performance. We're currently working on fixing bugs in 1.0 and addressing legacy Go and Go library versions for 1.1.
Ha, I was just finishing the PR for this. Here you go: https://github.com/cloudnativelabs/kube-router/pull/964
Now, a question about ipAddrAdd() (which is the second – and the last – remaining bottleneck).
In https://github.com/cloudnativelabs/kube-router/commit/725bff6b87c489bd7759bb0c9668a9b112f4c252, @murali-reddy mentioned that he’s calling ip route replace instead of netlink.RouteReplace because the latter succeeds but doesn’t actually replace the route. That commit was made two years ago.
Locally, if I replace ip route replace with netlink.RouteReplace:
- out, err := exec.Command("ip", "route", "replace", "local", ip, "dev", KubeDummyIf, "table", "local", "proto", "kernel", "scope", "host", "src",
- NodeIP.String(), "table", "local").CombinedOutput()
+ err = netlink.RouteReplace(&netlink.Route{
+ Dst: &net.IPNet{IP: net.ParseIP(ip), Mask: net.IPv4Mask(255, 255, 255, 255)},
+ LinkIndex: iface.Attrs().Index,
+ Table: 254,
+ Protocol: 0,
+ Scope: netlink.SCOPE_HOST,
+ Src: NodeIP,
+ })
the second bottleneck gets resolved. With this change, for me, setupClusterIPServices now takes ~0.3s (down from 2.5s), and the whole ipvs sync-up goes down to ~0.6s (from ~6.5s).
Questions:
- How could I verify whether the issue quoted by @murali-reddy still holds true? (I don’t know networking well enough to test this on my own.) It’s been two years, and there are other places in the code which use RouteReplace, so perhaps it’s safe to make this change now?
- What’s the correct way to write the RouteReplace call above? I’ve never worked with ip routing before. I mapped some arguments from ip route replace to netlink.Route, but I’m not sure that’s 100% correct. I also had to copy some magic numbers (like the table number and the protocol number) from the result of netlink.RouteGet() – I’m not sure how to properly map table local to Table: 254 and proto kernel to Protocol: 0.
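For reference, the kernel’s routing table and protocol IDs have symbolic names exported by golang.org/x/sys/unix (RT_TABLE_LOCAL = 255, RT_TABLE_MAIN = 254, RTPROT_KERNEL = 2), so one way to avoid the magic numbers would be something like the sketch below. The interface name and addresses are illustrative, and this is only what I had in mind, not a verified fix:

package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	// kube-dummy-if is the dummy interface kube-router assigns service VIPs to.
	link, err := netlink.LinkByName("kube-dummy-if")
	if err != nil {
		log.Fatal(err)
	}
	route := &netlink.Route{
		LinkIndex: link.Attrs().Index,
		Dst:       &net.IPNet{IP: net.ParseIP("10.96.0.10"), Mask: net.CIDRMask(32, 32)}, // example cluster IP
		Src:       net.ParseIP("192.168.99.100"),                                         // example node IP
		Scope:     netlink.SCOPE_HOST,
		Table:     unix.RT_TABLE_LOCAL, // `table local` (255)
		Protocol:  unix.RTPROT_KERNEL,  // `proto kernel` (2)
		Type:      unix.RTN_LOCAL,      // `ip route replace local ...`
	}
	if err := netlink.RouteReplace(route); err != nil {
		log.Fatal(err)
	}
}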
Okay, I think I have answers to both questions. Here’s the second PR: https://github.com/cloudnativelabs/kube-router/pull/965
Most of this was addressed via #964.
There were a few outstanding opportunities for additional improvement that weren't quite implemented in #965.
However, since that PR has been closed and the original author has moved on to other projects, I'm closing this issue as "mostly fixed".