weave-net CNI: Scale to greater than 100 nodes and cluster becomes unusable
What you expected to happen?
I'm expecting nothing unusual to happen: scaling my cluster to 100+ nodes should work just as it did with fewer than 100 nodes.
What happened?
Traffic stops routing. Nodes become difficult to communicate with. Traffic can't get into the ingress. Pods get network timeout errors, etc.
If we restart the weave DaemonSet, that fixes things.
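For reference, the restart we do is just a rolling restart of the DaemonSet, something along these lines (assuming the stock weave-net DaemonSet in kube-system; names may differ in your manifest):

# Restart all weave-net pods and wait for them to come back up.
kubectl -n kube-system rollout restart daemonset/weave-net
kubectl -n kube-system rollout status daemonset/weave-net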
How to reproduce it?
Install weave-net 2.8.1 (latest). Scale some of our pods to around 250 replicas, which causes cluster-autoscaler to scale out to roughly 130 EC2 instances.
After the nodes join, the cluster becomes unstable and traffic no longer makes it to the ingress controllers. cluster-autoscaler can no longer communicate with the AWS API, coredns is not always reachable, etc.
Anything else we need to know?
AWS. EKS. K8s 1.21. We first noticed this with weave-net 2.7.x. Upgraded to 2.8.1 and the same issue persists.
I'm passing the following to weave-kube:
- name: WEAVE_MTU
value: "8916"
- name: CONN_LIMIT
value: "200"
Neither seems to have any effect, with or without.
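For completeness, something like the following can apply these without hand-editing the manifest (assuming the stock weave-net DaemonSet in kube-system with the main container named weave; adjust to your setup):

# Set or override the env vars on the weave container of the DaemonSet.
kubectl -n kube-system set env daemonset/weave-net -c weave \
  WEAVE_MTU=8916 CONN_LIMIT=200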
Logs:
Logs are fairly quiet. I started setting WEAVE_MTU because I was seeing log messages about the effective MTU being something absurd like 500 bytes. With or without it, there seems to be no effect.
The only other thing is that nodes are unreachable:
INFO: 2022/05/09 17:04:06.583179 ->[10.215.127.117:6783] attempting connection
INFO: 2022/05/09 17:04:06.583850 ->[10.215.127.117:6783] error during connection attempt: dial tcp :0->10.215.127.117:6783: connect: connection refused
INFO: 2022/05/09 17:04:51.359452 ->[10.215.127.62:6783|c2:6f:8b:53:77:38(ip-10-215-127-62.us-west-2.compute.internal)]: connection shutting down due to error: read tcp 10.215.127.110:33613->10.215.127.62:6783: read: connection reset by peer
INFO: 2022/05/09 17:04:51.359528 ->[10.215.127.62:6783|c2:6f:8b:53:77:38(ip-10-215-127-62.us-west-2.compute.internal)]: connection deleted
INFO: 2022/05/09 17:04:51.362158 ->[10.215.127.62:6783] attempting connection
INFO: 2022/05/09 17:04:51.365679 ->[10.215.127.62:6783] connection shutting down due to error during handshake: failed to receive remote protocol header: read tcp 10.215.127.110:34009->10.215.127.62:6783: read: connection reset by peer
DEBU: 2022/05/09 17:04:51.695784 EVENT UpdatePod {"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"2022-05-06T10:10:20-07:00","kubernetes.io/psp":"eks.privileged"},"creationTimestamp":"2022-05-09T16:31:20Z","deletionGracePeriodSeconds":30,"deletionTimestamp":"2022-05-09T17:04:51Z","generateName":"weave-net-","labels":{"......
We had a similar issue and were able to fix it with your hint about CONN_LIMIT. We had around 100 nodes as well and experienced several outages before, which we fixed by restarting the weave-net pods. What we noticed was really high CPU usage in the weave-net pods, which went away after we increased CONN_LIMIT to 200. Also, we are not setting CPU or memory limits, which is also the official recommendation. We did not set WEAVE_MTU.
So thanks for your issue; maybe you have to increase the limit beyond 200 for this node count, but I am not an expert, I just want to share our experience.
I think it was just a coincidence that our network started working again. https://github.com/weaveworks/weave/blob/34de0b10a69c2fa11f2314bfe0e449f739a96cd8/prog/weave-kube/launch.sh#L44 The default CONN_LIMIT is 200, so I think it has to be increased beyond 200 to take effect.
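For what it's worth, the usual shell idiom behind a default like that looks roughly like this (illustrative only, not a verbatim quote of launch.sh):

# If CONN_LIMIT is unset or empty in the environment, fall back to 200;
# otherwise the value passed to the container wins.
CONN_LIMIT=${CONN_LIMIT:-200}
echo "effective CONN_LIMIT: $CONN_LIMIT"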
oh snap. @derbauer97 thanks for that. Running a test right now at 300. So far so good.
By the way, one of our contributing symptoms: among the workload pods we scale up, there's an OOM issue on one of the nodes. There seems to be some correlation between that and the weave-net pods getting into a funk.
I'm assuming that once the node goes OOM, weave-net on that node is down, which should be fine. But it feels like that then triggers some race condition among the other weave-net pods in the cluster.
We increased CONN_LIMIT to 400 and also added the MTU setting. So far we have no issues, and weave status connections shows the new MTU with fastdp. We will see if this fixes the issue.
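If anyone wants to check the same thing, something like this shows the negotiated mode and MTU per peer (pod name is a placeholder; container name and namespace assume the stock manifest):

# Inspect per-peer connection state (fastdp vs sleeve) from inside a weave pod.
kubectl -n kube-system exec <weave-net-pod> -c weave -- ./weave --local status connections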
I remember something similar: if one member of the weave-net "cluster" is not reachable, in certain circumstances the whole network crashes, but I did not have time to check what happened there. I think it is time to switch to another CNI which is still supported ^^
Yeah, with 300 CONN_LIMIT we still crashed. I guess we can try to move to 400 but I agree with you that it's time to look at another CNI as Weave-Net isn't scaling to our needs.
Same with 400. I noticed that all peer connections fall back to sleeve when the network crashes. I think we will increase the node size as a workaround and start migrating to another CNI. After terminating, all connections are established with fastdp again.
Yeah, we tried bigger instances as well. We were seeing network saturation on the interfaces when this issue happens and thought bigger instances with better throughput would help...it didn't :)
We're moving to Cilium now.
You're going the wrong direction with CONN_LIMIT! The number of connections Weave makes is the lesser of CONN_LIMIT and the number of nodes, so increasing CONN_LIMIT has no effect once it exceeds the number of nodes. Try setting CONN_LIMIT to a smaller value, like 100, or even 75.
Also, read the documentation on Allocating IP Addresses carefully. The consensus process is resource-intensive (both CPU and network) and can go badly when adding a large number of nodes to an existing cluster. If the IPAM database (/var/lib/weave/weave-netdata.db) is not present or the IPAM service is not "ready", it can take a relatively long time (15-20 minutes) for the consensus process to complete, during which CPU and network activity is very high. Even with IPAM "ready" on all nodes, the number of connections weaver tries to establish varies as the square of min(CONN_LIMIT, number of nodes), and just the process of setting up those connections is resource-intensive during startup. Depending on your cluster's network and node CPU capacity, this can drive connections into "sleeve" mode, which further exacerbates the problem. Once in this state, the cluster will remain there indefinitely. We call this a "weave storm".
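To put rough numbers on that for a cluster around the size discussed here (illustrative arithmetic only):

# Back-of-the-envelope: cluster-wide connection count grows roughly as
# min(CONN_LIMIT, node_count)^2, so raising CONN_LIMIT past the node count
# changes nothing, while lowering it shrinks the startup load considerably.
nodes=130
for conn_limit in 75 100 200 400; do
  effective=$(( conn_limit < nodes ? conn_limit : nodes ))
  echo "CONN_LIMIT=$conn_limit -> ~$(( effective * effective )) connections cluster-wide"
done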
Reducing CONN_LIMIT reduces the network and CPU load during startup and can sometimes allow the consensus process to complete, if it hasn't already, and allow the cluster to come up when it otherwise would not. We have a 180-node cluster on which we must deploy weave in two steps:
- Deploy the weave DaemonSet with CONN_LIMIT: 75, and wait until the cluster stabilizes. If it's not rebuilding its IPAM database (consensus), this typically happens fairly quickly. Exec into pretty much any weave container on any node and run
  ./weave --local status
  to check IPAM status, and
  ./weave --local status connections | grep sleeve | wc -l
  to count the number of sleeve connections.
- Update CONN_LIMIT to 200 as a rolling update with maxUnavailable no greater than 3 so that the change is rolled out gradually (sketched below). This avoids triggering a "weave storm", and having all-to-all connectivity provides some performance advantages.
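A rough sketch of step two, assuming the stock weave-net DaemonSet in kube-system with the main container named weave (adjust names to your manifest):

# Cap how many nodes can be updated at once, then roll out the new CONN_LIMIT.
kubectl -n kube-system patch daemonset weave-net --type=strategic \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"maxUnavailable":3}}}}'
kubectl -n kube-system set env daemonset/weave-net -c weave CONN_LIMIT=200
# Watch the rollout proceed a few nodes at a time.
kubectl -n kube-system rollout status daemonset/weave-net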
FWIW, the above strategy is a temporary mitigation. We will be moving to a different CNI as well, but that takes time.