Weave cannot complete TCP connections on CentOS 8
What you expected to happen?
I expected the Weave network to work on CentOS 8.
What happened?
I added a Kubernetes minion node running CentOS 8 to a cluster and removed firewalld and nftables to rule out their influence. I found that a pod on this node can ping pods on other nodes (ICMP works fine). However, TCP connections cannot be completed between the two pods.
How to reproduce it?
The Kubernetes cluster CRI is containerd 1.4.3.
Case 1
- Deploy a DaemonSet of netcat on the Kubernetes cluster.
- Use the pod deployed on the CentOS 8 node as the nc client.
- Use the pod deployed on the other node as the nc server.
- Capture the packets on the weave bridge of the two nodes: k8s01-02-weave-netcat-case1.zip
Case 2
- Deploy a DaemonSet of netcat on the Kubernetes cluster.
- Use the pod deployed on the CentOS 8 node as the nc server.
- Use the pod deployed on the other node as the nc client.
- Capture the packets on the weave bridge of the two nodes: k8s01-02-weave-netcat-case2.zip
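The two test cases above can be sketched with kubectl exec against the netcat pods. Note that the pod names and the port number here are hypothetical placeholders; substitute the actual pod names from your DaemonSet (e.g. via `kubectl get pods -o wide`).

```shell
# Find the DaemonSet pods and the node each one landed on
# (pod names below are hypothetical; use the ones listed here):
kubectl get pods -o wide -l app=netcat

# In one terminal: start a listener in the server-side pod
# (port 4444 is an arbitrary choice):
kubectl exec -it netcat-server-pod -- nc -l -p 4444

# In another terminal: connect from the client-side pod to the
# server pod's IP; with the bug present, ICMP works but this
# TCP connection never completes:
kubectl exec -it netcat-client-pod -- nc <server-pod-ip> 4444
```

Swapping which node hosts the client and which hosts the server reproduces Case 1 versus Case 2.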
Anything else we need to know?
The kubernetes cluster environment info:
Node | OS | Kubernetes Version | CRI Version
---|---|---|---
Master | Ubuntu 18.04, 4.15.0-134-generic x86_64 | v1.20.1 | containerd v1.4.3
Minion 01 | CentOS 8.3.2011, 4.18.0-240.1.1.el8_3.x86_64 | v1.20.1 | containerd v1.4.3
Minion 02 | CentOS 7.9.2009, 5.9.1-1.el7.elrepo.x86_64 | v1.20.1 | containerd v1.4.3
Versions:
$ weave version
weave script 2.8.0
$ docker version
Client: Docker Engine - Community
Version: 20.10.2
API version: 1.41
Go version: go1.13.15
Git commit: 2291f61
Built: Mon Dec 28 16:17:40 2020
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.2
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 8891c58
Built: Mon Dec 28 16:15:09 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0
$ uname -a
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Logs:
$ docker logs weave
or, if using Kubernetes:
$ kubectl logs -n kube-system <weave-net-pod> weave
see attachment above.
Network:
$ ip route
default via 192.168.88.1 dev ens192 proto static metric 100
10.32.0.0/12 dev weave proto kernel scope link src 10.32.0.3
172.16.39.0/24 via 192.168.88.2 dev ens192 proto static metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.80.0/20 dev ens192 proto kernel scope link src 192.168.88.242 metric 100
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown
$ ip -4 -o addr
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: ens192 inet 192.168.88.242/20 brd 192.168.95.255 scope global noprefixroute ens192\ valid_lft forever preferred_lft forever
3: virbr0 inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0\ valid_lft forever preferred_lft forever
5: docker0 inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\ valid_lft forever preferred_lft forever
8: weave inet 10.32.0.3/12 brd 10.47.255.255 scope global weave\ valid_lft forever preferred_lft forever
$ sudo iptables-save
see attachment above.
It seems that the connection loses ACKs when sending to or receiving from the node. Not all ACKs are lost, just some of them, and the retransmissions of those ACKs are also lost.
To investigate further, I suspect the issue may be caused by CentOS 8 itself: I installed Calico and the same problem happened. The only other possibility I can think of is the sysctl.conf:
vm.max_map_count=655350
vm.swappiness=0
vm.min_free_kbytes=65535
fs.file-max=655360
net.core.somaxconn=65500
net.core.netdev_max_backlog=262144
net.ipv4.tcp_max_orphans=262144
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_max_syn_backlog=262144
net.ipv4.tcp_syncookies=0
net.ipv4.tcp_tw_reuse=0
net.ipv4.ip_forward=1
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0
net.bridge.bridge-nf-call-ip6tables=1
net.bridge.bridge-nf-call-iptables=1
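To rule the sysctl settings in or out, the running kernel values can be compared against the file. A minimal check of the networking-relevant keys (assuming the br_netfilter module is loaded, otherwise the net.bridge keys will not exist):

```shell
# Print the live values of the keys most likely to affect pod traffic;
# any mismatch with /etc/sysctl.conf means the file was not applied:
sysctl net.ipv4.ip_forward \
       net.ipv4.conf.all.rp_filter \
       net.ipv4.conf.default.rp_filter \
       net.bridge.bridge-nf-call-iptables \
       net.ipv4.tcp_timestamps
```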
If it cannot be reproduced on other CentOS 8 systems, this issue should be closed.
Thanks for the detailed report. It's hard to think what could cause some ACK packets to be dropped and not others. Maybe some interaction with NIC checksum offload? (I haven't looked at the packet dumps.)
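The checksum-offload suspicion can be checked from the node. A minimal inspection, assuming the uplink interface is ens192 (as in the `ip route` output above; adjust for your NIC):

```shell
# List the offload features related to checksumming and UDP tunnel
# segmentation; fastdp encapsulates traffic in VXLAN (UDP), so the
# tx-udp_tnl-* features are the ones most likely to interfere:
ethtool -k ens192 | grep -E 'checksum|udp_tnl'
```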
I have the same problem with fastdp on CentOS 8; sleeve mode works. #FIX
ethtool -i ens192
driver: vmxnet3
version: 1.5.0.0-k-NAPI
firmware-version:
expansion-rom-version:
bus-info: 0000:0b:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
These commands fix the problem:
ethtool -K ens192 tx-udp_tnl-csum-segmentation off
ethtool -K ens192 tx-udp_tnl-segmentation off
nmcli connection modify ens192 ethtool.feature-tx-udp_tnl-segmentation off
nmcli connection modify ens192 ethtool.feature-tx-udp_tnl-csum-segmentation off
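After applying the fix, it can be verified that both tunnel-offload features stayed off (the ethtool commands take effect immediately; the nmcli commands make the setting persist across NetworkManager reactivations). A quick check, again assuming interface ens192:

```shell
# Both lines should now report "off":
ethtool -k ens192 | grep tx-udp_tnl

# Confirm the persistent NetworkManager settings were recorded:
nmcli -f ethtool connection show ens192
```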
Thanks, @borg-z! This solution helped me solve the same issue.