kubeadm-ha
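81-add-worker: after scale-out, the existing nodes cannot reach pods on the new nodes over TCP
Bug description
After adding two nodes with 81-add-worker:
- From a new node's host, pods on the old nodes are reachable; from inside a pod on a new node, pods on the old nodes are also reachable
- From an old node's host, pods on the new nodes are unreachable; from inside a pod on an old node, pods on the new nodes are also unreachable
- From an old node's host, curl cannot reach the ingress controller on a new node
- From a new node's host, pods on that same node are reachable
- ICMP works in both directions, from hosts and from pods; only TCP is affected
- Both new nodes show the same problem
- Two clusters were scaled out, and the problem reproduces on both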
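The asymmetry described above (ICMP passes, TCP does not) can be re-checked from any node with a small probe. This is a minimal sketch and not part of kubeadm-ha: the function names probe_icmp and probe_tcp are made up for illustration, and it assumes ping, timeout, and bash are available on the node.

```shell
# Probe a target over ICMP and over TCP, printing one result line each.
# The function names are hypothetical helpers, not kubeadm-ha tooling.

probe_icmp() {
  # $1 = target IP. Sends one echo request and waits up to 2 seconds.
  if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
    echo "icmp $1 ok"
  else
    echo "icmp $1 FAIL"
  fi
}

probe_tcp() {
  # $1 = target IP, $2 = TCP port.
  # Opens a connection through bash's /dev/tcp pseudo-device; this only
  # succeeds if the three-way handshake completes within 2 seconds.
  if timeout 2 bash -c 'exec 3<>/dev/tcp/$0/$1' "$1" "$2" 2>/dev/null; then
    echo "tcp $1:$2 ok"
  else
    echo "tcp $1:$2 FAIL"
  fi
}
```

Running both probes against the same pod IP from an old node and from a new node (for example `probe_icmp 10.244.5.2; probe_tcp 10.244.5.2 80`, where the address and port are placeholders) gives a compact record of which direction and which protocol fails.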
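Initial troubleshooting results
- flannel is installed correctly on the new nodes: the configuration file is correct, the subnet was allocated, and the CNI netns and veth interfaces look normal
- The vxlan device and the cni bridge are normal, and the MTU is the same on every node
- ip route, ip neigh, the bridge fdb table, and the arp table show nothing abnormal
- tcpdump:
  - The TCP packet leaves the old node encapsulated in UDP -> flannel on the new node receives the UDP packet -> vxlan decapsulates it correctly
  - The TCP SYN packet is visible on the new node's vxlan device -> but it never reaches the cni bridge; traffic stops here and the TCP handshake cannot continue
- sysctl and iptables were compared against the old nodes; nothing abnormal found
- iptables on the new nodes shows no dropped or rejected traffic
- Most results on Google blame a missing iptables FORWARD ACCEPT rule; every node was checked and that is not the problem here
- The kubelet log on the new nodes shows nothing abnormal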
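The tcpdump trace in the list above follows the packet across three hops on the new node: the underlay NIC, the vxlan device, and the cni bridge. The sketch below only prints one capture command per hop so that they can be run in separate terminals. It rests on assumptions not stated in this report: eth0 as the underlay interface, flannel's default device names flannel.1 and cni0, and flannel's default Linux VXLAN port UDP 8472. The helper name capture_cmds is hypothetical.

```shell
# Print (do not run) one tcpdump command per hop on the new node.
# Assumed defaults: eth0 underlay NIC, flannel.1 VXLAN device,
# cni0 bridge, VXLAN encapsulation on UDP 8472.
capture_cmds() {
  pod_ip=$1             # IP of the target pod on the new node
  underlay=${2:-eth0}   # physical interface carrying the VXLAN traffic
  # Hop 1: encapsulated traffic arriving from the old node.
  echo "tcpdump -ni $underlay udp port 8472"
  # Hop 2: decapsulated TCP on the VXLAN device (the SYN is seen here).
  echo "tcpdump -ni flannel.1 tcp and host $pod_ip"
  # Hop 3: the same TCP flow on the bridge (the SYN is missing here).
  echo "tcpdump -ni cni0 tcp and host $pod_ip"
}
```

If the SYN shows up at hop 2 but not at hop 3, the packet is being dropped inside the new node's kernel between the vxlan device and the bridge, which matches the observation in this report.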
Environment (please fill in the following information):
Run the command in parentheses for each item and submit the output
- OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 3.10.0-1160.15.2.el7.x86_64 x86_64
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
- Ansible version (ansible --version):
ansible 2.7.5
config file = /home/tempuser/install/kubeadm-ha/ansible.cfg
configured module search path = ['/home/tempuser/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
executable location = /usr/local/bin/ansible
python version = 3.6.8 (default, Nov 16 2020, 16:55:22) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
- Python version (python --version):
Python 2.7.5
- Kubeadm-ha version (commit) (git rev-parse --short HEAD):
# This is an older cluster, installed with an older kubeadm-ha version; it now needs to be scaled out
$ git rev-parse --short HEAD
1fa9622
$ git log -1
commit 1fa962253cb50d55597ac041618ecc17fe6d9fc7
Author: ChongmingDu <[email protected]>
Date: Sat Jul 31 01:07:16 2021 +0800
fs.inotify values were added to sysctl
- Target kube version
# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.12", GitCommit:"e2a822d9f3c2fdb5c9bfbe64313cf9f657f0a725", GitTreeState:"clean", BuildDate:"2020-05-06T05:17:59Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.12", GitCommit:"e2a822d9f3c2fdb5c9bfbe64313cf9f657f0a725", GitTreeState:"clean", BuildDate:"2020-05-06T05:09:48Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
- Target docker and containerd versions
# docker version
Client: Docker Engine - Community
Version: 20.10.7
API version: 1.41
Go version: go1.13.15
Git commit: f0df350
Built: Wed Jun 2 11:58:10 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: b0f5bc3
Built: Wed Jun 2 11:56:35 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.6
GitCommit: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc:
Version: 1.0.0-rc95
GitCommit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- Target flannel version
# kubectl -n kube-system get ds kube-flannel-ds -o jsonpath='{range .spec.template.spec}{.containers[].image}{"\n"}{.initContainers[].image}{"\n"}{end}'
registry.aliyuncs.com/kubeadm-ha/coreos_flannel:v0.12.0
registry.aliyuncs.com/kubeadm-ha/coreos_flannel:v0.12.0
How to reproduce
Steps to reproduce:
- Starting from the existing inventory, add the two new nodes to [all], [kube-worker], and [new-worker]
- Run the deployment command:
ansible-playbook -i inventory-test.ini -e @variables.yaml 81-add-worker.yml
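- After scaling out two clusters, the same problem reproduces 100% of the time
- Error output: the scale-out completed without any errors
Additional notes
The problem is odd: I could not find out why TCP traffic arriving on the vxlan device is not forwarded to the cni bridge along the established route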
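Since the cause is still open, one low-cost step is to make the sysctl comparison from the troubleshooting list repeatable: dump `sysctl -a` on an old node and on a new node, then diff the sorted output. A minimal sketch, assuming the two dumps have already been saved to files; the helper name sysctl_diff and the file names in the usage note are hypothetical.

```shell
# Diff two saved `sysctl -a` dumps after sorting them, so that key order
# does not produce false differences.
# $1 = dump from an old node, $2 = dump from a new node.
# Exit status follows diff: 0 = identical, 1 = differences found.
sysctl_diff() {
  old_sorted=$(mktemp) && new_sorted=$(mktemp) || return 2
  sort "$1" > "$old_sorted"
  sort "$2" > "$new_sorted"
  diff "$old_sorted" "$new_sorted"
  rc=$?
  rm -f "$old_sorted" "$new_sorted"
  return $rc
}
```

Usage would look like `sysctl -a > old-node.txt` on an old node, `sysctl -a > new-node.txt` on a new node, then `sysctl_diff old-node.txt new-node.txt` wherever both files are available. Lines prefixed with `>` are values that exist only on the new node or differ there.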