After 81-add-worker, existing nodes cannot reach pods on the new nodes over TCP

Open wurenny opened this issue 9 months ago • 0 comments

Bug description

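After scaling out by two nodes with 81-add-worker:

  • From a new node's host, pods on old nodes are reachable; from inside a pod on a new node, pods on old nodes are reachable as well
  • From an old node's host, pods on new nodes are unreachable; from inside a pod on an old node, pods on new nodes are unreachable as well
  • From an old node's host, curl cannot reach the ingress controller on a new node
  • From a new node's host, pods on that same node are reachable
  • ICMP works in both directions, from hosts and from pods alike; only TCP fails
  • Both new nodes show the same problem
  • I scaled out two clusters, and the problem reproduces fully on both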
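
The symptom pattern above (ICMP passes, TCP does not) can be checked per pod with a small probe. This is a minimal sketch, not part of kubeadm-ha: the function names `classify` and `probe` are hypothetical, the pod IP and port are placeholders to fill in, and it assumes `ping` and `curl` are available on the node and that the target pod serves HTTP.

```shell
#!/bin/sh
# Hypothetical helper: compare ICMP and TCP reachability of one pod.

# classify ICMP_RC TCP_RC: turn two exit codes into a one-line verdict.
classify() {
    icmp_rc=$1
    tcp_rc=$2
    if [ "$icmp_rc" -eq 0 ] && [ "$tcp_rc" -eq 0 ]; then
        echo "icmp=ok tcp=ok"
    elif [ "$icmp_rc" -eq 0 ]; then
        echo "icmp=ok tcp=FAIL (the pattern described in this issue)"
    elif [ "$tcp_rc" -eq 0 ]; then
        echo "icmp=FAIL tcp=ok"
    else
        echo "icmp=FAIL tcp=FAIL"
    fi
}

# probe POD_IP PORT: run one ping and one HTTP request, then classify.
probe() {
    pod_ip=$1
    port=$2
    ping -c 1 -W 2 "$pod_ip" >/dev/null 2>&1
    icmp_rc=$?
    # curl exits 7 when it cannot connect and 28 on timeout. Other nonzero
    # codes are possible if the service does not speak HTTP, so a FAIL here
    # is only a hint for non-HTTP pods.
    curl -s -o /dev/null --connect-timeout 3 "http://$pod_ip:$port/"
    tcp_rc=$?
    echo "$pod_ip:$port $(classify "$icmp_rc" "$tcp_rc")"
}

# Usage (placeholders, run on an old node and on a new node):
#   probe <pod-ip-on-new-node> <port>
```

Running the same probe from an old node and from a new node against the same pod gives a compact record of which direction fails.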

Initial investigation results

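  • flannel is installed correctly on the new nodes: the configuration file is correct, the subnet was allocated successfully, and the CNI netns and veth interfaces are normal
  • The vxlan device and the cni bridge are normal, and the MTU is the same on every node
  • ip route, ip neigh, the bridge fdb table, and the ARP table show nothing abnormal
  • tcpdump:
    • The TCP packet leaves the old node encapsulated in UDP -> flannel on the new node receives the UDP packet normally -> vxlan decapsulates it normally
    • The TCP SYN (S) packet is visible on the new node's vxlan device -> but it is never passed on to the cni bridge; communication stops here and the TCP handshake cannot continue
  • Compared sysctl and iptables settings with the old nodes; found nothing abnormal
  • iptables on the new nodes shows no drop/reject traffic counters
  • Most results on Google blame a missing FORWARD accept rule in iptables; I checked every node and none of them has that problem
  • The kubelet logs on the new nodes show nothing abnormal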
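
Since the SYN is seen on the vxlan device but not on the cni bridge, the drop appears to happen in the kernel's forwarding step between the two. The sketch below prints the sysctls that commonly influence that step so an old and a new node can be diffed. It is an assumption-laden aid, not a diagnosis: the helper name `dump_sysctls` is hypothetical, and the interface names `flannel.1` and `cni0` are flannel's usual defaults, which this report does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: print forwarding-related sysctls as key=value lines.

# dump_sysctls [ROOT]: ROOT defaults to /proc/sys; pass another directory
# to inspect a copied tree offline.
dump_sysctls() {
    root=${1:-/proc/sys}
    for key in \
        net/ipv4/ip_forward \
        net/ipv4/conf/all/rp_filter \
        net/ipv4/conf/flannel.1/rp_filter \
        net/ipv4/conf/cni0/rp_filter \
        net/bridge/bridge-nf-call-iptables
    do
        if [ -r "$root/$key" ]; then
            echo "$key=$(cat "$root/$key")"
        else
            echo "$key=<missing>"
        fi
    done
}

# Usage on an old node and on a new node, then diff the two files:
#   dump_sysctls > /tmp/sysctl.$(hostname)
#
# Complementary checks (placeholders in <>; interface names assumed):
#   ip route get <pod-ip> from <source-ip> iif flannel.1   # routing decision for an incoming packet
#   nstat -az | grep -i ReversePathFilter                  # reverse-path filter drop counter
#   iptables-save -c | grep -i invalid                     # conntrack INVALID rules and their counters
#   ethtool -k flannel.1 | grep tx-checksum                # checksum offload state on the vxlan device
```

A difference in any of these values between an old and a new node would be a reasonable next lead; identical values would point elsewhere.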

Environment (please fill in the following information):

Run the commands in parentheses below and submit the output

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 3.10.0-1160.15.2.el7.x86_64 x86_64
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Ansible version (ansible --version):
ansible 2.7.5
  config file = /home/tempuser/install/kubeadm-ha/ansible.cfg
  configured module search path = ['/home/tempuser/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.8 (default, Nov 16 2020, 16:55:22) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
  • Python version (python --version):
Python 2.7.5
  • Kubeadm-ha version (commit) (git rev-parse --short HEAD):
# This is a fairly old cluster; the kubeadm-ha version used to install it was also fairly old, and it now needs to be scaled out
$ git rev-parse --short HEAD
1fa9622

$ git log -1
commit 1fa962253cb50d55597ac041618ecc17fe6d9fc7
Author: ChongmingDu <[email protected]>
Date:   Sat Jul 31 01:07:16 2021 +0800

    fs.inotify values were added to sysctl
  • Target kube version
# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.12", GitCommit:"e2a822d9f3c2fdb5c9bfbe64313cf9f657f0a725", GitTreeState:"clean", BuildDate:"2020-05-06T05:17:59Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.12", GitCommit:"e2a822d9f3c2fdb5c9bfbe64313cf9f657f0a725", GitTreeState:"clean", BuildDate:"2020-05-06T05:09:48Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
  • Target docker and containerd versions
# docker version
Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun  2 11:58:10 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:56:35 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • Target flannel version
# kubectl -n kube-system get ds kube-flannel-ds -o jsonpath='{range .spec.template.spec}{.containers[].image}{"\n"}{.initContainers[].image}{"\n"}{end}'
registry.aliyuncs.com/kubeadm-ha/coreos_flannel:v0.12.0
registry.aliyuncs.com/kubeadm-ha/coreos_flannel:v0.12.0

How to reproduce

Steps to reproduce:

  1. Starting from the existing inventory, add the two new nodes to [all], [kube-worker], and [new-worker]
  2. Run the deployment command, as follows
ansible-playbook -i inventory-test.ini -e @variables.yaml 81-add-worker.yml
  3. After scaling out the two clusters, the same problem reproduces 100% of the time
  4. Errors: the scale-out run reports no errors

Other notes

The problem is rather odd: I could not find why vxlan does not forward the TCP packets to the cni bridge along the established route
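
For anyone trying to reproduce the hop-by-hop observation, the capture points can be written down as a small plan. This is only a sketch: the helper `capture_plan` is hypothetical and prints the commands instead of running them, the interface names `flannel.1` and `cni0` are assumed flannel defaults, the underlay interface is a parameter because the report does not name it, and UDP port 8472 is the Linux VXLAN default that flannel normally uses.

```shell
#!/bin/sh
# Hypothetical helper: print one tcpdump command per hop on the new node.

# capture_plan UNDERLAY_IF POD_IP
capture_plan() {
    underlay=$1
    pod_ip=$2
    # Hop 1: encapsulated traffic arriving on the physical interface.
    echo "tcpdump -ni $underlay udp port 8472"
    # Hop 2: decapsulated TCP on the vxlan device (where the SYN is still seen).
    echo "tcpdump -ni flannel.1 tcp and host $pod_ip"
    # Hop 3: the cni bridge (where the SYN is reported missing).
    echo "tcpdump -ni cni0 tcp and host $pod_ip"
}

# Usage (placeholders): capture_plan <underlay-interface> <pod-ip>
```

Run each printed command in its own terminal on the new node while an old node attempts the TCP connection.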

wurenny · May 19 '24 04:05