Abnormal removal of etcd member during reset when the node has multiple network interfaces

Open yxxhero opened this issue 1 year ago • 8 comments

What keywords did you search in kubeadm issues before filing this one?

In the case of multiple network cards, abnormal removal of etcd member during reset.

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Container runtime (CRI) (e.g. containerd, cri-o):
  • Container networking plugin (CNI) (e.g. Calico, Cilium):
  • Others:

What happened?

On a node with multiple network cards, the etcd member is not removed correctly during kubeadm reset.

What you expected to happen?

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

yxxhero avatar Feb 02 '24 13:02 yxxhero

On a node with multiple network cards, the etcd member is not removed correctly during kubeadm reset.

can you explain what happens exactly? generally we don't want to add APIEndpoint or other init/join control plane options to ResetConfiguration.

neolit123 avatar Feb 02 '24 13:02 neolit123

also please see this section of the docs: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#network-setup

k8s doesn't support multiple network interfaces well.

neolit123 avatar Feb 02 '24 13:02 neolit123

@neolit123 thanks. let me explain. https://github.com/yxxhero/kubernetes/blob/cd5e65709e4874af87305714c7866ab59c8f9c3a/cmd/kubeadm/app/phases/etcd/local.go#L116-L127

see the above code

	etcdPeerAddress := etcdutil.GetPeerURL(&cfg.LocalAPIEndpoint)

	klog.V(2).Infof("[etcd] get the member id from peer: %s", etcdPeerAddress)
	id, err := etcdClient.GetMemberID(etcdPeerAddress)
	if err != nil {
		if errors.Is(etcdutil.ErrNoMemberIDForPeerURL, err) {
			klog.V(5).Infof("[etcd] member was already removed, because no member id exists for peer %s", etcdPeerAddress)
			return nil
		}
		return err
	}

When a node is added with kubeadm join --control-plane --apiserver-advertise-address <some_ip> and later reset with kubeadm reset, the etcd member cannot be removed: reset uses the node's default IP address, which differs from the one passed to --apiserver-advertise-address. As a result, the etcd member has to be removed manually.

yxxhero avatar Feb 02 '24 15:02 yxxhero

Or maybe we can get the right ip from the etcd.yaml? @neolit123
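
A rough sketch of that idea, assuming the etcd static pod manifest lists an --initial-advertise-peer-urls flag in the container command. The helper peerURLFromManifest and the sample manifest text are hypothetical, and a real implementation would parse the YAML properly instead of scanning lines.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// peerFlag is the etcd flag that carries the advertised peer URL.
const peerFlag = "--initial-advertise-peer-urls="

// peerURLFromManifest scans manifest text line by line and returns the
// value of the peer URL flag, if present. Hypothetical helper.
func peerURLFromManifest(manifest string) (string, bool) {
	scanner := bufio.NewScanner(strings.NewReader(manifest))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		line = strings.TrimPrefix(line, "- ") // drop the YAML list marker
		if strings.HasPrefix(line, peerFlag) {
			return strings.TrimPrefix(line, peerFlag), true
		}
	}
	return "", false
}

func main() {
	// Sample manifest fragment with a placeholder address.
	manifest := `
spec:
  containers:
  - command:
    - etcd
    - --initial-advertise-peer-urls=https://10.0.1.5:2380
    - --name=node-1
`
	url, ok := peerURLFromManifest(manifest)
	fmt.Println(url, ok) // prints https://10.0.1.5:2380 true
}
```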

yxxhero avatar Feb 02 '24 15:02 yxxhero

When a node is added with kubeadm join --control-plane --apiserver-advertise-address <some_ip> and later reset with kubeadm reset, the etcd member cannot be removed: reset uses the node's default IP address, which differs from the one passed to --apiserver-advertise-address. As a result, the etcd member has to be removed manually.

but kubeadm annotates the etcd and apiserver pods: https://kubernetes.io/docs/reference/labels-annotations-taints/#kubeadm-kubernetes-io-kube-apiserver-advertise-address-endpoint

etcdPeerAddress := etcdutil.GetPeerURL(&cfg.LocalAPIEndpoint)

IIRC, this cfg here is constructed with data from the node. it's an InitConfiguration, but during reset it gets data from various places, and the LocalAPIEndpoint should be the one from the kube-apiserver pod.

so, instead of adding the field in resetconfiguration we need to investigate what's not working in the existing code, IMO.
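
For anyone investigating, here is a minimal sketch of reading the advertise address from the annotation linked above. The annotation key is the documented one; the helper endpointFromAnnotations and the sample value are hypothetical, and this is not how kubeadm itself is implemented.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// Annotation key documented for kubeadm-managed kube-apiserver pods.
const advertiseAnnotation = "kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint"

// endpointFromAnnotations is a hypothetical helper that extracts the
// advertise address and port from a pod's annotations.
func endpointFromAnnotations(annotations map[string]string) (string, int, error) {
	raw, ok := annotations[advertiseAnnotation]
	if !ok {
		return "", 0, fmt.Errorf("annotation %q not found", advertiseAnnotation)
	}
	host, portStr, err := net.SplitHostPort(raw)
	if err != nil {
		return "", 0, fmt.Errorf("parsing %q: %w", raw, err)
	}
	port, err := strconv.Atoi(portStr)
	if err != nil {
		return "", 0, fmt.Errorf("parsing port %q: %w", portStr, err)
	}
	return host, port, nil
}

func main() {
	// Placeholder annotation value for illustration.
	annotations := map[string]string{advertiseAnnotation: "10.0.1.5:6443"}
	host, port, err := endpointFromAnnotations(annotations)
	fmt.Println(host, port, err) // prints 10.0.1.5 6443 <nil>
}
```

Comparing this host with the address in the peer URL that reset computes would show whether the two sources disagree.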

neolit123 avatar Feb 02 '24 15:02 neolit123

@neolit123 yeah. I will do as you advise.

yxxhero avatar Feb 02 '24 15:02 yxxhero

@neolit123 yeah. I will do as you advise.

there might be a bug in our code. let's take our time to understand the problem better, because etcd management is a sensitive area of kubeadm.

EDIT: also please show example IPs, what is expected, what is in the annotations and what the code gives you. perhaps you want to add some fmt.Printf... in the logic.

neolit123 avatar Feb 02 '24 15:02 neolit123

@neolit123 please review my new idea.

yxxhero avatar Feb 03 '24 05:02 yxxhero

can we close this issue and https://github.com/kubernetes/kubernetes/pull/123110 ?

neolit123 avatar Apr 22 '24 12:04 neolit123

sure

yxxhero avatar Apr 22 '24 13:04 yxxhero