
Node deleted from etcd cluster rejoins if it was the leader when using k3s service

Open rancher-max opened this issue 3 years ago • 13 comments

Environmental Info: K3s Version:

Upgraded to the 1.21 release branch at commit 6acee2e2f5d5f3cf9da416baf321cefa7de8991c. This likely also happens when upgrading to master, but that is untested.

Node(s) CPU architecture, OS, and Version:

Ubuntu 20.04 LTS: Linux ip-172-31-14-108 5.4.0-1009-aws #9-Ubuntu SMP Sun Apr 12 19:46:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 servers, 1 agent, etcd backend

Describe the bug:

After performing a manual upgrade to this commit and then deleting a node (either by applying the new removal annotation or by running kubectl delete node <name>), a leader election error occurs with a panic, and the k3s service on the node automatically restarts, causing it to rejoin the cluster.

Steps To Reproduce:

  • Install k3s v1.21.4+k3s1
  • Upgrade using curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=6acee2e2f5d5f3cf9da416baf321cefa7de8991c sh -
  • Delete a server node

Expected behavior:

The node should be deleted and should not try to rejoin the cluster.

Actual behavior:

The k3s service restarts on the node because a panic occurs.

Additional context / logs:

This only happens after an upgrade and not in a fresh install of the same commit. k3s915upgrade-panic.log

Backporting

  • Potentially needs forward porting to master

rancher-max avatar Sep 15 '21 20:09 rancher-max

I guess what happens would depend on whether or not the deleted node is the leader when it's deleted. If it's the leader it will lose the election and panic. If it's not the leader it'll loop forever trying to acquire the leader lock.

Here's what I see if I delete a non-leader node - it loops forever on this message:

E0915 20:53:14.588650       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: server stopped"}: etcdserver: server stopped
E0915 20:53:14.589235       1 leaderelection.go:325] error retrieving resource lock kube-system/cloud-controller-manager: etcdserver: server stopped
E0915 20:53:15.043508       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: server stopped"}: etcdserver: server stopped
E0915 20:53:15.044074       1 leaderelection.go:325] error retrieving resource lock kube-system/k3s: etcdserver: server stopped
E0915 20:53:15.368929       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: server stopped"}: etcdserver: server stopped
E0915 20:53:15.369510       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: server stopped
E0915 20:53:16.537072       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: server stopped"}: etcdserver: server stopped
E0915 20:53:16.537637       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: etcdserver: server stopped

I wish we had a good way of stopping all the controllers when the node is deleted so that this was the last message seen by the user:

Sep 15 18:32:59 ip-172-31-5-33 k3s[4128]: time="2021-09-15T18:32:59.132678410Z" level=info msg="this node has been removed from the cluster please restart k3s to rejoin the cluster"

I think we could do this by running the controllers in a separate context that gets cancelled when the member is deleted? The Rancher controllers still Fatal when they lose the leaderelection though; that would have to be fixed elsewhere.
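
A minimal sketch of that idea in Go, assuming a hypothetical memberRemoved signal and a runControllers stand-in for the real k3s controller plumbing:

package main

import (
    "context"
    "fmt"
    "time"
)

// runControllers stands in for the k3s-managed controllers; the real ones
// would watch ctx.Done() and return instead of panicking on leader loss.
func runControllers(ctx context.Context) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            fmt.Println("controllers stopped: context cancelled")
            return
        case <-ticker.C:
            fmt.Println("controllers running")
        }
    }
}

func main() {
    // memberRemoved is a hypothetical signal that fires when this node is
    // removed from the etcd cluster.
    memberRemoved := make(chan struct{})

    ctx, cancel := context.WithCancel(context.Background())
    go func() {
        <-memberRemoved
        fmt.Println("this node has been removed from the cluster please restart k3s to rejoin the cluster")
        cancel() // stop the controllers instead of letting them panic
    }()

    // Simulate the member being deleted a few seconds after startup.
    go func() {
        time.Sleep(3 * time.Second)
        close(memberRemoved)
    }()

    runControllers(ctx)
}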

brandond avatar Sep 15 '21 20:09 brandond

A few edits:

  1. Renamed the issue; the title was previously "Node fails to properly delete after an upgrade". An upgrade just made this easier to reproduce, since upgrading often changes the leader of the etcd cluster.
  2. This can be simply reproduced in a fresh install with the following steps:
    1. Install k3s with etcd backend and multiple servers. This is pre-existing, so can be done using v1.21.4+k3s1 for example.
    2. Install etcdctl and get the leader: sudo etcdctl --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt --key=/var/lib/rancher/k3s/server/tls/etcd/client.key endpoint status --cluster -w table
    3. Delete that node from the etcd cluster, either with kubectl delete node <leader> or, if using a recent commit, with kubectl annotate node <leader> etcd.k3s.cattle.io/remove=true
  3. There will be a panic, as shown in the log in this issue description, and then the k3s service will restart, causing the node to either rejoin the cluster or, in some cases, create its own new cluster. The latter is what happened to me, since this was the initial server node in my cluster.

rancher-max avatar Sep 15 '21 23:09 rancher-max

@rancher-max I don't think the etcd leader is the thing that is crashing and causing problems; it's the K3s leader. Coincidentally, though, the first node in the cluster is most likely to be both the etcd AND the K3s leader, so deleting that one causes the problem.

You can figure out the K3s leader through:

k3s kubectl get cm k3s -n kube-system -o json | jq -r '.metadata.annotations["control-plane.alpha.kubernetes.io/leader"]' | jq -r .holderIdentity

if you have jq installed, or by reading the annotation directly if you don't.

Oats87 avatar Sep 16 '21 15:09 Oats87

We should probably have the node refuse to un-tombstone itself if it has been started with --cluster-init. It needs to be restarted with --server to point at the address of an existing cluster member to join, or have the DB manually removed if the user really does want to make a new single-node cluster. I wish we could handle this for the user but I can't think of a safe way to do that.

What do you think @galal-hussein ?

brandond avatar Sep 16 '21 15:09 brandond

@brandond but the first node in the cluster isn't necessarily started with --cluster-init right?

I'd say if it's starting without --server and --token, we know it's a "first member"?

Oats87 avatar Sep 16 '21 15:09 Oats87

The un-tombstone logic will delete the etcd db from disk. At that point you must have either --cluster-init or --server to use managed etcd. Lack of one of those two flags gets you SQLite.
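
A rough sketch of that selection rule as described (not the actual k3s code; the --datastore-endpoint flag discussed below is deliberately not modelled):

package main

import "fmt"

// datastoreFor is a hypothetical reconstruction of the behaviour described
// above: managed etcd requires either --cluster-init or --server, and the
// absence of both falls back to SQLite.
func datastoreFor(clusterInit bool, serverURL string) string {
    switch {
    case clusterInit:
        return "managed etcd (new cluster)"
    case serverURL != "":
        return "managed etcd (join " + serverURL + ")"
    default:
        return "sqlite"
    }
}

func main() {
    fmt.Println(datastoreFor(true, ""))                       // managed etcd (new cluster)
    fmt.Println(datastoreFor(false, "https://10.0.0.1:6443")) // managed etcd (join ...)
    fmt.Println(datastoreFor(false, ""))                      // sqlite
}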

brandond avatar Sep 16 '21 16:09 brandond

That is not my observation. A simple k3s server --datastore-endpoint etcd starts me up with K3s backed by etcd. If I touch /var/lib/rancher/k3s/server/db/etcd/tombstone and then restart K3s, I get a brand new cluster/new etcd, all without specifying --cluster-init or the --server flag (although I did use the server subcommand, which is not what I was referring to in my comment above).

Oats87 avatar Sep 16 '21 16:09 Oats87

The documented way to use managed etcd and start a new cluster is with --cluster-init.

Using --datastore-endpoint=etcd might unintentionally work, but that's not something we've documented anywhere and IMO should not work; you should need to explicitly either start a new cluster (via --cluster-init) or join one (via --server=x).

I guess the --datastore-endpoint help text does say that you can use etcd, but in practice using an external etcd cluster requires specifying an http/https endpoint. Passing just the bare string 'etcd' seems like an undocumented hack, since all other endpoints require a datastore DSN in URI format.

--datastore-endpoint value (db) Specify etcd, Mysql, Postgres, or Sqlite (default) data source name [$K3S_DATASTORE_ENDPOINT]

brandond avatar Sep 16 '21 16:09 brandond

I think k3s is the one that is panicking and losing leader election; our etcd fork has a patch to prevent panicking when a member is removed from the cluster. I think keying off --cluster-init is right if we are dealing only with k3s, but we need to consider rke2 as well, which doesn't require --cluster-init.

galal-hussein avatar Sep 16 '21 16:09 galal-hussein

Either way, I'm not sure un-tombstoning would even be the right idea. We would have to document that if you delete the first node, you need to remove the etcd data dir manually to enable a rejoin; we can also just figure out a way to stop k3s from panicking if it loses leader election.

galal-hussein avatar Sep 16 '21 16:09 galal-hussein

Anyways, regardless of what flags we end up using to get there, I think that un-tombstoning a node that doesn't have a --server to join is wrong, since it will result in the node creating a new cluster by itself. Automatic un-tombstoning should require having --server set.
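
A rough sketch of that guard, with hypothetical names (maybeResetEtcd is not the real k3s function; the real removal/reset code lives elsewhere):

package main

import (
    "errors"
    "fmt"
    "os"
    "path/filepath"
)

// maybeResetEtcd is a hypothetical guard implementing the rule above:
// only un-tombstone (wipe the managed etcd db) automatically when a
// --server URL is available to rejoin; otherwise require operator action.
func maybeResetEtcd(dataDir, serverURL string) error {
    tombstone := filepath.Join(dataDir, "server", "db", "etcd", "tombstone")
    if _, err := os.Stat(tombstone); os.IsNotExist(err) {
        return nil // node was not removed from the cluster; nothing to do
    }
    if serverURL == "" {
        return errors.New("this node was removed from the cluster; restart with --server pointing at an existing member, or delete the etcd data dir if you really want a new single-node cluster")
    }
    fmt.Printf("removing %s and rejoining via %s\n", filepath.Dir(tombstone), serverURL)
    return os.RemoveAll(filepath.Dir(tombstone))
}

func main() {
    if err := maybeResetEtcd("/var/lib/rancher/k3s", ""); err != nil {
        fmt.Println("refusing to start:", err)
    }
}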

brandond avatar Sep 16 '21 16:09 brandond

Recent commits on master should Fatal() but not stack trace when the core Kubernetes controllers panic. Wrangler-managed controllers do the same when they lose leader election: https://github.com/rancher/wrangler/blob/master/pkg/leader/leader.go#L51-L66
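
For reference, the underlying client-go leader election API leaves the on-loss behaviour to a callback, so exiting cleanly versus panicking is up to the caller. A minimal sketch of the Fatal-on-loss pattern using a LeaseLock, assuming a reachable cluster via the default kubeconfig (this is generic client-go usage, not the k3s or wrangler code):

package main

import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
    "k8s.io/klog/v2"
)

func main() {
    // Assumes the default kubeconfig points at a reachable cluster.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        klog.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    hostname, _ := os.Hostname()
    lock := &resourcelock.LeaseLock{
        LeaseMeta:  metav1.ObjectMeta{Name: "example-lock", Namespace: "kube-system"},
        Client:     client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
    }

    leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second,
        RenewDeadline: 10 * time.Second,
        RetryPeriod:   2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                klog.Info("became leader; running controllers")
                <-ctx.Done()
            },
            OnStoppedLeading: func() {
                // Exit without a stack trace, mirroring the behaviour
                // described above for wrangler-managed controllers.
                klog.Fatal("leader election lost")
            },
        },
    })
}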

brandond avatar Sep 16 '21 16:09 brandond

Related to:

  • https://github.com/rancher/rke2/issues/1959
  • https://github.com/rancher/rke2/issues/2144

brandond avatar Nov 17 '21 17:11 brandond

@brandond is this still an issue that should be kept open?

caroline-suse-rancher avatar Nov 08 '22 17:11 caroline-suse-rancher

I would defer to @rancher-max as to whether or not this is still a problem that needs investigation or resolution.

brandond avatar Nov 08 '22 18:11 brandond

This is no longer happening: instead of rejoining, the deleted node's logs now repeat with the following panic:

Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]: {"level":"info","ts":"2023-02-27T21:57:36.431Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"7717fe38937f907a"}
Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]: {"level":"info","ts":"2023-02-27T21:57:36.431Z","caller":"rafthttp/transport.go:355","msg":"removed remote peer","local-member-id":"fb0d6980fb116f87","removed-remote-peer-id":"7717fe38937f907a"}
Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]: panic: removed all voters
Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]: goroutine 219 [running]:
Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]: go.etcd.io/etcd/raft/v3.(*raft).applyConfChange(0x0?, {0x0, {0xc001464170, 0x1, 0x1}, {0x0, 0x0, 0x0}})
Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]:         /go/pkg/mod/github.com/k3s-io/etcd/raft/[email protected]/raft.go:1633 +0x214
Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]: go.etcd.io/etcd/raft/v3.(*node).run(0xc0000c8f60)
Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]:         /go/pkg/mod/github.com/k3s-io/etcd/raft/[email protected]/node.go:360 +0xb3a
Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]: created by go.etcd.io/etcd/raft/v3.RestartNode
Feb 27 21:57:36 ip-172-31-1-62 k3s[12066]:         /go/pkg/mod/github.com/k3s-io/etcd/raft/[email protected]/node.go:244 +0x24a
Feb 27 21:57:36 ip-172-31-1-62 systemd[1]: k3s.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 27 21:57:36 ip-172-31-1-62 systemd[1]: k3s.service: Failed with result 'exit-code'.
Feb 27 21:57:36 ip-172-31-1-62 systemd[1]: k3s.service: Unit process 3049 (containerd-shim) remains running after unit stopped.
Feb 27 21:57:36 ip-172-31-1-62 systemd[1]: k3s.service: Unit process 3130 (containerd-shim) remains running after unit stopped.
Feb 27 21:57:36 ip-172-31-1-62 systemd[1]: k3s.service: Unit process 3232 (containerd-shim) remains running after unit stopped.
Feb 27 21:57:36 ip-172-31-1-62 systemd[1]: k3s.service: Unit process 4197 (containerd-shim) remains running after unit stopped.
Feb 27 21:57:36 ip-172-31-1-62 systemd[1]: k3s.service: Unit process 4284 (containerd-shim) remains running after unit stopped.
Feb 27 21:57:41 ip-172-31-1-62 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 39.
Feb 27 21:57:41 ip-172-31-1-62 systemd[1]: Stopped Lightweight Kubernetes.

rancher-max avatar Feb 27 '23 21:02 rancher-max