Intermittent failure to form etcd cluster (hetzner)
Bug Report
Description
I am experiencing an intermittent issue where control plane nodes fail to join the etcd cluster after bootstrap.
Roughly 50% of the time the etcd cluster is successfully formed (i.e., joined by all three control plane nodes).
The rest of the time, either the first node (the one the bootstrap is issued against) remains the only etcd member and the other two control plane nodes are left out of the cluster, or a second node manages to join etcd but the third one does not.
I've built the image I'm deploying with Packer as per [1].
I am patching the following parts of the generated cloud-init configuration (a sketch of these patches as JSON patch documents is included below):

For control plane nodes:

- /machine/ca
- /cluster/ca
- /cluster/aggregatorCA
- /cluster/aescbcEncryptionSecret
- /cluster/serviceAccount/key
- /cluster/etcd/ca
- /cluster/etcd/subnet (set to 10.202.12.0/24; I also tried without this patch, using the public server IPs, and it made no difference)
For all nodes:

- /machine/ca/crt
- /machine/token
- /cluster/ca/crt
- /cluster/id
- /cluster/secret
- /cluster/token
- /machine/kubelet (adding { "nodeIP": { "validSubnets": ["10.202.12.0/24"] } })
(Most of the patches are meant to make sure that the generated configuration remains stable between runs against the same cluster.)
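A minimal sketch of the two subnet-related patches as RFC 6902 JSON patch documents (the paths and subnet come from the lists above; the PKI/secret patches follow the same op/path pattern with values extracted from an earlier run, and the file names here are only illustrative):

```bash
# Control-plane-only patch: pin the subnet etcd advertises for peer traffic.
# (Illustrative file name; "add" creates or overwrites the target member.)
cat > controlplane-patch.json <<'EOF'
[
  { "op": "add", "path": "/cluster/etcd/subnet", "value": "10.202.12.0/24" }
]
EOF

# Patch applied to all nodes: restrict the kubelet node IP to the private subnet.
cat > all-nodes-patch.json <<'EOF'
[
  { "op": "add", "path": "/machine/kubelet/nodeIP", "value": { "validSubnets": ["10.202.12.0/24"] } }
]
EOF
```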
After issuing the bootstrap command and waiting for ~15 minutes, the etcd cluster is still not formed.
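For reference, a sketch of the bootstrap step and checks that can be run afterwards (the private IPs from above are used as placeholders; the talosconfig path is an assumption):

```bash
# Talos client configuration generated alongside the machine configs.
export TALOSCONFIG=./talosconfig

# Bootstrap etcd on the first control plane node only.
talosctl --nodes 10.202.12.11 --endpoints 10.202.12.11 bootstrap

# Check the etcd service state on each control plane node.
for ip in 10.202.12.11 10.202.12.12 10.202.12.13; do
  talosctl --nodes "$ip" --endpoints "$ip" service etcd
done

# List the etcd members as seen by the bootstrapped node.
talosctl --nodes 10.202.12.11 --endpoints 10.202.12.11 etcd members
```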
Logs
etcd logs from a node that's unable to join the etcd cluster
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_CIPHER_SUITES","variable-value":"TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["/usr/local/bin/etcd","--advertise-client-urls=https://10.202.12.13:2379","--auto-tls=false","--cert-file=/system/secrets/etcd/peer.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--initial-advertise-peer-urls=https://10.202.12.13:2380","--initial-cluster=hz-fsn1-talos-poc-kube-controlplane-n2=https://10.201.12.12:2380,=https://10.201.12.13:2380","--initial-cluster-state=existing","--key-file=/system/secrets/etcd/peer.key","--listen-client-urls=https://[::]:2379","--listen-peer-urls=https://[::]:2380","--name=hz-fsn1-talos-poc-kube-controlplane-n3","--peer-auto-tls=false","--peer-cert-file=/system/secrets/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/system/secrets/etcd/peer.key","--peer-trusted-ca-file=/system/secrets/etcd/ca.crt","--trusted-ca-file=/system/secrets/etcd/ca.crt"]}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://[::]:2380"]}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /system/secrets/etcd/peer.crt, key = /system/secrets/etcd/peer.key, client-cert=, client-key=, trusted-ca = /system/secrets/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305","TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"]}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.024Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://[::]:2379"]}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.024Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"08407ff76","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":3,"max-cpu-available":3,"member-initialized":false,"name":"hz-fsn1-talos-poc-kube-controlplane-n3","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://10.202.12.13:2380"],"listen-peer-urls":["https://[::]:2380"],"advertise-client-urls":["https://10.202.12.13:2379"],"listen-client-urls":["https://[::]:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"=https://10.201.12.13:2380,hz-fsn1-talos-poc-kube-controlplane-n2=https://10.201.12.12:2380","initial-cluster-state":"existing","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.025Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"179.649µs"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.027Z","caller":"embed/etcd.go:368","msg":"closing etcd server","name":"hz-fsn1-talos-poc-kube-controlplane-n3","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://10.202.12.13:2380"],"advertise-client-urls":["https://10.202.12.13:2379"]}
157.90.243.49: {"level":"warn","ts":"2022-07-14T13:07:14.027Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.202.12.11:40806","server-name":"","error":"read tcp 10.202.12.13:2380->10.202.12.11:40806: use of closed network connection"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.027Z","caller":"embed/etcd.go:370","msg":"closed etcd server","name":"hz-fsn1-talos-poc-kube-controlplane-n3","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://10.202.12.13:2380"],"advertise-client-urls":["https://10.202.12.13:2379"]}
157.90.243.49: {"level":"fatal","ts":"2022-07-14T13:07:14.027Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"couldn't find local name \"hz-fsn1-talos-poc-kube-controlplane-n3\" in the initial cluster configuration","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
Environment
- Talos version: 1.1.0
- Kubernetes version: 1.24.2
- Platform: Hetzner Cloud
I just saw the 1.1.1 release and this commit. I'm about to bump the Talos version to 1.1.1 and see if the issue is still there.
This was an issue with the 1.1.0 release only; it should work for < 1.1.0 and >= 1.1.1. It is a race condition that depends on the bootstrap order: with a random IP order it has a low chance of succeeding.
With 1.1.1 the behavior actually got worse, but at least it's no longer intermittent: now it happens every time.
No node, except for the one which the bootstrap command was issued against, joins the etcd cluster or becomes a control-plane Kubernetes node.
On the 1st one (i.e., the bootstrapped node) I see a bunch of {"level":"warn","ts":"2022-07-20T11:34:27.622Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.202.12.12:60264","server-name":"","error":"remote error: tls: bad certificate"} messages.
On the 2nd and 3rd nodes the etcd logs are missing (ERROR: rpc error: code = Unknown desc = log "etcd" was not registered), while talosctl dmesg shows the usual loop of:
etcd is waiting to join the cluster, if this node is the first node in the cluster, please run `talosctl bootstrap` against one of the following IPs: ...
I'm thinking that this has something to do with the config patches I am applying; however, the process I'm using is roughly:
- generate configuration
- extract and persist PKI/secrets/ids
- generate configuration, this time patching it with the PKI/secrets/ids extracted earlier
My assumption is that this should lead to a valid configuration, and talosctl validate -m cloud -c ... also seems to agree.
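A minimal sketch of the final step of that procedure, assuming the patch files from earlier and that the patches are passed inline to talosctl gen config (cluster name and endpoint are placeholders; the per-role patch flag depends on the talosctl version):

```bash
# Regenerate the configs, injecting the persisted secrets/subnet patches.
# Cluster name and load balancer endpoint below are placeholders.
talosctl gen config talos-poc https://10.202.12.10:6443 \
  --config-patch "$(cat all-nodes-patch.json)" \
  --config-patch-control-plane "$(cat controlplane-patch.json)"

# Sanity-check the generated control plane config for a cloud platform.
talosctl validate --mode cloud --config controlplane.yaml
```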
I'd appreciate any hints on where to look next. Thanks!
During the join sequence, the other control plane nodes discover the addresses of existing etcd members via the Kubernetes control plane. I assume that part works, since you can see the connection errors in the logs.
I'm not sure why tls: bad certificate happens, but the joining node builds its etcd client certificate from the PKI in the machine config, so the only requirement is that the etcd PKI is in sync across the control plane nodes.
There should also be a message in the log of the joining node(s) along the lines of retrying: ...; this might give a hint as to why the joining node can't connect to etcd.
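One quick way to confirm the etcd PKI really is identical across the rendered control plane configs could be something like this (a sketch; it assumes Go yq v4 and hypothetical per-node config file names):

```bash
# All three hashes should be identical if the etcd CA is in sync.
for f in controlplane-n1.yaml controlplane-n2.yaml controlplane-n3.yaml; do
  yq '.cluster.etcd.ca.crt' "$f" | sha256sum
done
```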
Thanks, the error was:
retrying error: no controlplane endpoints discovered yet
and then:
retrying error: timeout
Turns out that removing the /cluster/etcd/subnet and /machine/kubelet/nodeIP/validSubnets JSON patches fixes the issue. (Removing each patch individually did not help.)
For clarity: I'm using a Hetzner load balancer, configured with a single service and a single target (the Kubernetes API endpoint) with use_private_ip set to true.
So I guess my question is: is having the kubelet and etcd talk over a private cloud network supported?
It's hard to say what the issue is without the full logs, but I believe that on Hetzner some IPs are assigned relatively late in the boot cycle, which might affect this to some extent. At the same time, /cluster/etcd/subnet controls which IPs are advertised by etcd for peer connections, so you might be hitting a bug there. We will do more work to stabilize things in the next release.
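One way to check whether a node actually has an address inside that subnet by the time etcd starts is to compare the addresses Talos has discovered with the etcd service log; a sketch, assuming `talosctl get addresses` is available in this version and using one of the node IPs from the logs above:

```bash
# Addresses Talos has discovered on the node (look for a 10.202.12.0/24 entry).
talosctl --nodes 157.90.243.49 get addresses

# The etcd service log, to correlate when etcd started relative to that address appearing.
talosctl --nodes 157.90.243.49 logs etcd
```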
Looking at the logs of a node that failed to join the control plane, it looks like the private network was brought up quickly enough.
Here's a support bundle from a failed run, including all three cp nodes. support.zip
yeah, I looked into the logs.
So one of the joining nodes failed like I expected and described above:
user: warning: [2022-07-21T08:41:29.200341553Z]: [talos] service[etcd](Failed): Failed to run pre stage: failed to calculate etcd addresses: no address matched the provided subnet
It tried to start etcd before the private IP was available; we'll have a fix for that in the next version.
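To confirm the same failure on another node, the quoted message can be grepped out of the console log (a sketch with a placeholder node IP; talosconfig assumed to be set up):

```bash
# Look for the failed etcd "pre" stage in the node's console log.
talosctl --nodes 10.202.12.12 dmesg | grep -i 'no address matched'

# The etcd service status and its recent events show the same failure.
talosctl --nodes 10.202.12.12 service etcd
```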
As for the other one, which doesn't join due to tls: bad certificate, I don't see anything in the logs. So you should probably try with a plain talosctl gen config to remove any doubts about your config generation procedure.
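A minimal way to rule the generation procedure out would be a completely unpatched run (a sketch; cluster name, endpoint, and node IPs are placeholders):

```bash
mkdir -p plain && cd plain

# Generate a fresh, unpatched config set.
talosctl gen config talos-poc https://10.202.12.10:6443

# Apply it to the three control plane nodes (still in maintenance mode) and bootstrap the first one.
for ip in 10.202.12.11 10.202.12.12 10.202.12.13; do
  talosctl apply-config --insecure --nodes "$ip" --file controlplane.yaml
done

talosctl --talosconfig ./talosconfig --nodes 10.202.12.11 --endpoints 10.202.12.11 bootstrap
```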
The second interface (the private network) appears a little later than the first one, so the etcd certs do not get a SAN IP from the private network. I've tried to solve it here: https://github.com/siderolabs/talos/pull/5806
As a workaround, just reboot the control plane node, or add a longer wait in the boot menu as done in https://github.com/sergelogvinov/terraform-talos/blob/e1b09d1c0cb4d0f3719038df096f7e1c94e48355/system_os/hetzner/hetzner.pkr.hcl#L51 (that one works only for Hetzner Cloud).
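For the reboot workaround, that is simply (placeholder node IP; talosconfig assumed):

```bash
# Reboot the affected control plane node so the private interface is up before etcd starts.
talosctl --nodes 10.202.12.12 reboot
```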
I see this issue not only in Hetzner Cloud...