Intermittent failure to form etcd cluster (hetzner)
Bug Report
Description
I am experiencing an intermittent issue where control plane nodes fail to join the etcd cluster after bootstrap.
Roughly 50% of the time the etcd cluster is successfully formed (i.e., joined by all three control plane nodes).
The rest of the time, either the first node (the one the bootstrap is issued against) remains the only etcd member and the other two control plane nodes are left out of the cluster, or a second node manages to join etcd but the third one does not.
I've built the image I'm deploying with Packer as per [1].
I am patching the following parts of the generated cloud-init configuration (a sketch of these patches as JSON patch documents is included below):

For control plane nodes:

- /machine/ca
- /cluster/ca
- /cluster/aggregatorCA
- /cluster/aescbcEncryptionSecret
- /cluster/serviceAccount/key
- /cluster/etcd/ca
- /cluster/etcd/subnet (set to 10.202.12.0/24; I also tried without this patch, using the public server IPs, and it made no difference)
For all nodes:

- /machine/ca/crt
- /machine/token
- /cluster/ca/crt
- /cluster/id
- /cluster/secret
- /cluster/token
- /machine/kubelet (adding { "nodeIP": { "validSubnets": ["10.202.12.0/24"] } })
(Most of the patches are meant to make sure that the generated configuration remains stable between runs against the same cluster.)
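A minimal sketch of the two subnet-related patches as RFC 6902 JSON patch documents (the paths and subnet come from the lists above; the PKI/secret patches follow the same op/path pattern with values extracted from an earlier run, and the file names here are only illustrative):

```bash
# Control-plane-only patch: pin the subnet etcd advertises for peer traffic.
# (Illustrative file name; "add" creates or overwrites the target member.)
cat > controlplane-patch.json <<'EOF'
[
  { "op": "add", "path": "/cluster/etcd/subnet", "value": "10.202.12.0/24" }
]
EOF

# Patch applied to all nodes: restrict the kubelet node IP to the private subnet.
cat > all-nodes-patch.json <<'EOF'
[
  { "op": "add", "path": "/machine/kubelet/nodeIP", "value": { "validSubnets": ["10.202.12.0/24"] } }
]
EOF
```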
After issuing the bootstrap command and waiting for ~15 minutes, the etcd cluster is still not formed.
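For reference, a sketch of the bootstrap step and checks that can be run afterwards (the private IPs from above are used as placeholders; the talosconfig path is an assumption):

```bash
# Talos client configuration generated alongside the machine configs.
export TALOSCONFIG=./talosconfig

# Bootstrap etcd on the first control plane node only.
talosctl --nodes 10.202.12.11 --endpoints 10.202.12.11 bootstrap

# Check the etcd service state on each control plane node.
for ip in 10.202.12.11 10.202.12.12 10.202.12.13; do
  talosctl --nodes "$ip" --endpoints "$ip" service etcd
done

# List the etcd members as seen by the bootstrapped node.
talosctl --nodes 10.202.12.11 --endpoints 10.202.12.11 etcd members
```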
Logs
etcd logs from a node that's unable to join the etcd cluster
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_CIPHER_SUITES","variable-value":"TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["/usr/local/bin/etcd","--advertise-client-urls=https://10.202.12.13:2379","--auto-tls=false","--cert-file=/system/secrets/etcd/peer.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--initial-advertise-peer-urls=https://10.202.12.13:2380","--initial-cluster=hz-fsn1-talos-poc-kube-controlplane-n2=https://10.201.12.12:2380,=https://10.201.12.13:2380","--initial-cluster-state=existing","--key-file=/system/secrets/etcd/peer.key","--listen-client-urls=https://[::]:2379","--listen-peer-urls=https://[::]:2380","--name=hz-fsn1-talos-poc-kube-controlplane-n3","--peer-auto-tls=false","--peer-cert-file=/system/secrets/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/system/secrets/etcd/peer.key","--peer-trusted-ca-file=/system/secrets/etcd/ca.crt","--trusted-ca-file=/system/secrets/etcd/ca.crt"]}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://[::]:2380"]}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.016Z","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /system/secrets/etcd/peer.crt, key = /system/secrets/etcd/peer.key, client-cert=, client-key=, trusted-ca = /system/secrets/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305","TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"]}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.024Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://[::]:2379"]}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.024Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"08407ff76","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":3,"max-cpu-available":3,"member-initialized":false,"name":"hz-fsn1-talos-poc-kube-controlplane-n3","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://10.202.12.13:2380"],"listen-peer-urls":["https://[::]:2380"],"advertise-client-urls":["https://10.202.12.13:2379"],"listen-client-urls":["https://[::]:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"=https://10.201.12.13:2380,hz-fsn1-talos-poc-kube-controlplane-n2=https://10.201.12.12:2380","initial-cluster-state":"existing","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.025Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"179.649µs"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.027Z","caller":"embed/etcd.go:368","msg":"closing etcd server","name":"hz-fsn1-talos-poc-kube-controlplane-n3","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://10.202.12.13:2380"],"advertise-client-urls":["https://10.202.12.13:2379"]}
157.90.243.49: {"level":"warn","ts":"2022-07-14T13:07:14.027Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.202.12.11:40806","server-name":"","error":"read tcp 10.202.12.13:2380->10.202.12.11:40806: use of closed network connection"}
157.90.243.49: {"level":"info","ts":"2022-07-14T13:07:14.027Z","caller":"embed/etcd.go:370","msg":"closed etcd server","name":"hz-fsn1-talos-poc-kube-controlplane-n3","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://10.202.12.13:2380"],"advertise-client-urls":["https://10.202.12.13:2379"]}
157.90.243.49: {"level":"fatal","ts":"2022-07-14T13:07:14.027Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"couldn't find local name \"hz-fsn1-talos-poc-kube-controlplane-n3\" in the initial cluster configuration","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
Environment
- Talos version: 1.1.0
- Kubernetes version: 1.24.2
- Platform: Hetzner Cloud
I just saw the 1.1.1 release and this commit. I'm about to bump the Talos version to 1.1.1 and see if the issue is still there.
This was an issue with the 1.1.0 release only; it should work for < 1.1.0 and >= 1.1.1. It is a race condition that depends on the bootstrap order: with a random IP order it has a low chance of succeeding.
With 1.1.1 the behavior actually got worse, but at least it's no longer intermittent: now it happens every time.
No node, except for the one which the bootstrap command was issued against, joins the etcd cluster or becomes a control-plane Kubernetes node.
On the 1st one (i.e., the bootstrapped node) I see a bunch of {"level":"warn","ts":"2022-07-20T11:34:27.622Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.202.12.12:60264","server-name":"","error":"remote error: tls: bad certificate"} messages.
On the 2nd and 3rd nodes the etcd logs are missing (ERROR: rpc error: code = Unknown desc = log "etcd" was not registered), while talosctl dmesg shows the usual loop of:
etcd is waiting to join the cluster, if this node is the first node in the cluster, please run `talosctl bootstrap` against one of the following IPs: ...
I'm thinking that this has something to do with the config patches I am applying; however, the process I'm using is roughly:
- generate configuration
- extract and persist PKI/secrets/ids
- generate configuration, this time patching it with the PKI/secrets/ids extracted earlier
My assumption is that this should lead to a valid configuration, and talosctl validate -m cloud -c ... also seems to agree.
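A minimal sketch of the final step of that procedure, assuming the patch files from earlier and that the patches are passed inline to talosctl gen config (cluster name and endpoint are placeholders; the per-role patch flag depends on the talosctl version):

```bash
# Regenerate the configs, injecting the persisted secrets/subnet patches.
# Cluster name and load balancer endpoint below are placeholders.
talosctl gen config talos-poc https://10.202.12.10:6443 \
  --config-patch "$(cat all-nodes-patch.json)" \
  --config-patch-control-plane "$(cat controlplane-patch.json)"

# Sanity-check the generated control plane config for a cloud platform.
talosctl validate --mode cloud --config controlplane.yaml
```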
I'd appreciate any hints on where to look next. Thanks!
During the join sequence, the other control plane nodes discover the addresses of existing etcd members via the Kubernetes control plane. I assume that part works, since you can see the connection errors in the logs.
I'm not sure why tls: bad certificate happens, but the joining node builds its etcd client certificate from the PKI in the machine config, so the only requirement is that the etcd PKI is in sync across the control plane nodes.
There should also be a message in the log of the joining node(s) along the lines of retrying: ...; this might give a hint as to why the joining node can't connect to etcd.
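One quick way to confirm the etcd PKI really is identical across the rendered control plane configs could be something like this (a sketch; it assumes Go yq v4 and hypothetical per-node config file names):

```bash
# All three hashes should be identical if the etcd CA is in sync.
for f in controlplane-n1.yaml controlplane-n2.yaml controlplane-n3.yaml; do
  yq '.cluster.etcd.ca.crt' "$f" | sha256sum
done
```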
Thanks, the error was:
retrying error: no controlplane endpoints discovered yet
and then:
retrying error: timeout
Turns out that removing the /cluster/etcd/subnet and /machine/kubelet/nodeIP/validSubnets JSON patches fixes the issue. (Removing each patch individually did not help.)
For clarity: I'm using a Hetzner load balancer, configured with a single service and a single target (the Kubernetes API endpoint) with use_private_ip set to true.
So I guess my question is: is having the kubelet and etcd talk over a private cloud network supported?
It's hard to say what the issue is without the full logs, but I believe that on Hetzner some IPs are assigned relatively late in the boot cycle, which might affect this to some extent. At the same time, /cluster/etcd/subnet controls which IPs are advertised by etcd for peer connections, so you might be hitting a bug there. We will do more work to stabilize things in the next release.
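One way to check whether a node actually has an address inside that subnet by the time etcd starts is to compare the addresses Talos has discovered with the etcd service log; a sketch, assuming `talosctl get addresses` is available in this version and using one of the node IPs from the logs above:

```bash
# Addresses Talos has discovered on the node (look for a 10.202.12.0/24 entry).
talosctl --nodes 157.90.243.49 get addresses

# The etcd service log, to correlate when etcd started relative to that address appearing.
talosctl --nodes 157.90.243.49 logs etcd
```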
Looking at the logs of a node that failed to join the control plane, it looks like the private network was brought up quickly enough.
Here's a support bundle from a failed run, including all three cp nodes. support.zip
yeah, I looked into the logs.
So one of the joining nodes failed like I expected and described above:
user: warning: [2022-07-21T08:41:29.200341553Z]: [talos] service[etcd](Failed): Failed to run pre stage: failed to calculate etcd addresses: no address matched the provided subnet
It tried to start etcd before the private IP was available; we'll have a fix for that in the next version.
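To confirm the same failure on another node, the quoted message can be grepped out of the console log (a sketch with a placeholder node IP; talosconfig assumed to be set up):

```bash
# Look for the failed etcd "pre" stage in the node's console log.
talosctl --nodes 10.202.12.12 dmesg | grep -i 'no address matched'

# The etcd service status and its recent events show the same failure.
talosctl --nodes 10.202.12.12 service etcd
```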
As for the other one, which doesn't join due to tls: bad certificate, I don't see anything in the logs. So you should probably try with a plain talosctl gen config to remove any doubts about your config generation procedure.
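A minimal way to rule the generation procedure out would be a completely unpatched run (a sketch; cluster name, endpoint, and node IPs are placeholders):

```bash
mkdir -p plain && cd plain

# Generate a fresh, unpatched config set.
talosctl gen config talos-poc https://10.202.12.10:6443

# Apply it to the three control plane nodes (still in maintenance mode) and bootstrap the first one.
for ip in 10.202.12.11 10.202.12.12 10.202.12.13; do
  talosctl apply-config --insecure --nodes "$ip" --file controlplane.yaml
done

talosctl --talosconfig ./talosconfig --nodes 10.202.12.11 --endpoints 10.202.12.11 bootstrap
```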
The second interface (the private network) appears a little later than the first one, so the etcd certs do not get a SAN IP from the private network. I've tried to solve it here: https://github.com/siderolabs/talos/pull/5806
As a workaround, just reboot the control plane node, or add a longer wait in the boot menu as done in https://github.com/sergelogvinov/terraform-talos/blob/e1b09d1c0cb4d0f3719038df096f7e1c94e48355/system_os/hetzner/hetzner.pkr.hcl#L51 (that one works only for Hetzner Cloud).
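For the reboot workaround, that is simply (placeholder node IP; talosconfig assumed):

```bash
# Reboot the affected control plane node so the private interface is up before etcd starts.
talosctl --nodes 10.202.12.12 reboot
```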
I see this issue not only in Hetzner Cloud...