postgresql_cluster
etcd needs a restart after propagating certificates
Bug description
I was not able to deploy etcd properly on clean machines accessed over SSH. The deployment fails at:
TASK [etcd : Wait for port 2379 to become open on the host] ****************************************************************************************************************************************************************************
ok: [100.64.0.57]
ok: [100.64.0.59]
ok: [100.64.0.58]
FAILED - RETRYING: [100.64.0.57]: Wait until the etcd cluster is healthy (10 retries left).
FAILED - RETRYING: [100.64.0.59]: Wait until the etcd cluster is healthy (10 retries left).
FAILED - RETRYING: [100.64.0.58]: Wait until the etcd cluster is healthy (10 retries left).
In etcd logs the following can be seen:
"remote-addr":"100.64.0.57:38362","server-name":"","error":"tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Autobase CA\")"}
"remote-addr":"100.64.0.58:53956","server-name":"","error":"remote error: tls: bad certificate"}
It appears that etcd was not restarted after the new CA was added to the host. I had to log in and restart the etcd systemd service manually.
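For triaging log lines like the two above, here is a minimal Python sketch that maps the TLS error text to a probable cause. The labels and the cause descriptions are my own heuristic reading of these messages, not an official etcd mapping, and `classify_tls_error` is a hypothetical helper name.

```python
# Heuristic triage of TLS errors found in etcd logs.
# The labels below are assumptions for illustration, not etcd terminology.
TLS_HINTS = [
    # The connecting side presented a certificate that the CA loaded by
    # this etcd process did not sign (for example, a stale CA in memory).
    ("certificate signed by unknown authority", "ca_mismatch"),
    # The remote side sent a TLS alert rejecting the certificate that
    # this node presented.
    ("bad certificate", "cert_rejected_by_peer"),
    # A client connected without a certificate although the server
    # requires one.
    ("client didn't provide a certificate", "missing_client_cert"),
]


def classify_tls_error(log_line: str) -> str:
    """Return a label for the first known TLS error found in log_line."""
    for needle, label in TLS_HINTS:
        if needle in log_line:
            return label
    return "unrecognized"
```

Seeing `ca_mismatch` on one node together with `cert_rejected_by_peer` on another would be consistent with the nodes not agreeing on the CA, which is what a restart after the CA change would address.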
Expected behavior
etcd is expected to pick up the CA certificates properly without a manual restart.
Steps to reproduce
- Have clean SSH VMs
- Run the deploy_pgcluster with mostly defaults
Installation method
Command line
System info
AlmaLinux 9 machines, autobase latest master branch.
Additional info
No response
Certificates are generated before etcd starts, so an additional restart usually isn’t needed. We’ve already deployed many times to production without encountering this issue.
@realkarmakun Is this happening consistently on every deployment, or is it just a one-off?
I have yet to check it against freshly installed VMs, but it happened consistently when I removed the cluster using the remove_cluster playbook.
I also ran into this in a new cluster just now. Solved it by setting tls_cert_regenerate: true (or -e "tls_cert_regenerate=true" when running interactively).
The issue is likely a mismatch in the certs generated during the initial run with respect to hostname matching. The hostname is altered in the process as well, and at some point something causes a hiccup/mismatch.
A more failsafe way would be to set tls_cert_regenerate: true by default. While this will make executions take a little longer, it should prevent these issues in the first place.
Not fully sure though, it could also result in the same issues. If so, there's likely an issue with the execution order, i.e. certs should be generated after all hostname-related tasks have finished.
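One way to test the mismatch hypothesis is to check on each node whether the server certificate on disk actually chains to the CA on disk. Below is a minimal Python sketch that wraps `openssl verify`. The default paths assume autobase's default etcd TLS variables (`/etc/etcd/tls`, `ca.crt`, `server.crt`), and the helper names are hypothetical.

```python
import subprocess


def build_verify_command(tls_dir="/etc/etcd/tls", ca_crt="ca.crt",
                         server_crt="server.crt"):
    """Build the openssl command that checks server_crt against ca_crt."""
    return [
        "openssl", "verify",
        "-CAfile", f"{tls_dir}/{ca_crt}",
        f"{tls_dir}/{server_crt}",
    ]


def server_cert_matches_ca(**paths) -> bool:
    """Return True if openssl accepts the server certificate.

    Relies on the exit code of `openssl verify`, which is an assumption
    about the installed OpenSSL version behaving as recent releases do.
    """
    result = subprocess.run(build_verify_command(**paths),
                            capture_output=True, text=True)
    return result.returncode == 0
```

This only checks the files on one node. If it passes everywhere and etcd still rejects peers, the running process may still hold an older CA in memory, which would point back to a missing restart rather than a hostname mismatch.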
certs should be generated after all hostname-related tasks have finished.
We generate certificates after changing the hostname.
tls_cert_regenerate is enabled by default https://github.com/vitabaks/autobase/blob/2.2.0/automation/vars/main.yml#L90
I hit a different issue, not cert verification: "error":"tls: client didn't provide a certificate"
I used the default settings without changing anything:
# if dcs_type: "etcd" and dcs_exists: false
etcd_version: "3.5.20" # version for deploy etcd cluster
etcd_data_dir: "/var/lib/etcd"
etcd_cluster_name: "etcd-{{ patroni_cluster_name }}" # ETCD_INITIAL_CLUSTER_TOKEN
etcd_on_dedicated_nodes: "{{ groups['etcd_cluster'] | difference(groups['postgres_cluster']) | length > 0 }}" # 'true' or 'false'
# TLS
# Enables TLS encryption with a self-signed certificate if 'tls_cert_generate' is true.
etcd_tls_enable: "{{ tls_cert_generate | default(true) }}"
etcd_tls_dir: "/etc/etcd/tls"
etcd_tls_ca_crt: "ca.crt"
etcd_tls_ca_key: "ca.key"
etcd_tls_server_crt: "server.crt"
etcd_tls_server_key: "server.key"
etcd_client_cert_auth: "{{ 'true' if not etcd_on_dedicated_nodes | bool else 'false' }}"
I tried setting etcd_client_cert_auth: 'false' and restarting etcd, but it still asks for a client certificate.
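To make the two derived defaults above easier to reason about, here is a minimal Python sketch that mirrors the Jinja expressions for `etcd_on_dedicated_nodes` and `etcd_client_cert_auth`. The function name and the example inventories are hypothetical, and it assumes the `difference` filter keeps the hosts of the first group that are absent from the second.

```python
def etcd_tls_defaults(etcd_cluster, postgres_cluster):
    """Mirror the default expressions for two etcd variables.

    etcd_cluster and postgres_cluster stand in for the inventory groups
    groups['etcd_cluster'] and groups['postgres_cluster'].
    """
    # groups['etcd_cluster'] | difference(groups['postgres_cluster']) | length > 0
    only_etcd = [host for host in etcd_cluster if host not in postgres_cluster]
    on_dedicated_nodes = len(only_etcd) > 0

    # 'true' if not etcd_on_dedicated_nodes | bool else 'false'
    client_cert_auth = "true" if not on_dedicated_nodes else "false"
    return on_dedicated_nodes, client_cert_auth


# Example: etcd co-located with PostgreSQL on the same three hosts.
hosts = ["100.64.0.57", "100.64.0.58", "100.64.0.59"]
colocated = etcd_tls_defaults(hosts, hosts)
```

With co-located nodes the default resolves to client certificate authentication being on, so any client that connects without a certificate would be rejected. The variable only affects a running etcd once its configuration has been re-rendered, so editing the variable and restarting the service alone may not change the behavior.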
Can the task be closed or is the problem still relevant?
I wasn't able to replicate it so yeah. I'll close it