postgresql_cluster etcd needs a restart after propagating certificates

Bug description

I noticed that I was not able to deploy etcd properly on clean SSH machines. The deployment fails at:

TASK [etcd : Wait for port 2379 to become open on the host] ****************************************************************************************************************************************************************************
ok: [100.64.0.57]
ok: [100.64.0.59]
ok: [100.64.0.58]
FAILED - RETRYING: [100.64.0.57]: Wait until the etcd cluster is healthy (10 retries left).
FAILED - RETRYING: [100.64.0.59]: Wait until the etcd cluster is healthy (10 retries left).
FAILED - RETRYING: [100.64.0.58]: Wait until the etcd cluster is healthy (10 retries left).

In etcd logs the following can be seen:

"remote-addr":"100.64.0.57:38362","server-name":"","error":"tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Autobase CA\")"}
"remote-addr":"100.64.0.58:53956","server-name":"","error":"remote error: tls: bad certificate"}

It appears etcd was not restarted after the new CA was added to the host. I had to log in and restart the etcd systemd service.
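
To tell a stale etcd process apart from mismatched files on disk, it may help to check whether the server certificate verifies against the CA currently on the host. The following is a minimal Python sketch, not part of autobase. It assumes openssl is installed and that the files live at the default locations quoted later in this thread (/etc/etcd/tls/ca.crt and /etc/etcd/tls/server.crt); the function names are hypothetical.

```python
import subprocess


def build_verify_command(tls_dir="/etc/etcd/tls", ca="ca.crt", cert="server.crt"):
    """Build an `openssl verify` command for the given files (assumed default paths)."""
    return ["openssl", "verify", "-CAfile", f"{tls_dir}/{ca}", f"{tls_dir}/{cert}"]


def server_cert_matches_ca(tls_dir="/etc/etcd/tls"):
    """Return True if openssl exits with 0, i.e. server.crt chains to ca.crt."""
    result = subprocess.run(
        build_verify_command(tls_dir), capture_output=True, text=True
    )
    return result.returncode == 0
```

If server_cert_matches_ca() returns True on a host while etcd still logs "certificate signed by unknown authority", the files on that host are consistent with each other, which would fit the observation that a restart was enough. The check covers one host only; it does not compare the CA across cluster members.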

Expected behavior

etcd is expected to pick up the CA certificates properly.

Steps to reproduce

Have clean SSH VMs
Run the deploy_pgcluster playbook with mostly default settings

Installation method

Command line

System info

AlmaLinux 9 machines, autobase latest master branch.

Additional info

No response

May 03 '25 10:05 realkarmakun

Certificates are generated before etcd starts, so an additional restart usually isn’t needed. We’ve already deployed many times to production without encountering this issue.

@realkarmakun Is this happening consistently on every deployment, or is it just a one-off?

May 03 '25 11:05 vitabaks

I have yet to check it against freshly installed VMs, but it was consistent when I removed the cluster using the remove_cluster playbook.

May 03 '25 14:05 realkarmakun

I also ran into this in a new cluster just now. Solved it by setting tls_cert_regenerate: true (or -e "tls_cert_regenerate=true" when running interactively).

The issue is likely a mismatch in the certs generated during the initial run with respect to hostname matching. The hostname is also altered in the process, and at some point something causes a hiccup/mismatch.

A more failsafe way would be to set tls_cert_regenerate: true by default. While this will cause slightly longer execution times, it should prevent these issues in the first place. Not fully sure though, it could also result in the same issues. If so, there's likely an issue with the execution order, i.e. certs should be generated after all hostname-related tasks have finished.

May 06 '25 10:05 pat-s

certs should be generated after all hostname-related tasks have finished.

We generate certificates after changing the hostname.

May 06 '25 11:05 vitabaks

tls_cert_regenerate is enabled by default https://github.com/vitabaks/autobase/blob/2.2.0/automation/vars/main.yml#L90

May 06 '25 11:05 vitabaks

I hit a different issue, not cert verification: "error":"tls: client didn't provide a certificate"

I followed the default settings without changing anything:

# if dcs_type: "etcd" and dcs_exists: false
etcd_version: "3.5.20" # version for deploy etcd cluster
etcd_data_dir: "/var/lib/etcd"
etcd_cluster_name: "etcd-{{ patroni_cluster_name }}" # ETCD_INITIAL_CLUSTER_TOKEN
etcd_on_dedicated_nodes: "{{ groups['etcd_cluster'] | difference(groups['postgres_cluster']) | length > 0 }}" # 'true' or 'false'
# TLS
# Enables TLS encryption with a self-signed certificate if 'tls_cert_generate' is true.
etcd_tls_enable: "{{ tls_cert_generate | default(true) }}"
etcd_tls_dir: "/etc/etcd/tls"
etcd_tls_ca_crt: "ca.crt"
etcd_tls_ca_key: "ca.key"
etcd_tls_server_crt: "server.crt"
etcd_tls_server_key: "server.key"
etcd_client_cert_auth: "{{ 'true' if not etcd_on_dedicated_nodes | bool else 'false' }}"

I tried setting etcd_client_cert_auth: 'false' and restarting etcd, but it still asks for a certificate.
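
For reference, the two Jinja expressions in the defaults above can be read as follows. This is a Python sketch of the logic only, not autobase code; the function names and the sample inventory are hypothetical.

```python
def etcd_on_dedicated_nodes(etcd_cluster, postgres_cluster):
    """Mirror `groups['etcd_cluster'] | difference(groups['postgres_cluster']) | length > 0`."""
    only_etcd = [host for host in etcd_cluster if host not in postgres_cluster]
    return len(only_etcd) > 0


def etcd_client_cert_auth(on_dedicated_nodes):
    """Mirror `'true' if not etcd_on_dedicated_nodes | bool else 'false'`."""
    return "true" if not on_dedicated_nodes else "false"


# Hypothetical inventory: etcd runs on the same three hosts as PostgreSQL.
hosts = ["100.64.0.57", "100.64.0.58", "100.64.0.59"]
dedicated = etcd_on_dedicated_nodes(hosts, hosts)
print(dedicated, etcd_client_cert_auth(dedicated))
```

When etcd is co-located with PostgreSQL, the default resolves to 'true', so etcd requires a client certificate. That matches the "client didn't provide a certificate" error for any client that connects without one. The sketch does not explain why an override to 'false' did not take effect.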

Jul 10 '25 06:07 JY-210

Can the task be closed or is the problem still relevant?

Aug 21 '25 17:08 vitabaks

I wasn't able to replicate it, so yes, I'll close it.

Aug 24 '25 00:08 realkarmakun
