
etcd needs a restart after propagating certificates

Open realkarmakun opened this issue 7 months ago • 5 comments

Bug description

I noticed that I could not deploy etcd properly on clean SSH machines. The deployment fails at:

TASK [etcd : Wait for port 2379 to become open on the host] ****************************************************************************************************************************************************************************
ok: [100.64.0.57]
ok: [100.64.0.59]
ok: [100.64.0.58]
FAILED - RETRYING: [100.64.0.57]: Wait until the etcd cluster is healthy (10 retries left).
FAILED - RETRYING: [100.64.0.59]: Wait until the etcd cluster is healthy (10 retries left).
FAILED - RETRYING: [100.64.0.58]: Wait until the etcd cluster is healthy (10 retries left).

In etcd logs the following can be seen:

"remote-addr":"100.64.0.57:38362","server-name":"","error":"tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Autobase CA\")"}
"remote-addr":"100.64.0.58:53956","server-name":"","error":"remote error: tls: bad certificate"}

It appears that etcd was not restarted after the new CA was added to the host. I had to log in and restart the etcd systemd service.
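The TLS messages quoted in this thread (the two above and one reported in a later comment) point at different sides of the handshake. As a rough triage aid, here is a minimal Python sketch that maps each message to what it generally means in Go's TLS stack, which etcd uses. The function name `classify_tls_error` and the hint texts are illustrative assumptions, not part of autobase or etcd, and the hints are a general reading of the messages rather than a confirmed diagnosis of this issue.

```python
# Illustrative triage helper. The names and hint texts below are assumptions
# made for this example; they are not autobase or etcd code.
TLS_HINTS = [
    (
        "certificate signed by unknown authority",
        "the certificate presented by the peer is not signed by a CA this node trusts",
    ),
    (
        "remote error: tls: bad certificate",
        "the remote side rejected the certificate this node presented",
    ),
    (
        "client didn't provide a certificate",
        "client certificate authentication is on and the client sent no certificate",
    ),
]


def classify_tls_error(log_line: str) -> str:
    """Return a short hint for a known TLS error substring in an etcd log line."""
    for needle, hint in TLS_HINTS:
        if needle in log_line:
            return hint
    return "unrecognized TLS error"


if __name__ == "__main__":
    print(classify_tls_error('"error":"remote error: tls: bad certificate"'))
```

Seeing the first two messages together on different nodes is consistent with the nodes disagreeing about which CA is current, which matches the observation that a restart made the error go away.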

Expected behavior

etcd is expected to pick up the CA certificates properly.

Steps to reproduce

  1. Have clean SSH VMs
  2. Run the deploy_pgcluster playbook with mostly default settings

Installation method

Command line

System info

AlmaLinux 9 machines, autobase latest master branch.

Additional info

No response

May 03 '25 10:05 realkarmakun

Certificates are generated before etcd starts, so an additional restart usually isn’t needed. We’ve already deployed many times to production without encountering this issue.

@realkarmakun Is this happening consistently on every deployment, or is it just a one-off?

May 03 '25 11:05 vitabaks

I have yet to check it against freshly installed VMs, but it was consistent whenever I first removed the cluster using the remove_cluster playbook.

May 03 '25 14:05 realkarmakun

I also ran into this in a new cluster just now. Solved it by setting tls_cert_regenerate: true (or -e "tls_cert_regenerate=true" when running interactively).

The issue is likely a mismatch in the certs generated during the initial run with respect to hostname matching. The hostname is altered in the process as well, and at some point something causes a hiccup or mismatch.

A more failsafe approach would be to set tls_cert_regenerate: true by default. While this would make runs take a little longer, it should prevent these issues in the first place. I'm not fully sure, though; it could also result in the same issues. If so, there is likely an issue with the execution order, i.e. certs should be generated after all hostname-related tasks have finished.

May 06 '25 10:05 pat-s

certs should be generated after all hostname-related tasks have finished.

We generate certificates after changing the hostname.

May 06 '25 11:05 vitabaks

tls_cert_regenerate is enabled by default https://github.com/vitabaks/autobase/blob/2.2.0/automation/vars/main.yml#L90

May 06 '25 11:05 vitabaks

I hit a different issue, not cert verification: "error":"tls: client didn't provide a certificate"

I'm using the default settings without changing anything:

# if dcs_type: "etcd" and dcs_exists: false
etcd_version: "3.5.20" # version for deploy etcd cluster
etcd_data_dir: "/var/lib/etcd"
etcd_cluster_name: "etcd-{{ patroni_cluster_name }}" # ETCD_INITIAL_CLUSTER_TOKEN
etcd_on_dedicated_nodes: "{{ groups['etcd_cluster'] | difference(groups['postgres_cluster']) | length > 0 }}" # 'true' or 'false'
# TLS
# Enables TLS encryption with a self-signed certificate if 'tls_cert_generate' is true.
etcd_tls_enable: "{{ tls_cert_generate | default(true) }}"
etcd_tls_dir: "/etc/etcd/tls"
etcd_tls_ca_crt: "ca.crt"
etcd_tls_ca_key: "ca.key"
etcd_tls_server_crt: "server.crt"
etcd_tls_server_key: "server.key"
etcd_client_cert_auth: "{{ 'true' if not etcd_on_dedicated_nodes | bool else 'false' }}"

I tried setting etcd_client_cert_auth: 'false' and restarting etcd, but it still asks for a certificate.
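To make the two derived defaults quoted above easier to read, here is a small Python sketch that mirrors the Jinja expressions for etcd_on_dedicated_nodes and etcd_client_cert_auth. The function names and the example inventory are hypothetical; only the logic is taken from the variables shown above.

```python
# Sketch of how the quoted defaults evaluate. Function names and the sample
# inventory are hypothetical; the logic mirrors the Jinja expressions above.

def etcd_on_dedicated_nodes(etcd_cluster, postgres_cluster):
    # Mirrors: groups['etcd_cluster'] | difference(groups['postgres_cluster']) | length > 0
    only_etcd = [host for host in etcd_cluster if host not in postgres_cluster]
    return len(only_etcd) > 0


def etcd_client_cert_auth(etcd_cluster, postgres_cluster):
    # Mirrors: 'true' if not etcd_on_dedicated_nodes | bool else 'false'
    dedicated = etcd_on_dedicated_nodes(etcd_cluster, postgres_cluster)
    return "true" if not dedicated else "false"


if __name__ == "__main__":
    # Hypothetical inventory: etcd runs on the same hosts as PostgreSQL.
    nodes = ["100.64.0.57", "100.64.0.58", "100.64.0.59"]
    print(etcd_client_cert_auth(nodes, nodes))
```

Under these assumptions, when every etcd host is also a PostgreSQL host the default evaluates to 'true', so clients are expected to present a certificate. When at least one etcd host is outside the PostgreSQL group, it evaluates to 'false'.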

Jul 10 '25 06:07 JY-210

Can the task be closed or is the problem still relevant?

Aug 21 '25 17:08 vitabaks

I wasn't able to replicate it so yeah. I'll close it

Aug 24 '25 00:08 realkarmakun