Demotion of primary hangs when logical replication connection is established
What happened?
We are running a two-member Postgres cluster on K8s using the operator and Patroni 3.0.1. A client application had a logical replication connection established to Postgres. Patroni failed to reach the DCS and initiated demotion of the primary. Because of the logical replication connection, the demotion hung indefinitely until the replication connection was manually closed by killing the client.
How can we reproduce it (as minimally and precisely as possible)?
- Have a 2+ member cluster
- Establish a logical replication connection to the primary (a minimal client sketch follows this list)
- Trigger demotion of the primary by making the DCS unavailable
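A minimal client sketch that keeps a walsender waiting, assuming psycopg2 and the built-in test_decoding output plugin (the host, credentials, and slot name are placeholders, and this is not necessarily how our actual client behaves):

```python
import psycopg2
import psycopg2.extras

# Placeholder DSN: point it at the current primary.
conn = psycopg2.connect(
    "host=primary.example dbname=postgres user=standby",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# test_decoding ships with PostgreSQL, so it is enough for a repro.
cur.create_replication_slot("repro_slot", output_plugin="test_decoding")
cur.start_replication(slot_name="repro_slot", decode=True)

def consume(msg):
    # Deliberately never call msg.cursor.send_feedback(flush_lsn=...),
    # so the walsender never gets confirmation that the changes were
    # received; a clean shutdown then waits on this connection.
    print(msg.payload)

cur.consume_stream(consume)
```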
What did you expect to happen?
Demotion should succeed and the logical replication connection should be closed forcefully.
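By "closed forcefully" we mean something along the lines of the following sketch (placeholder DSN; it would have to run from a session opened before the shutdown starts, because new connections are rejected with "the database system is shutting down"):

```python
import psycopg2

# Placeholder DSN for a superuser session on the primary.
with psycopg2.connect("host=primary.example dbname=postgres user=postgres") as conn:
    with conn.cursor() as cur:
        # Terminate every walsender backend serving a replication connection.
        cur.execute(
            "SELECT pg_terminate_backend(pid), application_name"
            " FROM pg_stat_replication"
        )
        print(cur.fetchall())
```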
Patroni/PostgreSQL/DCS version
- ghcr.io/zalando/spilo-15:3.0-p1
- Patroni version: 3.0.1
- PostgreSQL version: 15
- DCS (and its version): K8S etcd
Patroni configuration file
bootstrap:
  dcs:
    failsafe_mode: false
    loop_wait: 10
    maximum_lag_on_failover: 33554432
    postgresql:
      parameters:
        archive_mode: 'on'
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_completion_target: 0.9
        hot_standby: 'on'
        log_autovacuum_min_duration: 0
        log_checkpoints: 'on'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
        log_lock_waits: 'on'
        log_min_duration_statement: 500
        log_statement: ddl
        log_temp_files: 0
        max_connections: 100
        max_replication_slots: 10
        max_wal_senders: 10
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 100
        track_functions: all
        wal_compression: 'on'
        wal_level: logical
        wal_log_hints: 'on'
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    ttl: 30
  initdb:
  - auth-host: md5
  - auth-local: trust
  post_init: /scripts/post_init.sh "zalandos"
  users:
    zalandos:
      options:
      - CREATEDB
      - NOLOGIN
      password: ''
kubernetes:
  bypass_api_service: true
  labels:
    application: spilo
  port: tcp://10.233.0.1:443
  port_443_tcp: tcp://10.233.0.1:443
  port_443_tcp_addr: 10.233.0.1
  port_443_tcp_port: '443'
  port_443_tcp_proto: tcp
  ports:
  - name: postgresql
    port: 5432
  role_label: spilo-role
  scope_label: cluster-name
  service_host: 10.233.0.1
  service_port: '443'
  service_port_https: '443'
  use_endpoints: true
namespace: REDACTED
postgresql:
  authentication:
    replication:
      password: REDACTED
      username: standby
    superuser:
      password: REDACTED
      username: postgres
  basebackup_fast_xlog:
    command: /scripts/basebackup.sh
    retries: 2
  bin_dir: /usr/lib/postgresql/15/bin
  callbacks:
    on_role_change: /scripts/on_role_change.sh zalandos true
  connect_address: 10.233.71.212:5432
  create_replica_method:
  - wal_e
  - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: '*:5432'
  name: REDACTED
  parameters:
    archive_command: envdir "/run/etc/wal-e.d/env" wal-g wal-push "%p"
    bg_mon.history_buckets: 120
    bg_mon.listen_address: '::'
    extwlist.custom_path: /scripts
    extwlist.extensions: btree_gin,btree_gist,citext,extra_window_functions,first_last_agg,hll,hstore,hypopg,intarray,ltree,pgcrypto,pgq,pgq_node,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,pg_partman
    log_destination: csvlog
    log_directory: ../pg_log
    log_file_mode: '0644'
    log_filename: postgresql-%u.log
    log_rotation_age: 1d
    log_truncate_on_rotation: 'on'
    logging_collector: 'on'
    pg_stat_statements.track_utility: 'off'
    shared_buffers: 409MB
    shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,pg_cron,pg_stat_kcache
    ssl: 'on'
    ssl_cert_file: /run/certs/server.crt
    ssl_key_file: /run/certs/server.key
  pg_hba:
  - local all all trust
  - host all all all md5
  - host all +zalandos all pam
  - local replication standby trust
  - hostssl replication standby all md5
  pgpass: /run/postgresql/pgpass
  recovery_conf:
    restore_command: envdir "/run/etc/wal-e.d/env" timeout "0" /scripts/restore_command.sh "%f" "%p"
  use_unix_socket: true
  use_unix_socket_repl: true
  wal_e:
    command: envdir /run/etc/wal-e.d/env bash /scripts/wale_restore.sh
    no_master: 1
    retries: 2
    threshold_backup_size_percentage: 30
    threshold_megabytes: 102400
restapi:
  connect_address: 10.233.71.212:8008
  listen: :8008
scope: REDACTED
patronictl show-config
failsafe_mode: false
loop_wait: 10
maximum_lag_on_failover: 33554432
pg_hba:
- local all all trust
- host all all all md5
- host all +zalandos all pam
- local replication standby trust
- hostssl replication standby all md5
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    max_connections: 100
    max_replication_slots: 10
    max_wal_senders: 10
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_functions: all
    wal_compression: 'on'
    wal_level: logical
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30
Patroni log files
2024-02-14 06:25:06,976 ERROR: Error communicating with DCS
2024-02-14 06:25:06,976 INFO: demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:06,976 INFO: Demoting self (offline)
2024-02-14 06:25:06,977 WARNING: Loop time exceeded, rescheduling immediately.
2024-02-14 06:25:06,978 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:07,083 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:17,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:17,487 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:22.673 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:25:22.675 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:25:22.675 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down
2024-02-14 06:25:22.675 UTC [40] WARNING [postgres]: default timeout
2024-02-14 06:25:22.676 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:25:22.676 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down
2024-02-14 06:25:22.677 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:25:22.677 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down
2024-02-14 06:25:22.677 UTC [40] WARNING [ccc]: default timeout
2024-02-14 06:25:27,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:27,489 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:37,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:37,488 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:47,481 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:47,495 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:52.701 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:25:57,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:57,487 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:07,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:07,487 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:17,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:22,481 ERROR: Request to server https://10.233.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=4.999785443767905)")
2024-02-14 06:26:22.701 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:26:22.703 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:26:22.703 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down
2024-02-14 06:26:22.704 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:26:22.704 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down
2024-02-14 06:26:22.704 UTC [40] WARNING [postgres]: default timeout
2024-02-14 06:26:22.705 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:26:22.705 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down
2024-02-14 06:26:22.705 UTC [40] WARNING [ccc]: default timeout
2024-02-14 06:26:23,374 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:27,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:27,492 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:37,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:39,992 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:47,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:47,489 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:47,489 ERROR: ObjectCache.run ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
2024-02-14 06:26:52.729 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:26:57,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:02,481 ERROR: Request to server https://10.233.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=4.99343760125339)")
2024-02-14 06:27:05,301 ERROR: Request to server https://10.233.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=2.1720608193427324)")
2024-02-14 06:27:05,752 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:07,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:07,488 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:17,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:17,489 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:22.733 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:27:22.735 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:27:22.735 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down
2024-02-14 06:27:22.735 UTC [40] WARNING [postgres]: default timeout
2024-02-14 06:27:22.737 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:27:22.737 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down
2024-02-14 06:27:22.737 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:27:22.737 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is shutting down
2024-02-14 06:27:22.737 UTC [40] WARNING [ccc]: default timeout
2024-02-14 06:27:27,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:27,487 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:37,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:37,488 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:47,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:47,488 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:52.761 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
PostgreSQL log files
Please see the Patroni log files section above.
Have you tried to use GitHub issue search?
- [X] Yes
Anything else we need to know?
We enabled failsafe_mode as a mitigation.
Patroni currently only forces termination of such connections in synchronous mode (see the primary_stop_timeout parameter). Regarding the original issue, it is really a problem of client libraries not implementing the replication protocol correctly. Patroni just waits for a clean shutdown, which is not possible while there is at least one walsender that has not received confirmation that all changes were received by the client.
JFYI, the problem of the endless shutdown should have been fixed in pgjdbc v42.7.0 ("Use KeepAlive to confirm LSNs"), although I don't know what you are actually using in the client application.
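For illustration, confirming LSNs looks roughly like this with psycopg2 (a sketch only; the DSN, slot name, and process() handler are placeholders, and pgjdbc achieves the same effect through its keepalive handling):

```python
import psycopg2
import psycopg2.extras

# Placeholder DSN; the slot "repro_slot" is assumed to already exist.
conn = psycopg2.connect(
    "host=primary.example dbname=postgres user=standby",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="repro_slot", decode=True)

def process(payload):
    # Hypothetical application-side handling of the decoded change.
    print(payload)

def consume(msg):
    process(msg.payload)
    # Report the flushed LSN back to the walsender; once all changes are
    # confirmed, the server can finish a fast shutdown and Patroni's
    # demotion completes.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)
```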
And please make sure you have actually checked the previously opened issues; this has been discussed many times.