
Kubernetes DCS Postgres DB pod stuck in read-only state

Open · douglasawh opened this issue 6 months ago

Please answer some short questions which should help us to understand your problem / question better.

  • Which image of the operator are you using? postgres-operator:v1.14.0
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? Rancher Kubernetes
  • Are you running Postgres Operator in production? yes
  • Type of issue? Bug report

Some general remarks when posting a bug report:

  • Please, check the operator, pod (Patroni) and postgresql logs first. When copy-pasting many log lines please do it in a separate GitHub gist together with your Postgres CRD and configuration manifest.

We had a Zalando Operator Postgres DB pod that was stuck in a read-only state. While increasing memory for our Zalando Postgres DB cluster in Kubernetes today, the cluster promoted the replica to leader, but writes to the new leader failed because it remained read-only.

Here are the logs from that event.

postgres-1 (new leader):

2025-05-06 21:25:14,413 INFO: promoted self to leader by acquiring session lock
server promoting
2025-05-06 21:25:15,465 INFO: no action. I am (gitlab-code-postgres-1), the leader with the lock
ERROR: cannot execute ALTER ROLE in a read-only transaction
CONTEXT: SQL statement "ALTER ROLE admin WITH CREATEDB NOLOGIN NOCREATEROLE NOSUPERUSER NOREPLICATION INHERIT"
PL/pgSQL function inline_code_block line 5 at SQL statement
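
For reference, while the pod was in that state a quick check along the lines below would confirm whether the promoted node had actually left recovery and was accepting writes (a minimal sketch; the host, port, and credentials are placeholders, not our real values):

import psycopg2

# Check whether the supposed leader has left recovery and is writable.
# Connection details are illustrative placeholders only.
conn = psycopg2.connect(host="gitlab-code-postgres-1", port=5032,
                        dbname="postgres", user="postgres",
                        connect_timeout=3)
with conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery(), "
                "current_setting('transaction_read_only'), "
                "current_setting('default_transaction_read_only')")
    in_recovery, txn_ro, default_txn_ro = cur.fetchone()
    print(f"pg_is_in_recovery={in_recovery}, "
          f"transaction_read_only={txn_ro}, "
          f"default_transaction_read_only={default_txn_ro}")
conn.close()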

postgres-0 (new replica):

2025-05-06 21:25:15,429 INFO: Lock owner: gitlab-code-postgres-1; I am gitlab-code-postgres-0
2025-05-06 21:25:15,430 INFO: switchover: demote in progress
2025-05-06 21:25:16,382 INFO: Local timeline=51 lsn=6/9B000028
2025-05-06 21:25:16,396 ERROR: Exception when working with leader
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/[rewind.py](https://rewind.py/)", line 80, in check_leader_is_not_in_recovery
with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
File "/usr/lib/python3.10/[contextlib.py](https://contextlib.py/)", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/[connection.py](https://connection.py/)", line 158, in get_connection_cursor
conn = psycopg.connect(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/patroni/[psycopg.py](https://psycopg.py/)", line 123, in connect
ret = _connect(*args, **kwargs)
File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "****", port [****] failed: session is read-only

2025-05-06 21:25:16,397 INFO: closed patroni connections to postgres
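
If I'm reading the traceback right, check_leader_is_not_in_recovery never gets to run its query: libpq rejects the connection up front because the connection options ask for a read-write session while the node it believes is the leader still reports itself read-only. Roughly the following, though with placeholder connection details rather than Patroni's exact call:

import psycopg2

try:
    # libpq refuses the connection outright when a read-write session is
    # requested but the server is read-only, producing the same
    # "session is read-only" OperationalError seen in the traceback above.
    conn = psycopg2.connect(host="gitlab-code-postgres-1", port=5032,
                            dbname="postgres", user="postgres",
                            connect_timeout=3,
                            target_session_attrs="read-write")
except psycopg2.OperationalError as exc:
    print(exc)
else:
    conn.close()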

Restarting the pods several times seems to have resolved the issue.

There seems to be a class of these issues in Patroni. Here is one with etcd: https://github.com/patroni/patroni/issues/3314, and one with Consul: https://github.com/patroni/patroni/issues/2972.

  • If you feel this issue might be more related to the Spilo docker image or Patroni, consider opening issues in the respective repos.

Honestly, I'm not sure; this is my first time personally using the Zalando Operator (my previous experience is with generic Patroni). There is a similar class of bugs in Patroni, so perhaps the problem lies in the Kubernetes DCS?

We have not reproduced it. However, it might be possible to reproduce with memory pressure.

Here is the Patroni configuration from the pod:

bootstrap:
  dcs:
    failsafe_mode: false
    loop_wait: 10
    maximum_lag_on_failover: 33550432
    postgresql:
      parameters:
        archive_mode: 'on'
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_completion_target: 0.9
        hot_standby: 'on'
        log_autovacuum_min_duration: 0
        log_checkpoints: 'on'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
        log_lock_waits: 'on'
        log_min_duration_statement: 500
        log_statement: ddl
        log_temp_files: 0
        max_connections: 500
        max_replication_slots: 10
        max_wal_senders: 10
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 50
        track_functions: all
        wal_compression: 'on'
        wal_keep_size: 4096MB
        wal_level: hot_standby
        wal_log_hints: 'on'
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    ttl: 30
  initdb:
  - auth-host: md5
  - auth-local: trust
  post_init: /scripts/post_init.sh "zalandos"
kubernetes:
  bypass_api_service: true
  labels:
    application: spilo
  leader_label_value: master
  pod_ip: 50.50.50.50
  port: tcp://50.12.0.1:443
  port_443_tcp: tcp://50.12.0.1:443
  port_443_tcp_addr: 50.12.0.1
  port_443_tcp_port: '443'
  port_443_tcp_proto: tcp
  ports:
  - name: postgresql
    port: 5032
  role_label: spilo-role
  scope_label: cluster-name
  service_host: 50.12.0.1
  service_port: '443'
  service_port_https: '443'
  standby_leader_label_value: master
  use_endpoints: true
namespace: gitlab-code
postgresql:
  authentication:
    replication:
      password:
      username: standby
    superuser:
      password:
      username: postgres
  basebackup_fast_xlog:
    command: /scripts/basebackup.sh
    retries: 2
  bin_dir: /usr/lib/postgresql/14/bin
  callbacks:
    on_role_change: /scripts/on_role_change.sh zalandos true
  connect_address: 50.50.50.50:5032
  create_replica_method:
  - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: '*:5032'
  name: gitlab-code-postgres-1
  parameters:
    archive_command: /bin/true
    bg_mon.history_buckets: 120
    bg_mon.listen_address: 0.0.0.0
    extwlist.custom_path: /scripts
    extwlist.extensions: btree_gin,btree_gist,citext,extra_window_functions,first_last_agg,hll,hstore,hypopg,intarray,ltree,pgcrypto,pgq,pgq_node,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,timescaledb,pg_partman
    log_destination: csvlog
    log_directory: ../pg_log
    log_file_mode: '0644'
    log_filename: postgresql-%u.log
    log_rotation_age: 1d
    log_truncate_on_rotation: 'on'
    logging_collector: 'on'
    pg_stat_statements.track_utility: 'off'
    shared_buffers: 8002MB
    shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache
    ssl: 'on'
    ssl_ca_file: /tlsca/caFile
    ssl_cert_file: /tls/tls.crt
    ssl_key_file: /tls/tls.key
  pg_hba:
  - local all all trust
  - hostssl all +zalandos 127.0.0.1/32 pam
  - host all all 127.0.0.1/32 md5
  - hostssl all +zalandos ::1/128 pam
  - host all all ::1/128 md5
  - local replication standby trust
  - hostssl replication standby all md5
  - hostnossl all all all reject
  - hostssl all +zalandos all pam
  - hostssl all all all md5
  pgpass: /run/postgresql/pgpass
  use_unix_socket: true
  use_unix_socket_repl: true
restapi:
  connect_address: 50.50.50.50:8008
  listen: :8008
scope: gitlab-code-postgres

  • Patroni version: 4.0.4

  • PostgreSQL version: 14

  • DCS (and its version): Kubernetes

Please let me know if any additional information would be useful, or if you think I should file this against Patroni or Spilo instead.

douglasawh · May 14 '25 16:05