kubeblocks icon indicating copy to clipboard operation
kubeblocks copied to clipboard

[BUG] Chaos-Mesh IO latency fault causes pg startup errors

Open JashBook opened this issue 5 months ago • 3 comments

Describe the bug A clear and concise description of what the bug is.

kbcli version
Kubernetes: v1.30.4-vke.4
KubeBlocks: 0.9.5-beta.8
kbcli: 0.9.5-beta.7

To Reproduce Steps to reproduce the behavior:

  1. create pg cluster
apiVersion: apps.kubeblocks.io/v1alpha1
kind: Cluster
metadata:
  name: postgres-fuhcpe
  namespace: default
spec:
  clusterDefinitionRef: postgresql
  topology: replication
  terminationPolicy: Delete
  componentSpecs:
    - name: postgresql
      serviceVersion: 12.14.0
      replicas: 2
      resources:
        requests:
          cpu: 100m
          memory: 0.5Gi
        limits:
          cpu: 100m
          memory: 0.5Gi
      volumeClaimTemplates:
        - name: data
          spec:
            storageClassName: 
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 20Gi
kbcli cluster list postgres-fuhcpe --show-labels  --namespace default 
    
NAME              NAMESPACE   CLUSTER-DEFINITION   VERSION   TERMINATION-POLICY   STATUS    CREATED-TIME                 LABELS                                                                               
postgres-fuhcpe   default     postgresql                     Delete               Running   Jul 31,2025 17:09 UTC+0800   clusterdefinition.kubeblocks.io/name=postgresql,clusterversion.kubeblocks.io/name=   

kbcli cluster list-instances postgres-fuhcpe --namespace default
    
NAME                           NAMESPACE   CLUSTER           COMPONENT    STATUS    ROLE        ACCESSMODE   AZ               CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE     NODE                          CREATED-TIME                 
postgres-fuhcpe-postgresql-0   default     postgres-fuhcpe   postgresql   Running   secondary   <none>       cn-guangzhou-b   100m / 100m          512Mi / 512Mi           data:20Gi   192.168.0.88/192.168.0.88     Jul 31,2025 17:09 UTC+0800   
postgres-fuhcpe-postgresql-1   default     postgres-fuhcpe   postgresql   Running   primary     <none>       cn-guangzhou-b   100m / 100m          512Mi / 512Mi           data:20Gi   192.168.0.167/192.168.0.167   Jul 31,2025 17:09 UTC+0800  
  1. install chaos-mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org/
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace --version 2.7.2  --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock
  1. inject IO latency fault
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: test-chaos-mesh-latencyover-postgres-fuhcpe
  namespace: default
spec:
  action: latency
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      apps.kubeblocks.io/pod-name: postgres-fuhcpe-postgresql-1
  volumePath: /home/postgres/pgdata
  path: '/home/postgres/pgdata/**/*'
  delay: '2000ms'
  percent: 100
  duration: 2m
  1. See error
kubectl get cluster postgres-fuhcpe
NAME              CLUSTER-DEFINITION   VERSION   TERMINATION-POLICY   STATUS     AGE
postgres-fuhcpe   postgresql                     Delete               Updating   33m

kbcli cluster list-instances postgres-fuhcpe
NAME                           NAMESPACE   CLUSTER           COMPONENT    STATUS    ROLE      ACCESSMODE   AZ               CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE     NODE                          CREATED-TIME                 
postgres-fuhcpe-postgresql-0   default     postgres-fuhcpe   postgresql   Running   primary   <none>       cn-guangzhou-b   100m / 100m          512Mi / 512Mi           data:20Gi   192.168.0.88/192.168.0.88     Jul 31,2025 17:09 UTC+0800   
postgres-fuhcpe-postgresql-1   default     postgres-fuhcpe   postgresql   Running   <none>    <none>       cn-guangzhou-b   100m / 100m          512Mi / 512Mi           data:20Gi   192.168.0.167/192.168.0.167   Jul 31,2025 17:09 UTC+0800   

logs error pod

kubectl logs postgres-fuhcpe-postgresql-1 
Defaulted container "postgresql" out of: postgresql, pgbouncer, exporter, lorry, config-manager, pg-init-container (init), init-lorry (init)
2025-07-31 09:10:42,005 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-07-31 09:10:44,008 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2025-07-31 09:10:44,008 - bootstrapping - INFO - No meta-data available for this provider
2025-07-31 09:10:44,009 - bootstrapping - INFO - Looks like you are running local
2025-07-31 09:10:44,099 - bootstrapping - INFO - kubeblocks generate local configuration: 
bootstrap:
  dcs:
    check_timeline: true
    loop_wait: 10
    max_timelines_history: 0
    maximum_lag_on_failover: 1048576
    postgresql:
      parameters:
        archive_command: /bin/true
        archive_mode: 'on'
        autovacuum_analyze_scale_factor: '0.1'
        autovacuum_max_workers: '3'
        autovacuum_vacuum_scale_factor: '0.05'
        checkpoint_completion_target: '0.9'
        log_autovacuum_min_duration: '10000'
        log_checkpoints: 'True'
        log_connections: 'False'
        log_disconnections: 'False'
        log_min_duration_statement: '1000'
        log_statement: ddl
        log_temp_files: 128kB
        max_connections: '56'
        max_locks_per_transaction: '64'
        max_prepared_transactions: '100'
        max_replication_slots: '16'
        max_wal_senders: '64'
        max_worker_processes: '8'
        tcp_keepalives_idle: 45s
        tcp_keepalives_interval: 10s
        track_commit_timestamp: 'False'
        track_functions: pl
        wal_compression: 'True'
        wal_keep_segments: '0'
        wal_level: replica
        wal_log_hints: 'False'
    retry_timeout: 10
    ttl: 30
  initdb:
  - auth-host: md5
  - auth-local: trust
postgresql:
  config_dir: /home/postgres/pgdata/conf
  custom_conf: /home/postgres/conf/postgresql.conf
  parameters:
    pg_stat_statements.track_utility: 'False'
    shared_buffers: 128MB
    shared_preload_libraries: pg_stat_statements,auto_explain,bg_mon,pgextwlist,pg_auth_mon,set_user,pg_cron,pg_stat_kcache,timescaledb,pgaudit
  pg_hba:
  - host     all             all             0.0.0.0/0                md5
  - host     all             all             ::/0                     md5
  - local    all             all                                     trust
  - host     all             all             127.0.0.1/32            trust
  - host     all             all             ::1/128                 trust
  - local     replication     all                                    trust
  - host      replication     all             0.0.0.0/0               md5
  - host      replication     all             ::/0                    md5

2025-07-31 09:10:44,200 - bootstrapping - INFO - Configuring pam-oauth2
2025-07-31 09:10:44,200 - bootstrapping - INFO - No PAM_OAUTH2 configuration was specified, skipping
2025-07-31 09:10:44,200 - bootstrapping - INFO - Configuring patroni
2025-07-31 09:10:44,297 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-07-31 09:10:44,298 - bootstrapping - INFO - Configuring wal-e
2025-07-31 09:10:44,298 - bootstrapping - INFO - Configuring certificate
2025-07-31 09:10:44,298 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-07-31 09:10:47,799 - bootstrapping - INFO - Configuring pgqd
2025-07-31 09:10:47,799 - bootstrapping - INFO - Configuring standby-cluster
2025-07-31 09:10:47,799 - bootstrapping - INFO - Configuring log
2025-07-31 09:10:47,799 - bootstrapping - INFO - Configuring pgbouncer
2025-07-31 09:10:47,799 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-07-31 09:10:47,799 - bootstrapping - INFO - Configuring bootstrap
2025-07-31 09:10:47,799 - bootstrapping - INFO - Configuring crontab
2025-07-31 09:10:47,799 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-07-31 09:10:50,106 INFO: Selected new K8s API server endpoint https://192.168.0.2:6443
2025-07-31 09:10:50,210 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-07-31 09:10:50,211 INFO: Lock owner: None; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:10:50,341 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /home/postgres/pgdata/pgroot/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    /usr/lib/postgresql/12/bin/pg_ctl -D /home/postgres/pgdata/pgroot/data -l logfile start

2025-07-31 09:10:55,898 INFO: postmaster pid=133
/var/run/postgresql:5432 - no response
2025-07-31 09:10:55 GMT [133]: [1-1] 688b331f.85 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-07-31 09:10:55 GMT [133]: [2-1] 688b331f.85 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2025-07-31 09:10:55 GMT [133]: [3-1] 688b331f.85 0     LOG:  pgaudit extension initialized
2025-07-31 09:10:55 GMT [133]: [4-1] 688b331f.85 0     LOG:  starting PostgreSQL 12.18 (Ubuntu 12.18-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2025-07-31 09:10:55 GMT [133]: [5-1] 688b331f.85 0     LOG:  listening on IPv4 address "0.0.0.0", port 5432
2025-07-31 09:10:55 GMT [133]: [6-1] 688b331f.85 0     LOG:  listening on IPv6 address "::", port 5432
2025-07-31 09:10:55 GMT [133]: [7-1] 688b331f.85 0     LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2025-07-31 09:10:56 GMT [133]: [8-1] 688b331f.85 0     LOG:  redirecting log output to logging collector process
2025-07-31 09:10:56 GMT [133]: [9-1] 688b331f.85 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - accepting connections
/var/run/postgresql:5432 - accepting connections
2025-07-31 09:10:56,908 INFO: establishing a new patroni connection to the postgres cluster
2025-07-31 09:10:57,034 INFO: running post_bootstrap
DO
GRANT ROLE
DO
DO
CREATE EXTENSION
NOTICE:  version "1.1" of extension "pg_auth_mon" is already installed
ALTER EXTENSION
GRANT
CREATE EXTENSION
DO
NOTICE:  version "1.5" of extension "pg_cron" is already installed
ALTER EXTENSION
ALTER POLICY
REVOKE
GRANT
REVOKE
GRANT
ALTER POLICY
REVOKE
GRANT
CREATE FUNCTION
REVOKE
GRANT
REVOKE
GRANT
CREATE EXTENSION
DO
CREATE TABLE
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
RESET
SET
NOTICE:  schema "zmon_utils" does not exist, skipping
DROP SCHEMA
DO
NOTICE:  language "plpythonu" does not exist, skipping
DROP LANGUAGE
NOTICE:  function plpython_call_handler() does not exist, skipping
DROP FUNCTION
NOTICE:  function plpython_inline_handler(internal) does not exist, skipping
DROP FUNCTION
NOTICE:  function plpython_validator(oid) does not exist, skipping
DROP FUNCTION
CREATE SCHEMA
GRANT
SET
CREATE TYPE
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
REVOKE
CREATE VIEW
REVOKE
GRANT
GRANT
You are now connected to database "postgres" as user "postgres".
CREATE SCHEMA
GRANT
SET
CREATE FUNCTION
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
GRANT
RESET
CREATE EXTENSION
CREATE EXTENSION
CREATE EXTENSION
NOTICE:  version "3.0" of extension "set_user" is already installed
ALTER EXTENSION
GRANT
GRANT
GRANT
CREATE SCHEMA
GRANT
SET
CREATE FUNCTION
CREATE VIEW
CREATE FUNCTION
CREATE VIEW
CREATE FUNCTION
CREATE VIEW
REVOKE
GRANT
REVOKE
GRANT
RESET
You are now connected to database "template1" as user "postgres".
CREATE SCHEMA
GRANT
SET
CREATE FUNCTION
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
GRANT
RESET
CREATE EXTENSION
CREATE EXTENSION
CREATE EXTENSION
NOTICE:  version "3.0" of extension "set_user" is already installed
ALTER EXTENSION
GRANT
GRANT
GRANT
CREATE SCHEMA
GRANT
SET
CREATE FUNCTION
CREATE VIEW
CREATE FUNCTION
CREATE VIEW
CREATE FUNCTION
CREATE VIEW
REVOKE
GRANT
REVOKE
GRANT
RESET
2025-07-31 09:11:02,002 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2025-07-31 09:11:02,214 INFO: initialized a new cluster
2025-07-31 09:11:02,224 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:11:12,761 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:11:22,760 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:11:32,803 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:11:42,798 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:11:49.013 UTC [72] LOG Starting pgqd 3.5
2025-07-31 09:11:49.013 UTC [72] LOG auto-detecting dbs ...
2025-07-31 09:11:52,723 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:12:02,760 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:12:12,800 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:12:19.097 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:12:23,102 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:12:32,997 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:12:42,899 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:12:49.097 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:12:52,900 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:13:03,498 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:13:12,899 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:13:19.127 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:13:22,762 INFO: no action. I am (postgres-fuhcpe-postgresql-1), the leader with the lock
2025-07-31 09:13:30,199 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 673, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name), retry=retry)[0]
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 655, in query
    return self.server.query(sql, *params)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 749, in query
    raise e
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 745, in query
    cursor.execute(sql, params)
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

2025-07-31 09:13:36,205 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 673, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name), retry=retry)[0]
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 655, in query
    return self.server.query(sql, *params)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 749, in query
    raise e
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 745, in query
    cursor.execute(sql, params)
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

2025-07-31 09:13:42,209 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 673, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name), retry=retry)[0]
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 655, in query
    return self.server.query(sql, *params)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 749, in query
    raise e
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 745, in query
    cursor.execute(sql, params)
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

2025-07-31 09:13:48,213 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 673, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name), retry=retry)[0]
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 655, in query
    return self.server.query(sql, *params)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 749, in query
    raise e
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 745, in query
    cursor.execute(sql, params)
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

2025-07-31 09:13:49.128 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:13:54,216 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 673, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name), retry=retry)[0]
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 655, in query
    return self.server.query(sql, *params)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 749, in query
    raise e
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 745, in query
    cursor.execute(sql, params)
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout
...
2025-07-31 09:15:19.218 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:15:23,062 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 673, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name), retry=retry)[0]
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 655, in query
    return self.server.query(sql, *params)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 749, in query
    raise e
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 745, in query
    cursor.execute(sql, params)
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

2025-07-31 09:15:33,123 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 673, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name), retry=retry)[0]
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 657, in query
    return retry(self.server.query, sql, *params)
  File "/usr/local/lib/python3.10/dist-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 749, in query
    raise e
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 745, in query
    cursor.execute(sql, params)
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

2025-07-31 09:13:50,729 INFO: Lock owner: postgres-fuhcpe-postgresql-1; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:15:33,123 ERROR: failed to update leader lock
2025-07-31 09:15:33,124 INFO: Demoting self (immediate-nolock)
2025-07-31 09:15:33,124 WARNING: Exception happened during processing of request from ::ffff:192.168.0.239:59420
2025-07-31 09:15:33,125 WARNING: Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 683, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/lib/python3.10/http/server.py", line 433, in handle
    self.handle_one_request()
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 721, in handle_one_request
    BaseHTTPRequestHandler.handle_one_request(self)
  File "/usr/lib/python3.10/http/server.py", line 421, in handle_one_request
    method()
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 210, in do_GET_patroni
    self._write_status_response(200, response)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 90, in _write_status_response
    self._write_json_response(status_code, response)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 49, in _write_json_response
    self._write_response(status_code, json.dumps(response, default=str), content_type='application/json')
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 45, in _write_response
    self.end_headers()
  File "/usr/lib/python3.10/http/server.py", line 535, in end_headers
    self.flush_headers()
  File "/usr/lib/python3.10/http/server.py", line 539, in flush_headers
    self.wfile.write(b"".join(self._headers_buffer))
  File "/usr/lib/python3.10/socketserver.py", line 826, in write
    self._sock.sendall(b)
BrokenPipeError: [Errno 32] Broken pipe

2025-07-31 09:15:34,136 INFO: demoted self because failed to update leader lock in DCS
2025-07-31 09:15:34,136 ERROR: Exception during execution of long running task starting after demotion
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/async_executor.py", line 97, in run
    wakeup = func(*args) if args else func()
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 954, in follow
    self.config.write_recovery_conf(recovery_params)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/config.py", line 777, in write_recovery_conf
    open(self._standby_signal, 'w').close()
OSError: [Errno 107] Transport endpoint is not connected: '/home/postgres/pgdata/pgroot/data/standby.signal'
2025-07-31 09:15:34,137 WARNING: Loop time exceeded, rescheduling immediately.
2025-07-31 09:15:34,138 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:15:34,299 INFO: trying to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
mkdir: cannot create directory ‘/home/postgres/pgdata’: Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/wal_fast/*': Transport endpoint is not connected
/scripts/basebackup.sh: line 89: /home/postgres/pgdata/pgroot/wal_fast/receivewal.lock: Transport endpoint is not connected
flock: 9: Bad file descriptor
cat: /home/postgres/pgdata/pgroot/wal_fast/receivewal.pid: Transport endpoint is not connected
2025-07-31 09:15:34,400 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:15:34,899 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:15:38,102 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:15:38,102 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:15:48,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:15:48,104 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
2025-07-31 09:15:49.234 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:15:58,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:15:58,105 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:16:05,500 ERROR: Error creating replica using method basebackup_fast_xlog: /scripts/basebackup.sh exited with code=1
2025-07-31 09:16:05,500 ERROR: failed to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
2025-07-31 09:16:05,500 INFO: Removing data directory: /home/postgres/pgdata/pgroot/data
2025-07-31 09:16:08,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:16:08,141 INFO: trying to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
mkdir: cannot create directory ‘/home/postgres/pgdata’: Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/wal_fast/*': Transport endpoint is not connected
/scripts/basebackup.sh: line 89: /home/postgres/pgdata/pgroot/wal_fast/receivewal.lock: Transport endpoint is not connected
flock: 9: Bad file descriptor
cat: /home/postgres/pgdata/pgroot/wal_fast/receivewal.pid: Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:16:18,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:16:18,143 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:16:19.264 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:16:23.088 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:16:23.088 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:16:23.088 UTC [72] WARNING [executions_loop]: default timeout
2025-07-31 09:16:28,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:16:28,105 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
2025-07-31 09:16:33.134 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:16:33.134 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:16:33.134 UTC [72] WARNING [postgres]: default timeout
2025-07-31 09:16:33.134 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:16:33.134 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:16:38,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:16:38,105 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:16:39,008 ERROR: Error creating replica using method basebackup_fast_xlog: /scripts/basebackup.sh exited with code=1
2025-07-31 09:16:39,008 ERROR: failed to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
2025-07-31 09:16:39,008 INFO: Removing data directory: /home/postgres/pgdata/pgroot/data
2025-07-31 09:16:48,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:16:48,143 INFO: trying to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
mkdir: cannot create directory ‘/home/postgres/pgdata’: Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/wal_fast/*': Transport endpoint is not connected
/scripts/basebackup.sh: line 89: /home/postgres/pgdata/pgroot/wal_fast/receivewal.lock: Transport endpoint is not connected
flock: 9: Bad file descriptor
cat: /home/postgres/pgdata/pgroot/wal_fast/receivewal.pid: Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:16:49.280 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:16:58,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:16:58,146 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:17:08,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:17:08,105 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
2025-07-31 09:17:18,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:17:18,105 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:17:18,997 ERROR: Error creating replica using method basebackup_fast_xlog: /scripts/basebackup.sh exited with code=1
2025-07-31 09:17:18,997 ERROR: failed to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
2025-07-31 09:17:18,997 INFO: Removing data directory: /home/postgres/pgdata/pgroot/data
2025-07-31 09:17:19.310 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:17:23.091 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:17:23.091 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:17:23.091 UTC [72] WARNING [executions_loop]: default timeout
2025-07-31 09:17:28,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:17:28,141 INFO: trying to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
mkdir: cannot create directory ‘/home/postgres/pgdata’: Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/wal_fast/*': Transport endpoint is not connected
/scripts/basebackup.sh: line 89: /home/postgres/pgdata/pgroot/wal_fast/receivewal.lock: Transport endpoint is not connected
flock: 9: Bad file descriptor
...
flock: 9: Bad file descriptor
cat: /home/postgres/pgdata/pgroot/wal_fast/receivewal.pid: Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:42:18,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:42:18,144 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:42:20.469 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:42:23.170 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:42:23.170 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:42:23.170 UTC [72] WARNING [executions_loop]: default timeout
2025-07-31 09:42:28,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:42:28,104 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
2025-07-31 09:42:33.404 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:42:33.404 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:42:33.404 UTC [72] WARNING [postgres]: default timeout
2025-07-31 09:42:33.404 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:42:33.404 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:42:38,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:42:38,105 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:42:38,997 ERROR: Error creating replica using method basebackup_fast_xlog: /scripts/basebackup.sh exited with code=1
2025-07-31 09:42:38,997 ERROR: failed to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
2025-07-31 09:42:38,997 INFO: Removing data directory: /home/postgres/pgdata/pgroot/data
2025-07-31 09:42:48,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:42:48,148 INFO: trying to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
mkdir: cannot create directory ‘/home/postgres/pgdata’: Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/wal_fast/*': Transport endpoint is not connected
/scripts/basebackup.sh: line 89: /home/postgres/pgdata/pgroot/wal_fast/receivewal.lock: Transport endpoint is not connected
flock: 9: Bad file descriptor
cat: /home/postgres/pgdata/pgroot/wal_fast/receivewal.pid: Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:42:50.486 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:42:58,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:42:58,399 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:43:08,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:43:08,104 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
2025-07-31 09:43:18,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:43:18,105 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:43:19,000 ERROR: Error creating replica using method basebackup_fast_xlog: /scripts/basebackup.sh exited with code=1
2025-07-31 09:43:19,001 ERROR: failed to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
2025-07-31 09:43:19,001 INFO: Removing data directory: /home/postgres/pgdata/pgroot/data
2025-07-31 09:43:20.516 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:43:23.173 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:43:23.173 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:43:23.173 UTC [72] WARNING [executions_loop]: default timeout
2025-07-31 09:43:28,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:43:28,146 INFO: trying to bootstrap from leader 'postgres-fuhcpe-postgresql-0'
mkdir: cannot create directory ‘/home/postgres/pgdata’: Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
rm: cannot remove '/home/postgres/pgdata/pgroot/wal_fast/*': Transport endpoint is not connected
/scripts/basebackup.sh: line 89: /home/postgres/pgdata/pgroot/wal_fast/receivewal.lock: Transport endpoint is not connected
flock: 9: Bad file descriptor
cat: /home/postgres/pgdata/pgroot/wal_fast/receivewal.pid: Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:43:33.415 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:43:33.415 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:43:33.415 UTC [72] WARNING [postgres]: default timeout
2025-07-31 09:43:33.415 UTC [72] ERROR connection error: PQconnectStart
2025-07-31 09:43:33.415 UTC [72] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
		Is the server running locally and accepting connections on that socket?
2025-07-31 09:43:38,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:43:38,143 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
rm: cannot remove '/home/postgres/pgdata/pgroot/data': Transport endpoint is not connected
pg_basebackup: error: could not access directory "/home/postgres/pgdata/pgroot/data": Transport endpoint is not connected
2025-07-31 09:43:48,105 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:43:48,105 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress
2025-07-31 09:43:50.533 UTC [72] LOG {ticks: 0, maint: 0, retry: 0}
2025-07-31 09:43:58,104 INFO: Lock owner: postgres-fuhcpe-postgresql-0; I am postgres-fuhcpe-postgresql-1
2025-07-31 09:43:58,104 INFO: bootstrap from leader 'postgres-fuhcpe-postgresql-0' in progress

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Additional context Add any other context about the problem here.

JashBook avatar Jul 31 '25 09:07 JashBook

This issue has been marked as stale because it has been open for 30 days with no activity

github-actions[bot] avatar Sep 01 '25 00:09 github-actions[bot]

happening with istio as well

gabrieloandco avatar Oct 03 '25 10:10 gabrieloandco

This issue has been marked as stale because it has been open for 30 days with no activity

github-actions[bot] avatar Nov 03 '25 00:11 github-actions[bot]