helm-charts

[ISSUE] Postgresql went into not running state

Open this-is-r-gaurav opened this issue 3 years ago • 2 comments

Bug Description We deployed timescale-single (non-HA) and promscale on a production setup on EKS (NFS storage without encryption) with the default timescale configuration. A workload was running on that setup, and after a day or two a few of the pods on that node restarted, including the timescale pod. When the pod came up again, there was no socket at /var/run/postgresql/.s.PGSQL.5432 in the container. In the logs we were only able to see:

2022-06-21 12:18:11,298 WARNING: Postgresql is not running.

Although the socket was missing, a postgres --single process was running inside the container.
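To confirm this symptom from inside the container, a small check like the one below may help. It is only a sketch: the helper name `socket_status` is made up for illustration, and the path is the one from the report above. It distinguishes a missing socket from a path that exists but is not a socket.

```python
import os
import stat


def socket_status(path: str) -> str:
    """Classify a filesystem path as 'socket', 'missing', or 'not-a-socket'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


if __name__ == "__main__":
    # Path taken from the report above; run this inside the timescale container.
    print(socket_status("/var/run/postgresql/.s.PGSQL.5432"))
```

If this prints `missing` while a postgres process is still alive, that matches the state described here.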

Since promscale was unable to communicate with the DB, we disconnected prometheus and promscale, and after 4-5 hours Postgres was running again. We have been unable to debug the cause and need some help with that.

Expected behavior After timescale restarts, postgres should also be running.

Deployment

  • What is in your values.yaml
    • image: pg13.7-ts2.6.1-p0
  • What version of the Chart are you using?
    • chart 0.12.0
  • What is your Kubernetes Environment (for example: GKE, EKS, minikube, microk8s)
    • EKS with no encryption

this-is-r-gaurav • Jun 22 '22 06:06

I see lots of BIND SELECT processes being created, which consume too many file descriptors (FDs) on the system. Even when I increase the FD limit, it still does not work as expected.
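For anyone wanting to quantify the FD usage, here is a rough sketch that counts open descriptors per process by listing `/proc/<pid>/fd`. It assumes a Linux container with a readable `/proc`; the function name `count_open_fds` is hypothetical, not part of any tool mentioned in this thread.

```python
import os
from typing import Dict


def count_open_fds(proc_root: str = "/proc") -> Dict[int, int]:
    """Return {pid: open_fd_count} for every process whose fd directory is readable."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            counts[int(entry)] = len(os.listdir(fd_dir))
        except (FileNotFoundError, NotADirectoryError, PermissionError):
            continue  # process exited, or it belongs to another user
    return counts


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        # Print the ten processes holding the most descriptors.
        ranked = sorted(count_open_fds().items(), key=lambda kv: kv[1], reverse=True)
        for pid, n in ranked[:10]:
            print(pid, n)
```

Comparing these counts before and after raising the limit could show whether the postgres backends are the ones exhausting descriptors.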

this-is-r-gaurav • Jun 30 '22 12:06

We are facing the same issue; any help would be appreciated. The only way to get it started is by ssh'ing into the pod and running patronictl restart timescaledb.
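Before restarting, it may be useful to see what Patroni itself reports. The sketch below queries Patroni's REST API, which listens on port 8008 in this setup, and checks the `state` field of the `/patroni` status document. The URL, the plain-HTTP scheme, and both function names are assumptions for illustration; adjust them if the REST API is secured or bound differently.

```python
import json
import urllib.request


def postgres_is_running(status: dict) -> bool:
    """True when a Patroni status document reports state 'running'."""
    return status.get("state") == "running"


def fetch_status(url: str = "http://localhost:8008/patroni", timeout: float = 5.0) -> dict:
    """Fetch the JSON status document from Patroni's REST API (assumed URL)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Inside the pod, `postgres_is_running(fetch_status())` returning `False` would indicate Patroni considers Postgres down, which is the situation where the `patronictl restart` workaround above was needed.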


#$ cat /etc/timescaledb/patroni.yaml

bootstrap:
  dcs:
    loop_wait: 10
    maximum_lag_on_failover: 33554432
    postgresql:
      parameters:
        archive_command: /etc/timescaledb/scripts/pgbackrest_archive.sh %p
        archive_mode: "on"
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 10
        autovacuum_naptime: 5s
        autovacuum_vacuum_cost_limit: 500
        autovacuum_vacuum_scale_factor: 0.05
        hot_standby: "on"
        log_autovacuum_min_duration: 1min
        log_checkpoints: "on"
        log_connections: "on"
        log_disconnections: "on"
        log_line_prefix: '%t [%p]: [%c-%l] %u@%d,app=%a [%e] '
        log_lock_waits: "on"
        log_min_duration_statement: 1s
        log_statement: ddl
        max_connections: 100
        max_prepared_transactions: 150
        shared_preload_libraries: timescaledb,pg_stat_statements
        ssl: "on"
        ssl_cert_file: /etc/certificate/tls.crt
        ssl_key_file: /etc/certificate/tls.key
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 100
        temp_file_limit: 1GB
        timescaledb.passfile: ../.pgpass
        unix_socket_directories: /var/run/postgresql
        unix_socket_permissions: "0750"
        wal_level: hot_standby
        wal_log_hints: "on"
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    ttl: 30
  method: restore_or_initdb
  post_init: /etc/timescaledb/scripts/post_init.sh
  restore_or_initdb:
    command: |
      /etc/timescaledb/scripts/restore_or_initdb.sh --encoding=UTF8 --locale=C.UTF-8
    keep_existing_recovery_conf: true
kubernetes:
  role_label: role
  scope_label: cluster-name
  use_endpoints: true
log:
  level: WARNING
postgresql:
  authentication:
    replication:
      username: standby
    superuser:
      username: postgres
  basebackup:
  - waldir: /var/lib/postgresql/wal/pg_wal
  callbacks:
    on_reload: /etc/timescaledb/scripts/patroni_callback.sh
    on_restart: /etc/timescaledb/scripts/patroni_callback.sh
    on_role_change: /etc/timescaledb/scripts/patroni_callback.sh
    on_start: /etc/timescaledb/scripts/patroni_callback.sh
    on_stop: /etc/timescaledb/scripts/patroni_callback.sh
  create_replica_methods:
  - pgbackrest
  - basebackup
  listen: 0.0.0.0:5432
  parameters:
    temp_file_limit: 20GB
  pg_hba:
  - local     all             postgres                              peer
  - local     all             all                                   md5
  - hostnossl all,replication all                all                reject
  - hostssl   all             all                127.0.0.1/32       md5
  - hostssl   all             all                ::1/128            md5
  - hostssl   replication     standby            all                md5
  - hostssl   all             all                all                md5
  pgbackrest:
    command: /etc/timescaledb/scripts/pgbackrest_restore.sh
    keep_data: true
    no_master: true
    no_params: true
  recovery_conf:
    restore_command: /etc/timescaledb/scripts/pgbackrest_archive_get.sh %f "%p"
  use_unix_socket: true
restapi:
  listen: 0.0.0.0:8008

kodeine • Feb 28 '23 15:02