postgres-operator
Standby-Cluster - WARNING: Retry got exception: 'connection problems'
Please answer some short questions which should help us to understand your problem / question better:
- Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.7.1
- Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s
- Are you running Postgres Operator in production? no
- Type of issue? question
Hi,
we are trying to create a standby cluster but we are stuck in an infinite loop in which the pod keeps reporting a connection error. This seems to start after the bootstrap_standby_leader stage has finished successfully. For context: we successfully configured S3 with WAL-E in the main cluster, and cloning directly from S3 also works flawlessly. The only thing currently not working is deploying a standby cluster.
❯ kubectl logs infra-demo-0 -f
2021-11-16 14:08:07,744 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2021-11-16 14:08:09,750 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2021-11-16 14:08:09,751 - bootstrapping - INFO - No meta-data available for this provider
2021-11-16 14:08:09,752 - bootstrapping - INFO - Looks like your running local
2021-11-16 14:08:09,783 - bootstrapping - INFO - Configuring wal-e
2021-11-16 14:08:09,784 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2021-11-16 14:08:09,784 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2021-11-16 14:08:09,787 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2021-11-16 14:08:09,787 - bootstrapping - INFO - Configuring bootstrap
2021-11-16 14:08:09,787 - bootstrapping - INFO - Configuring crontab
2021-11-16 14:08:09,787 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2021-11-16 14:08:09,797 - bootstrapping - INFO - Configuring certificate
2021-11-16 14:08:09,798 - bootstrapping - INFO - Generating ssl self-signed certificate
2021-11-16 14:08:09,834 - bootstrapping - INFO - Configuring pgbouncer
2021-11-16 14:08:09,835 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2021-11-16 14:08:09,835 - bootstrapping - INFO - Configuring standby-cluster
2021-11-16 14:08:09,835 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_S3_PREFIX
2021-11-16 14:08:09,836 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALG_S3_PREFIX
2021-11-16 14:08:09,836 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_ACCESS_KEY_ID
2021-11-16 14:08:09,836 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_SECRET_ACCESS_KEY
2021-11-16 14:08:09,836 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_S3_ENDPOINT
2021-11-16 14:08:09,837 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_ENDPOINT
2021-11-16 14:08:09,837 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALG_DISABLE_S3_SSE
2021-11-16 14:08:09,837 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_S3_FORCE_PATH_STYLE
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/USE_WALG_BACKUP
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_LOG_DESTINATION
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/TMPDIR
2021-11-16 14:08:09,838 - bootstrapping - INFO - Configuring log
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/LOG_TMPDIR
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/LOG_AWS_HOST
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/LOG_S3_KEY
2021-11-16 14:08:09,839 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/LOG_S3_BUCKET
2021-11-16 14:08:09,839 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/PGLOG
2021-11-16 14:08:09,839 - bootstrapping - INFO - Configuring patroni
2021-11-16 14:08:09,848 - bootstrapping - INFO - Writing to file /run/postgres.yml
2021-11-16 14:08:09,848 - bootstrapping - INFO - Configuring pgqd
2021-11-16 14:08:09,848 - bootstrapping - INFO - Configuring pam-oauth2
2021-11-16 14:08:09,849 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2021-11-16 14:08:11,108 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2021-11-16 14:08:11,122 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-11-16 14:08:11,124 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:11,191 INFO: trying to bootstrap a new standby leader
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "backup-fetch".
STRUCTURED: time=2021-11-16T14:08:17.557794-00 pid=80
2021-11-16 14:08:21,624 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:21,625 INFO: not healthy enough for leader race
2021-11-16 14:08:21,643 INFO: bootstrap_standby_leader in progress
2021-11-16 14:08:31,624 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:31,624 INFO: not healthy enough for leader race
2021-11-16 14:08:31,624 INFO: bootstrap_standby_leader in progress
wal_e.worker.s3.s3_worker INFO MSG: beginning partition download
DETAIL: The partition being downloaded is part_00000000.tar.lzo.
HINT: The absolute S3 key is dbtestpostgres/spilo/infra-demo/3f5f7482-c9f8-4654-97dd-fb0c644caa80/wal/13/basebackups_005/base_00000004000000000000006D_00000040/tar_partitions/part_00000000.tar.lzo.
STRUCTURED: time=2021-11-16T14:08:36.697552-00 pid=80
2021-11-16 14:08:41,624 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:41,624 INFO: not healthy enough for leader race
2021-11-16 14:08:41,624 INFO: bootstrap_standby_leader in progress
2021-11-16 14:08:48,845 INFO: replica has been created using bootstrap_standby_with_wale
2021-11-16 14:08:48,847 INFO: bootstrapped clone from remote master None
2021-11-16 14:08:49,043 INFO: postmaster pid=97
2021-11-16 14:08:49 UTC [97]: [1-1] 6193bb71.61 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2021-11-16 14:08:49 UTC [97]: [2-1] 6193bb71.61 0 LOG: pg_stat_kcache.linux_hz is set to 500000
/var/run/postgresql:5432 - no response
2021-11-16 14:08:49 UTC [97]: [3-1] 6193bb71.61 0 LOG: redirecting log output to logging collector process
2021-11-16 14:08:49 UTC [97]: [4-1] 6193bb71.61 0 HINT: Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2021-11-16 14:08:51,625 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:51,625 INFO: not healthy enough for leader race
2021-11-16 14:08:51,643 INFO: bootstrap_standby_leader in progress
/var/run/postgresql:5432 - rejecting connections
[...]
2021-11-16 14:09:01,625 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:09:01,625 INFO: not healthy enough for leader race
2021-11-16 14:09:01,625 INFO: bootstrap_standby_leader in progress
2021-11-16 14:09:49,797 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:09:49,797 INFO: Still starting up as a standby.
2021-11-16 14:09:49,798 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2021-11-16 14:09:49,856 INFO: establishing a new patroni connection to the postgres cluster
2021-11-16 14:09:50,529 INFO: establishing a new patroni connection to the postgres cluster
2021-11-16 14:09:50,531 WARNING: Retry got exception: 'connection problems'
2021-11-16 14:09:50,584 INFO: initialized a new cluster
/var/run/postgresql:5432 - rejecting connections
[...]
2021-11-16 14:26:11,097 INFO: Lock owner: infra-demo-0; I am infra-demo-0
2021-11-16 14:26:11,146 WARNING: manual failover: members list is empty
2021-11-16 14:26:11,146 INFO: establishing a new patroni connection to the postgres cluster
2021-11-16 14:26:11,429 INFO: establishing a new patroni connection to the postgres cluster
2021-11-16 14:26:11,430 WARNING: Retry got exception: 'connection problems'
2021-11-16 14:26:11,431 INFO: master start has timed out, but continuing to wait because failover is not possible
The last four messages repeat over and over again until the timeout is reached.
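While Postgres keeps rejecting connections like this, it can help to ask Patroni and Postgres directly what state they think they are in. A small diagnostic sketch (assuming curl is available in the image, that Patroni's REST API listens on its default port 8008, and the default Spilo data directory layout):
# Patroni's own view of the member (role, state, timeline)
kubectl exec -it infra-demo-0 -- curl -s http://localhost:8008/patroni
# Is Postgres still starting up / in recovery, or already accepting connections?
kubectl exec -it infra-demo-0 -- pg_isready -h /var/run/postgresql -p 5432
# Low-level cluster state of the data directory ("in archive recovery" vs. "in production")
kubectl exec -it infra-demo-0 -- pg_controldata /home/postgres/pgdata/pgroot/data | grep 'Database cluster state'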
Additionally, this is the current pod_environment_configmap config:
---
apiVersion: v1
data:
  AWS_ACCESS_KEY_ID: C34...
  AWS_ENDPOINT: https://someurl
  AWS_S3_FORCE_PATH_STYLE: "true"
  AWS_SECRET_ACCESS_KEY: ZD...
  STANDBY_AWS_ACCESS_KEY_ID: C34...
  STANDBY_AWS_ENDPOINT: https://someurl
  STANDBY_AWS_S3_FORCE_PATH_STYLE: "true"
  STANDBY_AWS_SECRET_ACCESS_KEY: ZD...
  STANDBY_USE_WALG_RESTORE: "false"
  USE_WALG_BACKUP: "false"
  USE_WALG_RESTORE: "false"
kind: ConfigMap
metadata:
  name: postgres-pod-config
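A quick sanity check (a sketch, not something from the original thread) is to confirm that the STANDBY_* variables from postgres-pod-config actually reach the pod and that Spilo turns them into the env-standby envdir it uses for the standby bootstrap:
# Names of the injected variables (avoids printing the secrets themselves)
kubectl exec infra-demo-0 -- env | cut -d= -f1 | grep -E '^(STANDBY_|AWS_|USE_WALG)' | sort
# The envdir Spilo writes for the standby bootstrap, as seen in the log above
kubectl exec infra-demo-0 -- ls /run/etc/wal-e.d/env-standby
kubectl exec infra-demo-0 -- cat /run/etc/wal-e.d/env-standby/WALE_S3_PREFIX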
What are we doing wrong?
Thanks in advance and greetings, Sonic-Y3k
Config looks fine, I think. So the problem is with bootstrapping a new standby cluster? What do the Postgres logs inside the pod tell you? Depending on the size of the source cluster, it might take a while for the standby to build and to show the typical Patroni heartbeat messages in the logs.
Hi @FxKu,
yeah, that's correct. The issue appears while bootstrapping a new standby cluster. The really strange thing in this context is that cloning a cluster via S3 works flawlessly.
The Postgres cluster isn't that big; a full clone from S3 takes about 5 minutes until it's up and running.
Here are the logs:
Looking at the Postgres log from the standby pod, it seems like /scripts/wal-e-wal-fetch.sh wal-fetch is called with invalid arguments.
Greetings, Sonic-Y3k
Can you also check the postgresql-.log files? The logs shared so far only show backup activity. And please show the standby section of your postgres manifest.
Hi @FxKu,
absolutely. I just created a new standby cluster to reduce the log output. Here's a gist with the only *.log and *.csv files containing anything.
root@infra-demo-0:/home/postgres/pgdata/pgroot/pg_log# ls -lh
total 152K
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-0.csv
-rw-r--r--. 1 postgres postgres 103K Nov 22 08:26 postgresql-1.csv
-rw-r--r--. 1 postgres postgres 37K Nov 22 08:26 postgresql-1.log
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-2.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-3.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-4.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-5.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-6.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-7.csv
Also here's the complete manifest of the standby cluster:
---
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: infra-demo
spec:
  databases:
    demo: demo
  numberOfInstances: 1
  patroni:
    pg_hba:
      - local all all trust
      - host all all 10.0.0.0/8 md5
      - hostssl all all 10.0.0.0/8 pam
      - hostssl all +zalandos 127.0.0.1/32 pam
      - host all all 127.0.0.1/32 md5
      - hostssl all +zalandos ::1/128 pam
      - host all all ::1/128 md5
      - local replication standby trust
      - hostssl replication standby all md5
      - hostnossl all all all reject
      - hostssl all +zalandos all pam
      - hostssl all all all md5
  postgresql:
    version: "13"
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 2000m
      memory: 4Gi
  teamId: infra
  users:
    demo:
      - superuser
      - createdb
  volume:
    size: 2Gi
  standby:
    s3_wal_path: "s3://replaced_bucket_name/dbtestpostgres/spilo/infra-demo/6f903fc4-2754-4111-993e-1c1e5f2b093f/wal/13"
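One thing worth verifying with such a manifest (again a sketch, not something from the thread) is that the configured s3_wal_path is actually reachable with the standby credentials. Spilo writes them into /run/etc/wal-e.d/env-standby, so the backup listing can be checked from inside the pod; with STANDBY_USE_WALG_RESTORE set to "false" the restore goes through wal-e, but both tools can be asked:
# List base backups visible through the standby prefix/credentials
kubectl exec -it infra-demo-0 -- envdir /run/etc/wal-e.d/env-standby wal-e backup-list
# Same check via wal-g, in case WAL-G restore is enabled instead
kubectl exec -it infra-demo-0 -- envdir /run/etc/wal-e.d/env-standby wal-g backup-list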
Having the same issue, help appreciated @FxKu
We are running into the same issue and can confirm that this is an issue with WAL-E configuration.
Without configuring WAL-E, everything works as expected. But as soon as we try adding the appropriate config for WAL-E, connection issues appear. After reverting the WAL-E config (we are using Terraform, so changes are predictable), things work as expected again.
Environment: k8s server v1.22.6-eks-7d68063, Postgres-Operator v1.7.1
We have been hitting this issue recently too, without any obvious changes. For us the following applies:
- When killing a replica pod, it tries to start up and bootstrap from the master.
- The same log output appears as the thread owner has posted (WARNING: Retry got exception: 'connection problems').
I've logged into the problematic replica pod and noticed that the following process tree is running. This takes about 10-15 minutes in our case. Then, magically, the processes are gone and the replica finally gets bootstrapped:
postgres 77 75 0 11:59 ? 00:00:00 sh -c envdir "/run/etc/wal-e.d/env" timeout "0" /scripts/restore_command.sh "00000006.history" "pg_wal/RECOVERYHISTORY"
postgres 80 77 0 11:59 ? 00:00:00 timeout 0 /scripts/restore_command.sh 00000006.history pg_wal/RECOVERYHISTORY
postgres 81 80 0 11:59 ? 00:00:00 wal-g wal-fetch 00000006.history pg_wal/RECOVERYHISTORY
What makes me suspicious here: there is no 00000006.history on our S3 storage, so I'm not sure what it is trying to do. The cluster is on timeline 5, and therefore only a 00000005.history exists.
Any help is appreciated. I'm a bit stuck here.
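To see what that fetch actually returns, the command from the process tree can be re-run by hand inside the replica pod (a sketch, using the exact envdir and arguments from the listing above):
# Re-run the hanging fetch manually and look at the real error and exit code
envdir /run/etc/wal-e.d/env wal-g wal-fetch 00000006.history /tmp/RECOVERYHISTORY
echo "exit code: $?"
# A cleanly missing history file should fail fast; long hangs or endless retries
# usually point at the storage response rather than at Patroni itself.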
An update from my side. In our case, the issue was on the S3 storage side. The object storage incorrectly replied with error code 500 instead of 404 when an object was requested that does not exist. This seems to lead to an infinite loop during cluster bootstrapping, because wal-g keeps looking for an object that does not exist. I'm still unsure why wal-g does so.
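For anyone hitting the same thing: whether the backend really answers 404 for missing objects can be checked with any S3 client against the same endpoint. A sketch with the aws CLI (assuming it is available somewhere with the same credentials; endpoint, bucket, and key are placeholders taken from this thread's config):
# A nonexistent key should yield a 404/NoSuchKey error, not a 500
# (path-style addressing may need to be enabled, matching AWS_S3_FORCE_PATH_STYLE above)
AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... \
aws s3api head-object \
  --endpoint-url https://someurl \
  --bucket replaced_bucket_name \
  --key dbtestpostgres/spilo/infra-demo/wal/13/does-not-exist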
Are you using minio? I am, and my restore process seems to be stuck too. I'm still in the evaluation phase of all this.
No, we're not using Minio here.