standby cluster from s3 error
Please answer some short questions which should help us to understand your problem / question better:
- Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.8.2
- Where do you run it - cloud or metal? Kubernetes
- Are you running Postgres Operator in production? [no]
- Type of issue? [Bug report/question]
Error when starting the standby cluster:
2022-08-26 03:48:58,245 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-08-26 03:48:58,289 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-08-26 03:48:58,296 INFO: Lock owner: None; I am dns-fake-0
2022-08-26 03:48:58,461 INFO: trying to bootstrap a new standby leader
Can not find any backups
2022-08-26 03:49:04,813 ERROR: Error creating replica using method bootstrap_standby_with_wale: envdir "/run/etc/wal-e.d/env-standby" bash /scripts/wale_restore.sh exited with code=1
2022-08-26 03:49:04,814 ERROR: failed to bootstrap clone from remote master None
2022-08-26 03:49:04,815 INFO: Removing data directory: /home/postgres/pgdata/pgroot/data
2022-08-26 03:49:08,796 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
File "/usr/local/bin/patroni", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 143, in main
return patroni_main()
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 135, in patroni_main
abstract_main(Patroni, schema)
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
controller.run()
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 105, in run
super(Patroni, self).run()
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 108, in _run_cycle
logger.info(self.ha.run_cycle())
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1514, in run_cycle
info = self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1388, in _run_cycle
return self.post_bootstrap()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1280, in post_bootstrap
self.cancel_initialization()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1273, in cancel_initialization
raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 30 seconds
standby cluster config:
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: fake
  namespace: develop
spec:
  teamId: "xx"
  numberOfInstances: 1
  postgresql:
    version: "14"
  standby:
    s3_wal_path: "s3://fake-wal/spilo/dns-fake/wal/14/"
  volume:
    size: 2Gi
    storageClass: rook-ceph-block-ext
Why does it report "ERROR: failed to bootstrap clone from remote master None"?
Here is my S3 data:
/s3 # s3cmd ls s3://fake-wal/spilo/dns-fake/wal/14/
DIR s3://fake-wal/spilo/dns-fake/wal/14/basebackups_005/
DIR s3://fake-wal/spilo/dns-fake/wal/14/wal_005/
And I checked this:
root@dns-fake-0:/home/postgres# envdir "/run/etc/wal-e.d/env" wal-g backup-list
name modified wal_segment_backup_start
base_000000010000000000000003_00090088 2022-08-26T06:41:21Z 000000010000000000000003
root@dns-fake-0:/home/postgres# envdir "/run/etc/wal-e.d/env" wal-e backup-list
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "backup-list".
STRUCTURED: time=2022-08-26T09:11:46.223476-00 pid=227
name last_modified expanded_size_bytes wal_segment_backup_start wal_segment_offset_backup_start wal_segment_backup_stop wal_segment_offset_backup_stop
base_000000010000000000000003_00090088 2022-08-26T06:41:21.654Z 000000010000000000000003 00090088
And this one times out:
root@dns-fake-0:/home/postgres# envdir "/run/etc/wal-e.d/env-standby" wal-e backup-list
...
timeout
Why does the bootstrapping skip some env files for the standby? Here is the log:
2022-08-26 09:09:56,158 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-08-26 09:09:57,388 - bootstrapping - INFO - Looks like your running openstack
2022-08-26 09:09:57,455 - bootstrapping - INFO - Configuring wal-e
2022-08-26 09:09:57,457 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2022-08-26 09:09:57,459 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2022-08-26 09:09:57,459 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2022-08-26 09:09:57,460 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2022-08-26 09:09:57,461 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2022-08-26 09:09:57,462 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2022-08-26 09:09:57,463 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2022-08-26 09:09:57,464 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2022-08-26 09:09:57,465 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2022-08-26 09:09:57,465 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2022-08-26 09:09:57,465 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2022-08-26 09:09:57,466 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2022-08-26 09:09:57,466 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2022-08-26 09:09:57,467 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2022-08-26 09:09:57,545 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2022-08-26 09:09:57,545 - bootstrapping - INFO - Configuring crontab
2022-08-26 09:09:57,546 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-08-26 09:09:57,572 - bootstrapping - INFO - Configuring standby-cluster
2022-08-26 09:09:57,575 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_S3_PREFIX
2022-08-26 09:09:57,576 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALG_S3_PREFIX
2022-08-26 09:09:57,576 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_REGION
2022-08-26 09:09:57,577 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_INSTANCE_PROFILE
2022-08-26 09:09:57,578 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALG_S3_SSE
2022-08-26 09:09:57,578 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/USE_WALG_BACKUP
2022-08-26 09:09:57,579 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/USE_WALG_RESTORE
2022-08-26 09:09:57,579 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_LOG_DESTINATION
2022-08-26 09:09:57,580 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/TMPDIR
2022-08-26 09:09:57,580 - bootstrapping - INFO - Configuring pam-oauth2
2022-08-26 09:09:57,581 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-08-26 09:09:57,582 - bootstrapping - INFO - Configuring pgbouncer
2022-08-26 09:09:57,582 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
I assume you're not running in AWS? Then, for cloning to work, you'll also have to set:
- STANDBY_AWS_REGION
- STANDBY_AWS_ENDPOINT
- STANDBY_WAL_S3_BUCKET
- STANDBY_AWS_ACCESS_KEY_ID
- STANDBY_AWS_SECRET_ACCESS_KEY

Then it should work. And for cloning, the same game, just prefixed with CLONE_ instead of STANDBY_. At least that's what it took to get it working for us.
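For instance, these could be set as environment variables on the cluster pods. A minimal sketch via the manifest's `env` section, assuming a non-AWS S3 such as Ceph RGW; the endpoint, region, bucket and credentials below are placeholders:

```yaml
# Minimal sketch only, not a verified configuration.
# Endpoint, region, bucket and credentials are placeholders for a non-AWS S3.
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: fake
  namespace: develop
spec:
  env:
    - name: STANDBY_AWS_REGION
      value: "us-east-1"                  # placeholder
    - name: STANDBY_AWS_ENDPOINT
      value: "https://s3.example.com"     # placeholder, e.g. a Ceph RGW endpoint
    - name: STANDBY_WAL_S3_BUCKET
      value: "fake-wal"                   # placeholder
    - name: STANDBY_AWS_ACCESS_KEY_ID
      value: "******"
    - name: STANDBY_AWS_SECRET_ACCESS_KEY
      value: "******"
  standby:
    s3_wal_path: "s3://fake-wal/spilo/dns-fake/wal/14/"
  # ...plus the remaining fields from the manifest above
  # (teamId, numberOfInstances, postgresql, volume).
```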
Yes, I am running with Ceph RGW.
And where should I add these environment variables? In the operator ConfigMap, or by manually modifying the StatefulSet resource?
I've added the environment variables, but there are other problems:
2022-08-29 01:29:05,421 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-08-29 01:29:05,471 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-08-29 01:29:05,478 INFO: Lock owner: None; I am dns-fake-0
2022-08-29 01:29:05,621 INFO: trying to bootstrap a new standby leader
INFO: 2022/08/29 01:29:05.857391 Selecting the latest backup...
INFO: 2022/08/29 01:29:05.909714 LATEST backup is: 'base_000000010000000000000011_00000040'
2022-08-29 01:29:15,978 INFO: Lock owner: None; I am dns-fake-0
2022-08-29 01:29:15,978 INFO: not healthy enough for leader race
2022-08-29 01:29:16,042 INFO: bootstrap_standby_leader in progress
2022-08-29 01:29:25,978 INFO: Lock owner: None; I am dns-fake-0
2022-08-29 01:29:25,978 INFO: not healthy enough for leader race
...
It seems to come from here:
https://github.com/zalando/patroni/blob/cea1fa869baafc077d17dd5eb1b60350d09d0592/patroni/ha.py#L1246
I encountered the same issue, but I resolved it by directly defining the environment variables in the cluster's YAML file.
For some reason, the custom ConfigMap and Secret defined in the operator configuration as below work well for cloning, but I suspect they do not take effect for standby:
pod_environment_configmap: XXX
additional_secret_mount: XXX
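For reference, a sketch of how those two operator options are typically wired up (names, namespace and values below are placeholders; this is roughly the OperatorConfiguration CRD form, and the ConfigMap-based operator config uses the same flat keys):

```yaml
# Sketch only; names, namespace and values are placeholders.
apiVersion: acid.zalan.do/v1
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-configuration
configuration:
  kubernetes:
    # keys of this ConfigMap are injected as env vars into the Spilo pods
    pod_environment_configmap: "default/postgres-pod-config"
  aws_or_gcp:
    # an existing Secret that gets mounted into the pods
    additional_secret_mount: "postgres-s3-credentials"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config
  namespace: default
data:
  STANDBY_AWS_ENDPOINT: "https://s3.example.com"   # placeholder
  STANDBY_AWS_REGION: "us-east-1"                  # placeholder
```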
Since it was successful, here is the cluster definition file:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
name: fcs-pg-cluster01-standby1
spec:
env:
- name: ALLOW_NOSSL
value: "true"
- name: STANDBY_USE_WALG_BACKUP
value: "true"
- name: STANDBY_USE_WALG_RESTORE
value: "true"
- name: STANDBY_WITH_WALE
value: "true"
- name: STANDBY_AWS_ACCESS_KEY_ID
value: "******"
- name: STANDBY_AWS_SECRET_ACCESS_KEY
value: "******"
- name: STANDBY_AWS_REGION
value: "ap-northeast-1"
- name: STANDBY_AWS_ENDPOINT
value: "https://s3.ap-northeast-1.amazonaws.com"
- name: STANDBY_WAL_S3_BUCKET
value: "pgo-physical-backup"
standby:
s3_wal_path: "s3://pgo-physical-backup/spilo/fcs-pg-cluster01/wal/14/"
teamId: "acid"
numberOfInstances: 1
allowedSourceRanges:
- 10.0.0.0/8
- 192.168.0.0/16
enableMasterLoadBalancer: true
postgresql:
version: '14'
volume:
size: 10Gi
@jun-fcss did you get it working? I'm facing a similar issue.