
Standby cluster from S3 error

meimeitou opened this issue 2 years ago • 5 comments

Please answer some short questions which should help us understand your problem / question better:

  • Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.8.2
  • Where do you run it - cloud or metal? Kubernetes
  • Are you running Postgres Operator in production? [no]
  • Type of issue? [Bug report/question]

Error when starting the standby cluster:

2022-08-26 03:48:58,245 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-08-26 03:48:58,289 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-08-26 03:48:58,296 INFO: Lock owner: None; I am dns-fake-0
2022-08-26 03:48:58,461 INFO: trying to bootstrap a new standby leader
Can not find any backups
2022-08-26 03:49:04,813 ERROR: Error creating replica using method bootstrap_standby_with_wale: envdir "/run/etc/wal-e.d/env-standby" bash /scripts/wale_restore.sh exited with code=1
2022-08-26 03:49:04,814 ERROR: failed to bootstrap clone from remote master None
2022-08-26 03:49:04,815 INFO: Removing data directory: /home/postgres/pgdata/pgroot/data
2022-08-26 03:49:08,796 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 143, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 135, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 105, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 108, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1514, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1388, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1280, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1273, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 30 seconds

Standby cluster config:

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: fake
  namespace: develop
spec:
  teamId: "xx"
  numberOfInstances: 1
  postgresql:
    version: "14"
  standby:
    s3_wal_path: "s3://fake-wal/spilo/dns-fake/wal/14/"
  volume:
    size: 2Gi
    storageClass: rook-ceph-block-ext

Why does it fail with "ERROR: failed to bootstrap clone from remote master None"?

Here is my S3 data:

/s3 # s3cmd ls s3://fake-wal/spilo/dns-fake/wal/14/
                          DIR  s3://fake-wal/spilo/dns-fake/wal/14/basebackups_005/
                          DIR  s3://fake-wal/spilo/dns-fake/wal/14/wal_005/

And I checked this:

root@dns-fake-0:/home/postgres# envdir "/run/etc/wal-e.d/env" wal-g backup-list
name                                   modified             wal_segment_backup_start
base_000000010000000000000003_00090088 2022-08-26T06:41:21Z 000000010000000000000003

root@dns-fake-0:/home/postgres# envdir "/run/etc/wal-e.d/env" wal-e backup-list
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-list".
        STRUCTURED: time=2022-08-26T09:11:46.223476-00 pid=227
name    last_modified   expanded_size_bytes     wal_segment_backup_start        wal_segment_offset_backup_start wal_segment_backup_stop wal_segment_offset_backup_stop
base_000000010000000000000003_00090088  2022-08-26T06:41:21.654Z                000000010000000000000003        00090088

And this one times out:

root@dns-fake-0:/home/postgres# envdir "/run/etc/wal-e.d/env-standby" wal-e backup-list
...
timeout

Why does the bootstrapping skip some env files for wal-e? The env-standby directory ends up without the credential and endpoint files:

2022-08-26 09:09:56,158 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-08-26 09:09:57,388 - bootstrapping - INFO - Looks like your running openstack
2022-08-26 09:09:57,455 - bootstrapping - INFO - Configuring wal-e
2022-08-26 09:09:57,457 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2022-08-26 09:09:57,459 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2022-08-26 09:09:57,459 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2022-08-26 09:09:57,460 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2022-08-26 09:09:57,461 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2022-08-26 09:09:57,462 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2022-08-26 09:09:57,463 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2022-08-26 09:09:57,464 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2022-08-26 09:09:57,465 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2022-08-26 09:09:57,465 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2022-08-26 09:09:57,465 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2022-08-26 09:09:57,466 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2022-08-26 09:09:57,466 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2022-08-26 09:09:57,467 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2022-08-26 09:09:57,545 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2022-08-26 09:09:57,545 - bootstrapping - INFO - Configuring crontab
2022-08-26 09:09:57,546 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-08-26 09:09:57,572 - bootstrapping - INFO - Configuring standby-cluster
2022-08-26 09:09:57,575 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_S3_PREFIX
2022-08-26 09:09:57,576 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALG_S3_PREFIX
2022-08-26 09:09:57,576 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_REGION
2022-08-26 09:09:57,577 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_INSTANCE_PROFILE
2022-08-26 09:09:57,578 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALG_S3_SSE
2022-08-26 09:09:57,578 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/USE_WALG_BACKUP
2022-08-26 09:09:57,579 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/USE_WALG_RESTORE
2022-08-26 09:09:57,579 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_LOG_DESTINATION
2022-08-26 09:09:57,580 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/TMPDIR
2022-08-26 09:09:57,580 - bootstrapping - INFO - Configuring pam-oauth2
2022-08-26 09:09:57,581 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-08-26 09:09:57,582 - bootstrapping - INFO - Configuring pgbouncer
2022-08-26 09:09:57,582 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping

meimeitou — Aug 26 '22 03:08

I assume you're not running in AWS? Then, for the standby to work, you'll also have to set:

  • STANDBY_AWS_REGION
  • STANDBY_AWS_ENDPOINT
  • STANDBY_WAL_S3_BUCKET
  • STANDBY_AWS_ACCESS_KEY_ID
  • STANDBY_AWS_SECRET_ACCESS_KEY

Then it should work. For cloning it's the same game, just with the CLONE_ prefix instead of STANDBY_. At least that's what it took to get it working for us.
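
One way to expose these variables to the Spilo pods is a ConfigMap referenced by the operator's pod_environment_configmap option. This is only a sketch with placeholder names and values (assuming a Ceph RGW endpoint), not configuration from this thread:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-env          # referenced via pod_environment_configmap
  namespace: default
data:
  STANDBY_AWS_REGION: "us-east-1"
  STANDBY_AWS_ENDPOINT: "https://rgw.example.local"   # placeholder RGW endpoint
  STANDBY_WAL_S3_BUCKET: "fake-wal"
  STANDBY_AWS_ACCESS_KEY_ID: "changeme"
  STANDBY_AWS_SECRET_ACCESS_KEY: "changeme"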

tamcore — Aug 26 '22 10:08

Yes, I am running with Ceph RGW. Where should I add these environment variables? In the operator ConfigMap, or by manually modifying the StatefulSet resource?

meimeitou — Aug 29 '22 01:08

(Quoting @tamcore's reply above about the STANDBY_* variables.)

I've added the environment variables, but now there is another problem:

2022-08-29 01:29:05,421 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-08-29 01:29:05,471 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-08-29 01:29:05,478 INFO: Lock owner: None; I am dns-fake-0
2022-08-29 01:29:05,621 INFO: trying to bootstrap a new standby leader
INFO: 2022/08/29 01:29:05.857391 Selecting the latest backup...
INFO: 2022/08/29 01:29:05.909714 LATEST backup is: 'base_000000010000000000000011_00000040'
2022-08-29 01:29:15,978 INFO: Lock owner: None; I am dns-fake-0
2022-08-29 01:29:15,978 INFO: not healthy enough for leader race
2022-08-29 01:29:16,042 INFO: bootstrap_standby_leader in progress
2022-08-29 01:29:25,978 INFO: Lock owner: None; I am dns-fake-0
2022-08-29 01:29:25,978 INFO: not healthy enough for leader race
...

It seems to be stuck here:

https://github.com/zalando/patroni/blob/cea1fa869baafc077d17dd5eb1b60350d09d0592/patroni/ha.py#L1246

meimeitou — Aug 29 '22 01:08

I encountered the same issue, but I resolved it by directly defining the environment variables in the cluster's YML file.

For some reason, custom ConfigMaps and secrets defined in the operator configuration as below work well for cloning, but I suspect they do not take effect for standby:

pod_environment_configmap: XXX
additional_secret_mount: XXX
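
For reference, this is roughly how those two options might look in the operator's ConfigMap; the values on the right are placeholders, not from this thread:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  # ConfigMap whose keys are injected as env vars into every Postgres pod
  pod_environment_configmap: "default/postgres-pod-env"
  # extra Secret mounted into the Postgres pods
  additional_secret_mount: "my-s3-credentials"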

Here is the cluster definition file that worked for me:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: fcs-pg-cluster01-standby1
spec:
  env:
  - name: ALLOW_NOSSL
    value: "true"
  - name: STANDBY_USE_WALG_BACKUP
    value: "true"
  - name: STANDBY_USE_WALG_RESTORE
    value: "true"
  - name: STANDBY_WITH_WALE
    value: "true"
  - name: STANDBY_AWS_ACCESS_KEY_ID
    value: "******"
  - name: STANDBY_AWS_SECRET_ACCESS_KEY
    value: "******"
  - name: STANDBY_AWS_REGION
    value: "ap-northeast-1"
  - name: STANDBY_AWS_ENDPOINT
    value: "https://s3.ap-northeast-1.amazonaws.com"
  - name: STANDBY_WAL_S3_BUCKET
    value: "pgo-physical-backup"

  standby:
    s3_wal_path: "s3://pgo-physical-backup/spilo/fcs-pg-cluster01/wal/14/"

  teamId: "acid"

  numberOfInstances: 1

  allowedSourceRanges:
    - 10.0.0.0/8
    - 192.168.0.0/16

  enableMasterLoadBalancer: true

  postgresql:
    version: '14'

  volume:
    size: 10Gi

jun-fcss — Aug 13 '23 13:08

@jun-fcss did you get it working? I'm facing a similar issue.

NIK2727 — Jan 30 '24 04:01