postgres-operator icon indicating copy to clipboard operation
postgres-operator copied to clipboard

S3 WAL archiving not working as expected

Open samox73 opened this issue 3 years ago • 7 comments

Please, answer some short questions which should help us to understand your problem / question better?

  • Which image of the operator are you using? e.g. registry.opensource.zalan.do/acid/postgres-operator:v1.8.2
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? AWS EKS K8s
  • Are you running Postgres Operator in production? yes
  • Type of issue? Bug report/unclear documentation

I've followed https://postgres-operator.readthedocs.io/en/latest/administrator/#wal-archiving-and-physical-basebackups to set up continuous backups. When entering the command to produce a backup manually I get

# envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"
2022-10-06 09:31:09.065 - /scripts/postgres_backup.sh - I was called as: /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  database "root" does not exist
2022-10-06 09:31:09.096 - /scripts/postgres_backup.sh - ERROR: Recovery state unknown: 

and in the logs I spot these lines after trying to do a manual backup

postgresql-4.csv
2022-10-06 09:37:44.036 UTC,,,872,"[local]",633ea1e8.368,1,"",2022-10-06 09:37:44 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-10-06 09:37:44.036 UTC,"root","root",872,"[local]",633ea1e8.368,2,"authentication",2022-10-06 09:37:44 UTC,7/137,0,LOG,00000,"connection authorized: user=root database=root application_name=psql",,,,,,,,,"","client backend",,0
2022-10-06 09:37:44.036 UTC,"root","root",872,"[local]",633ea1e8.368,3,"startup",2022-10-06 09:37:44 UTC,7/137,0,FATAL,3D000,"database ""root"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:37:47.557 UTC,,,292,,633e9c7f.124,37,,2022-10-06 09:14:39 UTC,,0,LOG,00000,"archive command failed with exit code 2","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-e wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
.
.
.
2022-10-06 09:41:26.238 UTC,"postgres","postgres",406,"[local]",633ea2c6.196,2,"authentication",2022-10-06 09:41:26 UTC,7/18,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=psql",,,,,,,,,"","client backend",,0
2022-10-06 09:41:26.238 UTC,"postgres","postgres",406,"[local]",633ea2c6.196,3,"startup",2022-10-06 09:41:26 UTC,7/18,0,FATAL,28000,"role ""postgres"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:41:27.211 UTC,,,411,"[local]",633ea2c7.19b,1,"",2022-10-06 09:41:27 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-10-06 09:41:27.211 UTC,"postgres","postgres",411,"[local]",633ea2c7.19b,2,"authentication",2022-10-06 09:41:27 UTC,7/19,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=pgq ticker",,,,,,,,,"","client backend",,0
2022-10-06 09:41:27.211 UTC,"postgres","postgres",411,"[local]",633ea2c7.19b,3,"startup",2022-10-06 09:41:27 UTC,7/19,0,FATAL,28000,"role ""postgres"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:42:08.556 UTC,,,323,,633ea2a5.143,5,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:12.793 UTC,,,323,,633ea2a5.143,6,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:17.031 UTC,,,323,,633ea2a5.143,7,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:17.031 UTC,,,323,,633ea2a5.143,8,,2022-10-06 09:40:53 UTC,,0,WARNING,01000,"archiving write-ahead log file ""00000002.history"" failed too many times, will try again later",,,,,,,,,"","archiver",,0
Environment variables of the pod
AWS_DEFAULT_REGION:eu-central-1
AWS_REGION:eu-central-1
AWS_ROLE_ARN:arn:aws:iam::************:role/postgres-pod-role
AWS_STS_REGIONAL_ENDPOINTS:regional
AWS_WEB_IDENTITY_TOKEN_FILE:/var/run/secrets/eks.amazonaws.com/serviceaccount/token
BACKUP_NUM_TO_RETAIN:5
BACKUP_SCHEDULE:*/5 * * * *
CLONE_USE_WALG_RESTORE:true
DCS_ENABLE_KUBERNETES_API:true
HUMAN_ROLE:zalandos
KUBERNETES_LABELS:{"application":"spilo"}
KUBERNETES_ROLE_LABEL:spilo-role
KUBERNETES_SCOPE_LABEL:*******
KUBE_IAM_ROLE:postgres-pod-role
PAM_OAUTH2:https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
PGPASSWORD_STANDBY:secretKeyRef(standby.*******-postgres-development.postgres.credentials.password) 
PGPASSWORD_SUPERUSER:secretKeyRef(root.*******-postgres-development.postgres.credentials.password) 
PGROOT:/home/postgres/pgdata/pgroot
PGUSER_STANDBY:standby
PGUSER_SUPERUSER:root
PGVERSION:14
SCOPE:*******-postgres-development
SPILO_CONFIGURATION:{"postgresql":{},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"users":{"zalandos":{"password":"","options":["CREATEDB","NOLOGIN"]}},"dcs":{}}}
USE_WALG_BACKUP:true
USE_WALG_RESTORE:true
WAL_BUCKET_SCOPE_PREFIX:
WAL_BUCKET_SCOPE_SUFFIX:
WAL_S3_BUCKET:*******-postgres

The cron job also does not seem to be executed every 5 minutes as there are no logs statements for it.

samox73 avatar Oct 06 '22 09:10 samox73

Running the wal-g upload command manually works without any errors

# envdir /run/etc/wal-e.d/env wal-g wal-push ../data/pg_wal/00000002.history 
INFO: 2022/10/06 12:23:12.494843 FILE PATH: 00000002.history.lz4
INFO: 2022/10/06 12:23:12.586170 FILE PATH: 000000010000000000000002.partial.lz4
INFO: 2022/10/06 12:23:12.613509 FILE PATH: 000000020000000000000003.partial.lz4
INFO: 2022/10/06 12:23:12.640760 FILE PATH: 000000020000000000000002.lz4

and I can see the data in the S3 bucket. However, now I get these errors in postgresql-4.log:

ERROR: 2022/10/06 12:23:39.465899 Couldn't check whether there is an overwrite attempt due to inner error: failed to read object: 'spilo/powerbot-postgres-development/wal/14/wal_005/00000002.history.lz4' from S3: AccessDenied: Access Denied
        status code: 403, request id: EANDRVDGZDMMEBA9, host id: 0lVxvS3DilpYcvJS9Zf1RNbRezcr17JWvKsNaMkZ9yGahqOiQ2DMi+rWY3ZnsVOBQqEk3bMdnkA=

Even though the service account is permitted to access all resources on this bucket with all action S3 actions s3:*.

Listing backups, still does not work:

# envdir /run/etc/wal-e.d/env wal-g backup-list
INFO: 2022/10/06 12:26:28.955900 No backups found

samox73 avatar Oct 06 '22 12:10 samox73

So I figured out that for automatic backups to work, the env vars AWS_ACCESS_KEY_ID and AS_SECRET_ACCESS_KEY need to be set. I could not get it to work with the more secure Service Account linked to AWS Role with attached policy approach.

Another thing I noticed that for the manual backup to work one has to set PGUSER to the correct superuser of the postgres DB. In my case this user is postgres and the script was trying user root.

samox73 avatar Oct 07 '22 14:10 samox73

Any news around? Im facing the same issue

angelbarrera92 avatar Jun 11 '23 17:06 angelbarrera92

After this change https://github.com/zalando/spilo/pull/769 WAL archiving works with Service Account linked to AWS Role, if you alter postgres-pod service account with proper role ARN manually. Operator cannot do that for you. Just add variables to postgres pod: AWS_REGION: bucket region WAL_S3_BUCKET: backup bucket BACKUP_NUM_TO_RETAIN: "7" USE_WALG_BACKUP: "true" #Wal-e won't work USE_WALG_RESTORE: "true" WAL_BUCKET_SCOPE_PREFIX: "" WAL_BUCKET_SCOPE_SUFFIX: "" For being able to restore from that S3 archive you would need to add two more variables to CLONE_ part: CLONE_AWS_REGION CLONE_WAL_S3_BUCKET CLONE_USE_WALG_RESTORE: "true" CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token CLONE_AWS_ROLE_ARN: same ARN as from postgres-pod service account annotation

oleksiytsyban avatar Aug 04 '23 00:08 oleksiytsyban

Don't run the backups as the root user. Run them as the postgres user, or whatever user the svc is executing as.

bck01215 avatar Aug 20 '23 01:08 bck01215

Posting my solution here for anyone who may face the same issue.

I created these two Kubernetes resources:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-env-configmap
  namespace: postgres-operator
data:
  CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  CLONE_USE_WALG_RESTORE: "true"
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
---
apiVersion: v1
kind: Secret
metadata:
  name: pod-env-secret
  namespace: my-namespace
data:
  CLONE_AWS_ROLE_ARN: UHV0IHlvdXIgQVdTIHJvbGUgQVJOIGhlcmUgOikgdGhpcyBpcyBqdXN0IGFuIGV4YW1wbGU=

And then I set the following values in the operator's Helm chart release:

configKubernetes:
  pod_environment_configmap: postgres-operator/pod-env-configmap
  pod_environment_secret: pod-env-secret

Note: I set the CLONE_AWS_ROLE_ARN env var in the Secret instead of the ConfigMap just because in my scenario I have a different role (and then a different secret) for each namespace.

dmotte avatar Mar 13 '24 16:03 dmotte

Thanks, @dmotte .Will see to declare and observe resources in this pattern when applicable.

rishiraj88 avatar Mar 14 '24 07:03 rishiraj88