postgres-operator
                        S3 WAL archiving not working as expected
Please answer some short questions, which should help us understand your problem/question better:
- Which image of the operator are you using? e.g. registry.opensource.zalan.do/acid/postgres-operator:v1.8.2
- Where do you run it - cloud or metal? Kubernetes or OpenShift? AWS EKS K8s
- Are you running Postgres Operator in production? yes
- Type of issue? Bug report/unclear documentation
I've followed https://postgres-operator.readthedocs.io/en/latest/administrator/#wal-archiving-and-physical-basebackups to set up continuous backups. When I run the command to produce a backup manually, I get:
# envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"
2022-10-06 09:31:09.065 - /scripts/postgres_backup.sh - I was called as: /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  database "root" does not exist
2022-10-06 09:31:09.096 - /scripts/postgres_backup.sh - ERROR: Recovery state unknown: 
and after trying a manual backup I spot these lines in the logs:
postgresql-4.csv
2022-10-06 09:37:44.036 UTC,,,872,"[local]",633ea1e8.368,1,"",2022-10-06 09:37:44 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-10-06 09:37:44.036 UTC,"root","root",872,"[local]",633ea1e8.368,2,"authentication",2022-10-06 09:37:44 UTC,7/137,0,LOG,00000,"connection authorized: user=root database=root application_name=psql",,,,,,,,,"","client backend",,0
2022-10-06 09:37:44.036 UTC,"root","root",872,"[local]",633ea1e8.368,3,"startup",2022-10-06 09:37:44 UTC,7/137,0,FATAL,3D000,"database ""root"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:37:47.557 UTC,,,292,,633e9c7f.124,37,,2022-10-06 09:14:39 UTC,,0,LOG,00000,"archive command failed with exit code 2","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-e wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
...
2022-10-06 09:41:26.238 UTC,"postgres","postgres",406,"[local]",633ea2c6.196,2,"authentication",2022-10-06 09:41:26 UTC,7/18,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=psql",,,,,,,,,"","client backend",,0
2022-10-06 09:41:26.238 UTC,"postgres","postgres",406,"[local]",633ea2c6.196,3,"startup",2022-10-06 09:41:26 UTC,7/18,0,FATAL,28000,"role ""postgres"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:41:27.211 UTC,,,411,"[local]",633ea2c7.19b,1,"",2022-10-06 09:41:27 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-10-06 09:41:27.211 UTC,"postgres","postgres",411,"[local]",633ea2c7.19b,2,"authentication",2022-10-06 09:41:27 UTC,7/19,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=pgq ticker",,,,,,,,,"","client backend",,0
2022-10-06 09:41:27.211 UTC,"postgres","postgres",411,"[local]",633ea2c7.19b,3,"startup",2022-10-06 09:41:27 UTC,7/19,0,FATAL,28000,"role ""postgres"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:42:08.556 UTC,,,323,,633ea2a5.143,5,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:12.793 UTC,,,323,,633ea2a5.143,6,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:17.031 UTC,,,323,,633ea2a5.143,7,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:17.031 UTC,,,323,,633ea2a5.143,8,,2022-10-06 09:40:53 UTC,,0,WARNING,01000,"archiving write-ahead log file ""00000002.history"" failed too many times, will try again later",,,,,,,,,"","archiver",,0
Environment variables of the pod
AWS_DEFAULT_REGION:eu-central-1
AWS_REGION:eu-central-1
AWS_ROLE_ARN:arn:aws:iam::************:role/postgres-pod-role
AWS_STS_REGIONAL_ENDPOINTS:regional
AWS_WEB_IDENTITY_TOKEN_FILE:/var/run/secrets/eks.amazonaws.com/serviceaccount/token
BACKUP_NUM_TO_RETAIN:5
BACKUP_SCHEDULE:*/5 * * * *
CLONE_USE_WALG_RESTORE:true
DCS_ENABLE_KUBERNETES_API:true
HUMAN_ROLE:zalandos
KUBERNETES_LABELS:{"application":"spilo"}
KUBERNETES_ROLE_LABEL:spilo-role
KUBERNETES_SCOPE_LABEL:*******
KUBE_IAM_ROLE:postgres-pod-role
PAM_OAUTH2:https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
PGPASSWORD_STANDBY:secretKeyRef(standby.*******-postgres-development.postgres.credentials.password) 
PGPASSWORD_SUPERUSER:secretKeyRef(root.*******-postgres-development.postgres.credentials.password) 
PGROOT:/home/postgres/pgdata/pgroot
PGUSER_STANDBY:standby
PGUSER_SUPERUSER:root
PGVERSION:14
SCOPE:*******-postgres-development
SPILO_CONFIGURATION:{"postgresql":{},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"users":{"zalandos":{"password":"","options":["CREATEDB","NOLOGIN"]}},"dcs":{}}}
USE_WALG_BACKUP:true
USE_WALG_RESTORE:true
WAL_BUCKET_SCOPE_PREFIX:
WAL_BUCKET_SCOPE_SUFFIX:
WAL_S3_BUCKET:*******-postgres
The cron job also does not seem to be executed every 5 minutes, as there are no log statements for it.
Running the wal-g upload command manually works without any errors:
# envdir /run/etc/wal-e.d/env wal-g wal-push ../data/pg_wal/00000002.history 
INFO: 2022/10/06 12:23:12.494843 FILE PATH: 00000002.history.lz4
INFO: 2022/10/06 12:23:12.586170 FILE PATH: 000000010000000000000002.partial.lz4
INFO: 2022/10/06 12:23:12.613509 FILE PATH: 000000020000000000000003.partial.lz4
INFO: 2022/10/06 12:23:12.640760 FILE PATH: 000000020000000000000002.lz4
and I can see the data in the S3 bucket. However, now I get these errors in postgresql-4.log:
ERROR: 2022/10/06 12:23:39.465899 Couldn't check whether there is an overwrite attempt due to inner error: failed to read object: 'spilo/powerbot-postgres-development/wal/14/wal_005/00000002.history.lz4' from S3: AccessDenied: Access Denied
        status code: 403, request id: EANDRVDGZDMMEBA9, host id: 0lVxvS3DilpYcvJS9Zf1RNbRezcr17JWvKsNaMkZ9yGahqOiQ2DMi+rWY3ZnsVOBQqEk3bMdnkA=
This happens even though the service account is permitted to access all resources on this bucket with all S3 actions (s3:*).
Listing backups still does not work:
# envdir /run/etc/wal-e.d/env wal-g backup-list
INFO: 2022/10/06 12:26:28.955900 No backups found
So I figured out that for automatic backups to work, the env vars AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY need to be set. I could not get it to work with the more secure approach of a service account linked to an AWS role with an attached policy.
Another thing I noticed is that for the manual backup to work, one has to set PGUSER to the correct superuser of the Postgres DB. In my case this user is postgres, while the script was trying user root.
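Based on that observation, the manual backup invocation can be pointed at the right superuser by exporting PGUSER first (a sketch, assuming the actual superuser in your cluster is named postgres; the paths are the ones from the logs above):

```shell
# Assumption: the real superuser is "postgres", not "root"
export PGUSER=postgres
envdir /run/etc/wal-e.d/env /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
```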
Any news on this? I'm facing the same issue.
After this change https://github.com/zalando/spilo/pull/769, WAL archiving works with a service account linked to an AWS role, provided you annotate the postgres-pod service account with the proper role ARN manually. The operator cannot do that for you. Just add these variables to the postgres pod:
- AWS_REGION: bucket region
- WAL_S3_BUCKET: backup bucket
- BACKUP_NUM_TO_RETAIN: "7"
- USE_WALG_BACKUP: "true" # wal-e won't work
- USE_WALG_RESTORE: "true"
- WAL_BUCKET_SCOPE_PREFIX: ""
- WAL_BUCKET_SCOPE_SUFFIX: ""

To be able to restore from that S3 archive, you also need the CLONE_ counterparts:
- CLONE_AWS_REGION
- CLONE_WAL_S3_BUCKET
- CLONE_USE_WALG_RESTORE: "true"
- CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
- CLONE_AWS_ROLE_ARN: the same ARN as in the postgres-pod service account annotation
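Annotating the postgres-pod service account manually, as described above, might look like the following manifest fragment (a sketch: the account ID, role name, and namespace are placeholders for your own values):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: postgres-pod
  namespace: my-namespace
  annotations:
    # EKS IRSA annotation; substitute your real role ARN
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/postgres-pod-role
```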
Don't run the backups as the root user. Run them as the postgres user, or whatever user the service is executing as.
Posting my solution here for anyone who may face the same issue.
I created these two Kubernetes resources:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-env-configmap
  namespace: postgres-operator
data:
  CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  CLONE_USE_WALG_RESTORE: "true"
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
---
apiVersion: v1
kind: Secret
metadata:
  name: pod-env-secret
  namespace: my-namespace
data:
  CLONE_AWS_ROLE_ARN: UHV0IHlvdXIgQVdTIHJvbGUgQVJOIGhlcmUgOikgdGhpcyBpcyBqdXN0IGFuIGV4YW1wbGU=
And then I set the following values in the operator's Helm chart release:
configKubernetes:
  pod_environment_configmap: postgres-operator/pod-env-configmap
  pod_environment_secret: pod-env-secret
Note: I set the CLONE_AWS_ROLE_ARN env var in the Secret instead of the ConfigMap just because in my scenario I have a different role (and then a different secret) for each namespace.
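If hand-encoding the base64 value for the Secret is error-prone, Kubernetes also accepts the plain value via the stringData field (a sketch; the role ARN is a placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: pod-env-secret
  namespace: my-namespace
stringData:
  # Plain-text value; the API server base64-encodes it into .data on write
  CLONE_AWS_ROLE_ARN: arn:aws:iam::123456789012:role/postgres-pod-role
```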
Thanks, @dmotte. I will look to declare and observe resources in this pattern when applicable.