WAL-G clone from S3 (IAM Role): AccessDenied
Hello!
I'm trying without success to clone a new PostgreSQL cluster managed by the Zalando operator v1.10.0. I use AWS + S3 + IAM Roles. Backups work fine, and I can see them from a pod:
root@test-postgresql-0:/home/postgres# envdir "/run/etc/wal-e.d/env" wal-g backup-list
name modified wal_segment_backup_start
base_000000010000000000000002 2023-08-17T12:47:14Z 000000010000000000000002
root@test-postgresql-0:/home/postgres#
I tried to clone it in multiple ways; one of them uses this manifest:
spec:
  clone:
    uid: "d4355d73-f5d6-4786-a093-7239125a2a12"
    cluster: "test-postgresql"
    timestamp: "2023-08-17T12:47:00+00:00"
    s3_wal_path: "s3://postgres-backup/spilo/postgresql-test-postgresql/wal/14/"
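For context, the clone section sits inside the full postgresql resource; a minimal sketch, where the name, teamId, instance count, and volume size are placeholder assumptions:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: test-clone-postgresql        # assumed name for the clone cluster
spec:
  teamId: "test"                     # placeholder team
  numberOfInstances: 1               # placeholder
  volume:
    size: 10Gi                       # placeholder size
  postgresql:
    version: "14"
  clone:
    uid: "d4355d73-f5d6-4786-a093-7239125a2a12"
    cluster: "test-postgresql"
    timestamp: "2023-08-17T12:47:00+00:00"
    s3_wal_path: "s3://postgres-backup/spilo/postgresql-test-postgresql/wal/14/"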
When I check whether the new pod has access and can "see" the backup:
root@test-clone-postgresql-0:/home/postgres# envdir "/run/etc/wal-e.d/env-clone-test-postgresql/" wal-g backup-list
name modified wal_segment_backup_start
base_000000010000000000000002 2023-08-17T12:47:14Z 000000010000000000000002
root@test-clone-postgresql-0:/home/postgres#
The answer is "YES", it can. I installed the AWS CLI in the new pod and checked access to my S3 bucket:
root@test-clone-postgresql-0:/home/postgres# aws sts get-caller-identity
{
"UserId": "AROA4UUBI7GDK5ACMHPAB:botocore-session-1692281238",
"Account": "1111111111",
"Arn": "arn:aws:sts::1111111111:assumed-role/postgres-backup-s3-access-role/botocore-session-1692281238"
}
and listed the bucket contents:
root@test-clone-postgresql-0:/home/postgres# aws s3 ls s3://postgres-backup/spilo/postgresql-test-postgresql/wal/14/basebackups_005/
PRE base_000000010000000000000002/
2023-08-17 12:47:14 388 base_000000010000000000000002_backup_stop_sentinel.json
But the startup logs showed a weird result:
2023-08-17 12:59:10,460 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-08-17 12:59:10,461 INFO: Lock owner: None; I am test-clone-postgresql-0
2023-08-17 12:59:10,541 INFO: trying to bootstrap a new cluster
2023-08-17 12:59:10,541 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-test-postgresql" python3 /scripts/clone_with_wale.py --recovery-target-time="2023-08-17T12:47:00+00:00"
2023-08-17 12:59:10,597 INFO: Trying s3://postgres-backup/spilo/postgresql-test-postgresql/wal/14/ for clone
ERROR: 2023/08/17 12:59:13.824667 failed to list s3 folder: 'spilo/postgresql-test-postgresql/wal/14/basebackups_005/': AccessDenied: Access Denied
status code: 403, request id: 5TPDP0S5NSN16QFA, host id: 4u4XDWohJwKE+q5R7qMUACDr2tdg//CBGCiyaHW56I8tRj1jeIDrDWpf1RRJeBKN9Y02xmQgXxez7NiihScwOw==
2023-08-17 12:59:13,827 ERROR: Clone failed
Traceback (most recent call last):
  File "/scripts/clone_with_wale.py", line 185, in main
    run_clone_from_s3(options)
  File "/scripts/clone_with_wale.py", line 166, in run_clone_from_s3
    backup_name, update_envdir = find_backup(options.recovery_target_time, env)
  File "/scripts/clone_with_wale.py", line 150, in find_backup
    backup_list = list_backups(env)
  File "/scripts/clone_with_wale.py", line 84, in list_backups
    output = subprocess.check_output(backup_list_cmd, env=env)
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['wal-g', 'backup-list']' returned non-zero exit status 1.
2023-08-17 12:59:13,834 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 144, in main
    return patroni_main()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 136, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 108, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 106, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 65, in run
    self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 109, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1771, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1593, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1484, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1477, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: exceeded maximum number of restarts 5
stopping /etc/runit/runsvdir/default/patroni
I have the exact same issue. The problem is that those backup and restore scripts run under the postgres user.
It looks like it cannot get the AWS web identity token and falls back to the node's EC2 instance role:
postgres@restore-clone-test-postgresql-0:~$ aws sts get-caller-identity
{
"UserId": "AROAV2VJLLNLHOUT5Z35U:i-0912d11bcf58a8555",
"Account": "**************",
"Arn": "arn:aws:sts::**************:assumed-role/ec2-role"
}
Any updates on this issue? @dorsegal, could you share how to properly configure PGO to use the IRSA approach? Is it done with additional env vars (an extra ConfigMap / Secret for PGO)?
Same issue here; I've been trying to get the IRSA approach to work for a week now. Documentation for this is sparse, and IRSA support is lacking (the operator-ui can't even use IRSA because it only ships with WAL-E).
Executing the clone manually with
envdir "/run/etc/wal-e.d/env-clone-CLUSTER" python3 /scripts/clone_with_wale.py --recovery-target-time="2023-12-22T02:14:26.000+00:00" --scope CLUSTER --datadir /home/postgres/pgdata/test
works as both the root and postgres users, but when bootstrapping during pod startup it fails with AccessDenied: Access Denied, indicating that WAL-G does not correctly assume the specified AWS role via the ServiceAccount.
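For anyone checking their setup: IRSA only applies if the pods' ServiceAccount carries the role annotation. A minimal sketch of what that should look like, assuming the operator's default postgres-pod ServiceAccount name, a placeholder namespace, and a placeholder role ARN:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: postgres-pod                 # must match the operator's pod_service_account_name
  namespace: my-namespace            # placeholder
  annotations:
    # placeholder ARN; with this annotation, EKS injects AWS_ROLE_ARN and
    # AWS_WEB_IDENTITY_TOKEN_FILE into the pods
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/postgres-backup-s3-access-role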
@samox73 @tolikkostin We found a workaround for this, in case it helps: we had to map all the env properties from the default env dir ("/run/etc/wal-e.d/env") into the clone env dir ("/run/etc/wal-e.d/env-clone-xx/").
This can be done using the CLONE_-prefixed ConfigMap values.
I compared both directories and realized some of the values were missing; adding them fixed the issue.
Specifically these three values:
CLONE_AWS_STS_REGIONAL_ENDPOINTS: "regional"
CLONE_AWS_ROLE_ARN: "< arn >"
CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: "<same value as $AWS_WEB_IDENTITY_TOKEN_FILE>"
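Mapped into the operator's pod environment ConfigMap, that would look roughly like this; a sketch, where the ConfigMap name is an assumption and the token path should be verified against $AWS_WEB_IDENTITY_TOKEN_FILE in your pods:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-env-configmap            # assumed name, referenced via pod_environment_configmap
  namespace: postgres-operator       # assumed operator namespace
data:
  CLONE_AWS_STS_REGIONAL_ENDPOINTS: "regional"
  CLONE_AWS_ROLE_ARN: "arn:aws:iam::111111111111:role/postgres-backup-s3-access-role"  # placeholder ARN
  CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"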
Posting my solution here for anyone who may face the same issue.
I created these two Kubernetes resources:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-env-configmap
  namespace: postgres-operator
data:
  CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  CLONE_USE_WALG_RESTORE: "true"
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
---
apiVersion: v1
kind: Secret
metadata:
  name: pod-env-secret
  namespace: my-namespace
data:
  CLONE_AWS_ROLE_ARN: UHV0IHlvdXIgQVdTIHJvbGUgQVJOIGhlcmUgOikgdGhpcyBpcyBqdXN0IGFuIGV4YW1wbGU=
And then I set the following values in the operator's Helm chart release:
configKubernetes:
  pod_environment_configmap: postgres-operator/pod-env-configmap
  pod_environment_secret: pod-env-secret
Note: I set the CLONE_AWS_ROLE_ARN env var in the Secret instead of the ConfigMap only because in my scenario I have a different role (and therefore a different Secret) for each namespace.
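If you'd rather not base64-encode the value by hand, the same Secret can be written with stringData, which Kubernetes encodes into data automatically; a sketch with a placeholder ARN:

apiVersion: v1
kind: Secret
metadata:
  name: pod-env-secret
  namespace: my-namespace
stringData:
  # placeholder ARN; replace with the role your clone should assume
  CLONE_AWS_ROLE_ARN: arn:aws:iam::111111111111:role/postgres-backup-s3-access-role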