
WAL-G clone from S3 (IAM Role): AccessDenied

Open · Khuman2 opened this issue on Aug 17, 2023 · 5 comments

Hello!

I'm trying, without success, to clone a PostgreSQL cluster managed by Zalando operator v1.10.0. I use AWS + S3 + IAM roles. Backups work fine and I can see them from a pod:

root@test-postgresql-0:/home/postgres# envdir "/run/etc/wal-e.d/env" wal-g backup-list
name                          modified             wal_segment_backup_start
base_000000010000000000000002 2023-08-17T12:47:14Z 000000010000000000000002
root@test-postgresql-0:/home/postgres# 

I tried to clone it in multiple ways; one of them uses this manifest:

spec:
  clone:
    uid: "d4355d73-f5d6-4786-a093-7239125a2a12"
    cluster: "test-postgresql"
    timestamp: "2023-08-17T12:47:00+00:00"
    s3_wal_path: "s3://postgres-backup/spilo/postgresql-test-postgresql/wal/14/"

When I check whether the new pod has access and can "see" the backup:

root@test-clone-postgresql-0:/home/postgres# envdir "/run/etc/wal-e.d/env-clone-test-postgresql/" wal-g backup-list
name                          modified             wal_segment_backup_start
base_000000010000000000000002 2023-08-17T12:47:14Z 000000010000000000000002
root@test-clone-postgresql-0:/home/postgres#

The answer is "YES", it can. I installed aws cli tool into the new pod and check access to my S3:

root@test-clone-postgresql-0:/home/postgres# aws sts get-caller-identity
{
    "UserId": "AROA4UUBI7GDK5ACMHPAB:botocore-session-1692281238",
    "Account": "1111111111",
    "Arn": "arn:aws:sts::1111111111:assumed-role/postgres-backup-s3-access-role/botocore-session-1692281238"
}

and listed the backup prefix:

root@test-clone-postgresql-0:/home/postgres# aws s3 ls s3://postgres-backup/spilo/postgresql-test-postgresql/wal/14/basebackups_005/
                           PRE base_000000010000000000000002/
2023-08-17 12:47:14        388 base_000000010000000000000002_backup_stop_sentinel.json

But the startup logs showed a weird result:

2023-08-17 12:59:10,460 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-08-17 12:59:10,461 INFO: Lock owner: None; I am test-clone-postgresql-0
2023-08-17 12:59:10,541 INFO: trying to bootstrap a new cluster
2023-08-17 12:59:10,541 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-test-postgresql" python3 /scripts/clone_with_wale.py --recovery-target-time="2023-08-17T12:47:00+00:00"
2023-08-17 12:59:10,597 INFO: Trying s3://postgres-backup/spilo/postgresql-test-postgresql/wal/14/ for clone
ERROR: 2023/08/17 12:59:13.824667 failed to list s3 folder: 'spilo/postgresql-test-postgresql/wal/14/basebackups_005/': AccessDenied: Access Denied
    status code: 403, request id: 5TPDP0S5NSN16QFA, host id: 4u4XDWohJwKE+q5R7qMUACDr2tdg//CBGCiyaHW56I8tRj1jeIDrDWpf1RRJeBKN9Y02xmQgXxez7NiihScwOw==
2023-08-17 12:59:13,827 ERROR: Clone failed
Traceback (most recent call last):
  File "/scripts/clone_with_wale.py", line 185, in main
    run_clone_from_s3(options)
  File "/scripts/clone_with_wale.py", line 166, in run_clone_from_s3
    backup_name, update_envdir = find_backup(options.recovery_target_time, env)
  File "/scripts/clone_with_wale.py", line 150, in find_backup
    backup_list = list_backups(env)
  File "/scripts/clone_with_wale.py", line 84, in list_backups
    output = subprocess.check_output(backup_list_cmd, env=env)
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['wal-g', 'backup-list']' returned non-zero exit status 1.
2023-08-17 12:59:13,834 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 144, in main
    return patroni_main()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 136, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 108, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 106, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 65, in run
    self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 109, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1771, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1593, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1484, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1477, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: exceeded maximum number of restarts 5
stopping /etc/runit/runsvdir/default/patroni

Khuman2 avatar Aug 17 '23 14:08 Khuman2

I have the exact same issue. The problem is that the backup and restore scripts run under the postgres user, which apparently cannot obtain the AWS web-identity token. Note the ARN below: instead of the IRSA role, the process falls back to the node's EC2 instance profile (ec2-role).

postgres@restore-clone-test-postgresql-0:~$ aws sts get-caller-identity
{
    "UserId": "AROAV2VJLLNLHOUT5Z35U:i-0912d11bcf58a8555",
    "Account": "**************",
    "Arn": "arn:aws:sts::**************:assumed-role/ec2-role"
}
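
If it helps with debugging, one way to check whether the postgres user can actually use IRSA is to verify that the projected token is readable and the injected variables are present. A diagnostic sketch; the token path is the EKS default:

# run inside the pod as root; su without "-" preserves most of the environment
su postgres -c 'cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token > /dev/null && echo token readable'
# the AWS SDK only attempts the web-identity flow if both variables are set;
# otherwise it falls back to the node's EC2 instance profile, as seen above
su postgres -c 'env | grep -E "AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE"'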

dorsegal avatar Nov 17 '23 10:11 dorsegal

Any updates on this issue? @dorsegal, could you share how to properly configure PGO to use the IRSA approach? Is it done with additional env vars (an extra ConfigMap / Secret for PGO)?

tolikkostin avatar Dec 15 '23 11:12 tolikkostin

Same issue here; I've been trying to get the IRSA approach to work for a week now. Documentation for this is sparse, and support for IRSA is lacking (the operator-ui can't even use IRSA because it only ships with WAL-E).

Executing the clone manually with

envdir "/run/etc/wal-e.d/env-clone-CLUSTER" python3 /scripts/clone_with_wale.py --recovery-target-time="2023-12-22T02:14:26.000+00:00" --scope CLUSTER --datadir /home/postgres/pgdata/test

works both as root and as the postgres user, but when bootstrapping during pod startup it fails with AccessDenied: Access Denied, indicating that WAL-G does not correctly assume the specified AWS role via the ServiceAccount.

samox73 avatar Dec 22 '23 13:12 samox73

@samox73 @tolikkostin We found a workaround for this, in case it helps: we had to map all the env properties from the default envdir ("/run/etc/wal-e.d/env") into the clone envdir ("/run/etc/wal-e.d/env-clone-xx/").

This can be done using the CLONE_-prefixed values in the pod environment ConfigMap.

I compared both directories and realized some of the values were missing from the clone envdir; adding them fixed the issue.
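
A quick way to do that comparison, since each file in an envdir is one variable (paths assume the standard Spilo layout and the cluster name used earlier in this thread):

# show variable names present in one envdir but not the other
diff <(ls /run/etc/wal-e.d/env) <(ls /run/etc/wal-e.d/env-clone-test-postgresql)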

Specifically, these three values were missing:

CLONE_AWS_STS_REGIONAL_ENDPOINTS: "regional"
CLONE_AWS_ROLE_ARN: "< arn >"
CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: "<same value as $AWS_WEB_IDENTITY_TOKEN_FILE>"
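
In pod-environment ConfigMap form these would look roughly like the fragment below (the role ARN and account ID are placeholders; the token path is the EKS default):

data:
  CLONE_AWS_STS_REGIONAL_ENDPOINTS: "regional"
  CLONE_AWS_ROLE_ARN: "arn:aws:iam::111111111111:role/postgres-backup-s3-access-role"  # placeholder
  CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"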

boney9 avatar Jan 14 '24 11:01 boney9

Posting my solution here for anyone who may face the same issue.

I created these two Kubernetes resources:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-env-configmap
  namespace: postgres-operator
data:
  CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  CLONE_USE_WALG_RESTORE: "true"
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
---
apiVersion: v1
kind: Secret
metadata:
  name: pod-env-secret
  namespace: my-namespace
data:
  CLONE_AWS_ROLE_ARN: UHV0IHlvdXIgQVdTIHJvbGUgQVJOIGhlcmUgOikgdGhpcyBpcyBqdXN0IGFuIGV4YW1wbGU=

And then I set the following values in the operator's Helm chart release:

configKubernetes:
  pod_environment_configmap: postgres-operator/pod-env-configmap
  pod_environment_secret: pod-env-secret

Note: I set the CLONE_AWS_ROLE_ARN env var in the Secret instead of the ConfigMap just because, in my scenario, I have a different role (and therefore a different Secret) for each namespace.
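
For completeness, a rough sketch of how these pieces get applied (the ARN is a placeholder; the chart repo and name follow the operator's official docs):

# base64-encode the role ARN for the Secret's data field
echo -n 'arn:aws:iam::111111111111:role/my-postgres-role' | base64

kubectl apply -f pod-env-configmap.yaml -f pod-env-secret.yaml

helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
helm upgrade --install postgres-operator postgres-operator-charts/postgres-operator \
  --namespace postgres-operator -f values.yaml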

dmotte avatar Mar 13 '24 16:03 dmotte