
iam for sa documentation

Open bck01215 opened this issue 1 year ago • 7 comments

Per #1124, you no longer need to use kube2iam. However, the docs say:

kube_iam_role AWS IAM role to supply in the iam.amazonaws.com/role annotation of Postgres pods. Only used when combined with kube2iam project on AWS. The default is empty.

I am new to this project and unsure how to implement this. It would be very useful if the docs showed how to use IAM roles for service accounts (IRSA).

bck01215 avatar Sep 18 '23 12:09 bck01215

Just went through this struggle.

Suffice it to say that the biggest missing piece for me was that you must opt in to using wal-g, since the default wal-e backup does not have STS support and will fail with access-denied errors even if your service account is provisioned correctly. This can be done via any number of the pod environment configurations described here, though I found the actual values you need to set in this article only after cluing in, from this issue comment, that this might even be necessary:

  • USE_WALG_BACKUP
  • USE_WALG_RESTORE
  • CLONE_USE_WALG_RESTORE

Happy to give more details and help spruce up the docs in this regard!

rusty-jules avatar Nov 02 '23 10:11 rusty-jules

Hey @rusty-jules. Have you managed to configure the zalando postgres-operator to use a service account with a bound AWS IAM policy for S3 access? Could you share how you did that?

tolikkostin avatar Dec 14 '23 07:12 tolikkostin

@tolikkostin Sure thing!

  1. In OperatorConfiguration, under configuration -> kubernetes -> pod_service_account_definition, add the required EKS IAM role annotation so that a service account token gets injected (you can hardcode your account id/IAM role name):
pod_service_account_definition: |
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: postgres-pod
    annotations:
      # NOTE: we are using flux kustomization substitution here to inject the account_id value at runtime
      eks.amazonaws.com/role-arn: arn:aws:iam::${aws_account_id}:role/${iam_role_name}
  2. In OperatorConfiguration, under configuration -> kubernetes -> pod_environment_configmap, reference a ConfigMap that you apply separately, in the form <namespace>/<configmap-name>. The ConfigMap should contain the environment variables I mentioned in the earlier comment as its keys:
# operatorconfiguration.yaml
pod_environment_configmap: postgres-operator/postgres-env

# postgres-env-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-env
  namespace: postgres-operator
data:
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  CLONE_USE_WALG_RESTORE: "true"
  3. In OperatorConfiguration, under configuration -> aws_or_gcp, set your S3 bucket values. Note that the "" values were intentionally left empty:
aws_or_gcp:
  aws_region: ${aws-region}
  enable_ebs_gp3_migration: false
  enable_ebs_gp3_migration_max_size: 1000
  gcp_credentials: ""
  kube_iam_role: "" # we are not using this field with this method, as this relies on a third-party project
  log_s3_bucket: ""
  wal_az_storage_account: ""
  wal_gs_bucket: ""
  wal_s3_bucket: ${my-backup-bucket} # this should just be the bucket name, no "s3://" prefix

Optionally, set my-backup-bucket as the value of configuration.logical_backup.logical_backup_s3_bucket and set logical_backup_provider to "s3".
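For reference, that optional step might look something like this in the OperatorConfiguration (a sketch; my-backup-bucket is a placeholder for your bucket name):

```yaml
# operatorconfiguration.yaml (fragment)
configuration:
  logical_backup:
    logical_backup_provider: "s3"
    logical_backup_s3_bucket: my-backup-bucket
```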

  4. Create an IAM role with permissions to access your my-backup-bucket S3 bucket, and an assume-role (trust) policy that allows the above ServiceAccount to assume it. Note that the S3 permissions must be attached to the IAM role, not granted via an S3 bucket policy. The assume-role policy may look something like this:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::${aws_account_id}:oidc-provider/oidc.eks.us-west-1.amazonaws.com/id/${eks_id}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-west-1.amazonaws.com/id/${eks_id}:aud": "sts.amazonaws.com"
                },
                "StringLike": {
                    "oidc.eks.us-west-1.amazonaws.com/id/${eks_id}:sub": "system:serviceaccount:*:postgres-pod"
                }
            }
        }
    ]
}
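
For completeness, the S3 permissions attached to the role itself (not shown above) could be sketched like this. This is an assumed minimal policy, not taken from the thread: wal-g needs to list the bucket and read/write/delete objects under it.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::${my-backup-bucket}"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::${my-backup-bucket}/*"
        }
    ]
}
```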

I believe that was it! wal-g should find the EKS-injected service account token automatically (via the environment variables that EKS injects when it sees the service account annotation) and use it when it attempts to write to the bucket.

I remember exec-ing onto the postgres pod directly and running manual backups to debug this. There is a script for that in the spilo container at /scripts/postgres_backup.sh, which will let you know if it can't access the bucket (I found out about this from #2067; the script must be run with the envdir command).

Also note that postgres-operator does not automatically pick up changes to the OperatorConfiguration CRD, if that's what you're using - you'll need to kick the deployment when you update it.
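Those debugging steps can be sketched as shell commands (pod and deployment names are examples; the envdir path is the one spilo uses for its WAL environment, assumed here):

```shell
# exec into the spilo pod (pod name is an example)
kubectl exec -it acid-mycluster-0 -- bash

# inside the pod: run a manual backup with the WAL environment loaded;
# an access-denied error here points at the IAM / service account setup
envdir /run/etc/wal-e.d/env /scripts/postgres_backup.sh "$PGDATA"

# restart the operator so it re-reads the OperatorConfiguration CRD
kubectl rollout restart deployment postgres-operator -n postgres-operator
```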

Hope this helps.

rusty-jules avatar Dec 17 '23 10:12 rusty-jules

@rusty-jules many thanks for your well-detailed guide! Awesome!

tolikkostin avatar Dec 18 '23 16:12 tolikkostin

@rusty-jules Have you tried to create a new cluster (restore) by cloning that backup?

oleksiytsyban avatar Feb 06 '24 01:02 oleksiytsyban

@oleksiytsyban sure have! I found that I needed a few more fields in the postgres-env ConfigMap, namely all of the fields that EKS sets automatically via the service account annotation, but with a CLONE_ prefix:

data:
  CLONE_AWS_REGION: "${aws_region}"
  CLONE_AWS_ROLE_ARN: "${iam_role_arn}"
  CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
  CLONE_AWS_STS_REGIONAL_ENDPOINTS: regional # this one depends on the settings of your account/vpc
  # this one was necessary on EC2 instances that use IMDSv2 metadata api only, since spilo still uses v1
  SPILO_PROVIDER: aws

Then set the clone uid and timestamp in the cloned cluster's custom resource:

apiVersion: acid.zalan.do/v1
kind: postgresql
spec:
  clone:
    uid: "${cloned_cluster_uid}" # get this from the k8s custom resource of the previous cluster, or the s3 backup key
    timestamp: "2024-02-05T08:00:00+00:00" # the timezone is required in the format of ±00:00 (UTC)

rusty-jules avatar Feb 06 '24 01:02 rusty-jules

@rusty-jules Thank you for confirming. I did something similar here: https://github.com/zalando/postgres-operator/issues/2067#issuecomment-1664786251 And created an issue about that: https://github.com/zalando/spilo/issues/897

I was hoping maybe the issue has been fixed and I can get rid of those additional variables. Not yet.

oleksiytsyban avatar Feb 06 '24 01:02 oleksiytsyban