
AWS sts:AssumeRole stopped working with role/OrganizationAccountAccessRole in 1.30.x

vitaliyf opened this issue (status: open)

/kind bug

1. What kops version are you running? The command kops version will display this information.

Testing an upgrade from Client version: 1.29.2 (git-v1.29.2) to Client version: 1.30.1 (git-v1.30.1)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

v1.29.9

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops_v1.30.1 update cluster. There are no other changes to the manifest or environment; the only difference is running the newer kops binary.

5. What happened after the commands executed?

$ export AWS_PROFILE=company-name-dev3
$ kops_v1.30.1 update cluster

SDK 2024/09/20 14:31:06 DEBUG request failed with unretryable error https response error StatusCode: 403, RequestID: 623bd87e-11e1-4b06-9f16-10f60ba2f030, api error AccessDenied: User: arn:aws:sts::[redacted]006:assumed-role/OrganizationAccountAccessRole/aws-go-sdk-1726842666098977639 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::[redacted]006:role/OrganizationAccountAccessRole
Error: error determining default DNS zone: error querying zones: error listing hosted zones: operation error Route 53: ListHostedZones, get identity: get credentials: failed to refresh cached credentials, operation error STS: AssumeRole, https response error StatusCode: 403, RequestID: 623bd87e-11e1-4b06-9f16-10f60ba2f030, api error AccessDenied: User: arn:aws:sts::[redacted]006:assumed-role/OrganizationAccountAccessRole/aws-go-sdk-1726842666098977639 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::[redacted]006:role/OrganizationAccountAccessRole
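Read literally, the error says that the caller is already an assumed-role session for OrganizationAccountAccessRole in account ...006, and that this session is calling sts:AssumeRole on that same role. One possible reading is that the profile's role is being applied a second time on top of credentials that already hold it. The Go sketch below is a stdlib-only illustration that parses the two ARNs from such a message and checks for that "re-assume the role I already have" pattern. The function names are hypothetical and are not part of kops or the AWS SDK, and the 12-digit account ID is a placeholder because the real one is redacted.

```go
package main

import (
	"fmt"
	"strings"
)

// roleFromAssumedRoleARN extracts the account ID and role name from an STS
// assumed-role ARN such as
// arn:aws:sts::123456789012:assumed-role/RoleName/session-name.
func roleFromAssumedRoleARN(arn string) (account, role string, ok bool) {
	parts := strings.SplitN(arn, ":", 6)
	if len(parts) != 6 || parts[0] != "arn" || parts[2] != "sts" {
		return "", "", false
	}
	res := strings.Split(parts[5], "/")
	if len(res) < 3 || res[0] != "assumed-role" {
		return "", "", false
	}
	return parts[4], res[1], true
}

// roleFromIAMRoleARN extracts the account ID and role name from an IAM role
// ARN such as arn:aws:iam::123456789012:role/RoleName. Any role path is
// dropped, because assumed-role ARNs only carry the bare role name.
func roleFromIAMRoleARN(arn string) (account, role string, ok bool) {
	parts := strings.SplitN(arn, ":", 6)
	if len(parts) != 6 || parts[0] != "arn" || parts[2] != "iam" {
		return "", "", false
	}
	if !strings.HasPrefix(parts[5], "role/") {
		return "", "", false
	}
	res := strings.Split(parts[5], "/")
	return parts[4], res[len(res)-1], true
}

// reassumesSameRole reports whether a caller that is already an assumed-role
// session is trying to assume the very same role (same account, same name).
func reassumesSameRole(callerARN, targetRoleARN string) bool {
	ca, cr, ok1 := roleFromAssumedRoleARN(callerARN)
	ta, tr, ok2 := roleFromIAMRoleARN(targetRoleARN)
	return ok1 && ok2 && ca == ta && cr == tr
}

func main() {
	// Placeholder account ID; the real one is redacted in the report.
	caller := "arn:aws:sts::000000000006:assumed-role/OrganizationAccountAccessRole/aws-go-sdk-1726842666098977639"
	target := "arn:aws:iam::000000000006:role/OrganizationAccountAccessRole"
	fmt.Println(reassumesSameRole(caller, target)) // true
}
```

By default a role's trust policy does not allow the role to assume itself. If credentials are resolved twice in this way, an AccessDenied like the one above would be the expected outcome.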

6. What did you expect to happen?

With kops 1.29.2, the output shows the proposed changes, which then need to be applied with --yes.

AWS CLI is able to successfully get Route53 zones from the same shell:

$ aws route53 list-hosted-zones
{
    "HostedZones": [
        {
            "Id": "/hostedzone/Z0[redacted]",
            "Name": "k8s.dev3.us-west-2.example.com.",
            "CallerReference": "8e483d8f-0d3c-4bcc-9c68-ecb4dea807ae",
            "Config": {
                "PrivateZone": false
            },
            "ResourceRecordSetCount": 8
        }
    ]
}

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

https://gist.github.com/vitaliyf/cfddd9ad771ee613ee850bb9e2d3fe14

9. Anything else do we need to know?

$ cat ~/.aws/config
[default]
region = us-west-2

[profile company-name]
aws_account_id = company-name
region = us-west-2
output = json
color = ff0000

[profile company-name-dev1]
role_arn = arn:aws:iam::[redacted]385:role/OrganizationAccountAccessRole
source_profile = company-name

[profile company-name-dev2]
role_arn = arn:aws:iam::[redacted]813:role/OrganizationAccountAccessRole
source_profile = company-name
color = 00ff00

[profile company-name-dev3]
role_arn = arn:aws:iam::[redacted]006:role/OrganizationAccountAccessRole
source_profile = company-name
color = 0000ff
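For reference, here is how the dev3 profile is expected to resolve. AWS_PROFILE=company-name-dev3 sets role_arn to OrganizationAccountAccessRole in account ...006 with source_profile = company-name. The base credentials therefore come from company-name, and the role should be assumed exactly once. The Go sketch below walks that role_arn/source_profile chain for the config above, with dev1 and dev2 omitted for brevity. It uses a deliberately simplified, hypothetical INI reader. It is not the AWS SDK's shared-config loader, and it ignores options such as credential_source and web identity.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Subset of the ~/.aws/config shown in the report.
const config = `[default]
region = us-west-2

[profile company-name]
aws_account_id = company-name
region = us-west-2

[profile company-name-dev3]
role_arn = arn:aws:iam::[redacted]006:role/OrganizationAccountAccessRole
source_profile = company-name
color = 0000ff
`

// parseAWSConfig understands only [default] / [profile NAME] headers and
// "key = value" lines; comments and blank lines are skipped.
func parseAWSConfig(text string) map[string]map[string]string {
	profiles := map[string]map[string]string{}
	current := ""
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";"):
			continue
		case strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]"):
			name := strings.TrimSpace(line[1 : len(line)-1])
			current = strings.TrimSpace(strings.TrimPrefix(name, "profile "))
			if profiles[current] == nil {
				profiles[current] = map[string]string{}
			}
		default:
			key, value, found := strings.Cut(line, "=")
			if found && current != "" {
				profiles[current][strings.TrimSpace(key)] = strings.TrimSpace(value)
			}
		}
	}
	return profiles
}

// resolveRoleChain follows role_arn/source_profile links from profile and
// returns the profile supplying the base credentials (for company-name,
// presumably ~/.aws/credentials, which the report does not show) plus the
// role ARNs to assume, in the order they would be assumed.
func resolveRoleChain(profiles map[string]map[string]string, profile string) (base string, roles []string, err error) {
	seen := map[string]bool{}
	for {
		if seen[profile] {
			return "", nil, fmt.Errorf("source_profile cycle at %q", profile)
		}
		seen[profile] = true
		p, ok := profiles[profile]
		if !ok {
			return "", nil, fmt.Errorf("profile %q not found", profile)
		}
		roleARN, hasRole := p["role_arn"]
		if !hasRole {
			return profile, roles, nil
		}
		// The outermost profile's role is assumed last, so prepend.
		roles = append([]string{roleARN}, roles...)
		src, hasSrc := p["source_profile"]
		if !hasSrc {
			return "", nil, fmt.Errorf("profile %q has role_arn but no source_profile", profile)
		}
		profile = src
	}
}

func main() {
	base, roles, err := resolveRoleChain(parseAWSConfig(config), "company-name-dev3")
	if err != nil {
		panic(err)
	}
	fmt.Println("base credentials from:", base)
	for i, r := range roles {
		fmt.Printf("assume #%d: %s\n", i+1, r)
	}
}
```

With this config the chain has a single hop. Base credentials come from company-name, and OrganizationAccountAccessRole in ...006 is assumed once. That matches what the AWS CLI does successfully in the same shell.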

This cluster has been continuously upgraded one kops/Kubernetes version at a time for at least a couple of years, so testing and executing such upgrades in place is routine for us.

I looked around, and I suspect this is related to the aws-sdk-go-v2 upgrade.

For example, there is this SDK issue: https://github.com/aws/aws-sdk-go-v2/issues/2686. Coincidentally or not, that ticket is referenced by https://github.com/cert-manager/cert-manager/pull/7236, where they are also dealing with a "Missing Region" error, just like https://github.com/kubernetes/kops/issues/16645 from kops 1.30.0.

vitaliyf, Sep 20 '24

Workaround: use awsudo or other workarounds from https://kops.sigs.k8s.io/mfa/#the-workaround-2

$ awsudo company-name-dev3 kops_v1.30.1 update cluster

...
  	                    	+ NODEUP_URL_AMD64=https://artifacts.k8s.io/binaries/kops/1.30.1/linux/amd64/nodeup,https://github.com/kubernetes/kops/releases/download/v1.30.1/nodeup-linux-amd64
  	                    	- NODEUP_URL_AMD64=https://artifacts.k8s.io/binaries/kops/1.29.2/linux/amd64/nodeup,https://github.com/kubernetes/kops/releases/download/v1.29.2/nodeup-linux-amd64
... (more output, as expected) ...

Must specify --yes to apply changes

vitaliyf, Sep 20 '24

FYI @rifelpet

hakman, Sep 22 '24

Any progress on this?

aramhakobyan, Nov 26 '24

I don't have an AWS Organization setup that would let me reproduce this bug, and given that the awsudo workaround is straightforward, I haven't made any progress.

If anyone wants to contribute a fix to the AWS SDK code in this file, I'm happy to review it and test it against other ~/.aws/config setups.

rifelpet, Nov 30 '24

Hey @rifelpet, it seems the problem may be that when we obtain the region from IMDS, we also use it to configure STS, and configuring STS that way causes issues for some configs.

I've tested the following change locally and it seems to be working for me: https://github.com/jvaldron/kops/commit/fed5ac36417203ec2446a5041bbc5ef9bf01485d

If this makes sense, I'll open a PR for it.

jValdron, Dec 09 '24

@jValdron yes feel free to open a PR for it

rifelpet, Dec 17 '24