
`aws-cloud-controller-manager`: SyncLoadBalancerFailed on new aws account

Open piec opened this issue 2 years ago • 12 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.25.3 (git-v1.25.3)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: v1.26.1
Kustomize Version: v4.5.7
Server Version: v1.25.6

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue? Create a new AWS account; I cannot reproduce this with an account where there is already a kops cluster running fine. Create a cluster with default options on AWS with Route 53, following the Getting Started guide. I'm using a kops IAM user with the correct permissions. Create a LoadBalancer service with the service.beta.kubernetes.io/aws-load-balancer-type: nlb annotation (doc); please see the collapsed yml below.

service and deployment to reproduce.yml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
  name: nginx
  namespace: default
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

5. What happened after the commands executed?

$ kubectl -n default get service nginx
NAME    TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
nginx   LoadBalancer   100.69.194.141   <pending>     80:32598/TCP   96s
$ kubectl -n default describe service nginx
[...]
  Warning  SyncLoadBalancerFailed  5s                service-controller  Error syncing load balancer: failed to ensure load balancer: error creating load balancer: "AccessDenied: User: arn:aws:sts::xxx:assumed-role/aws-cloud-controller-manager.kube-system.sa.ktest2.example.com/yyy is not authorized to perform: ec2:DescribeInternetGateways\n\tstatus code: 403, request id: a86cd923-7432-4910-8827-1e0abc411854"
$ kubectl -n kube-system logs daemonsets/aws-cloud-controller-manager --tail 10
[...]
I0215 17:12:02.745101       1 controller.go:417] Ensuring load balancer for service default/nginx
I0215 17:12:02.745703       1 event.go:294] "Event occurred" object="default/nginx" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0215 17:12:02.745784       1 aws.go:3984] EnsureLoadBalancer(ktest2.example.com, default, nginx, eu-central-1, , [{ TCP <nil> 80 {0 80 } 32598}], map[kubectl.kubernetes.io/last-applied-configuration:{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"service.beta.kubernetes.io/aws-load-balancer-type":"nlb"},"name":"nginx","namespace":"default"},"spec":{"externalTrafficPolicy":"Local","ports":[{"port":80,"protocol":"TCP","targetPort":80}],"selector":{"app":"nginx"},"type":"LoadBalancer"}}
 service.beta.kubernetes.io/aws-load-balancer-type:nlb])

It does not get better even if I let it run for at least an hour.

6. What did you expect to happen? Something like:

$ kubectl -n default get service nginx
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP                   PORT(S)                      AGE
nginx   LoadBalancer   100.71.32.90   [...].amazonaws.com   80:30599/TCP,443:30536/TCP   30s

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

cluster spec yml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-02-15T16:26:45Z"
  name: ktest2.example.com
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://ktest2-bucket/ktest2.example.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-central-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-central-1a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
    useServiceAccountExternalPermissions: true
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.25.6
  masterPublicName: api.ktest2.example.com
  networkCIDR: 172.20.0.0/16
  networking:
    cilium:
      enableNodePort: true
  nonMasqueradeCIDR: 100.64.0.0/10
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://ktest2-bucket/ktest2.example.com/discovery/ktest2.example.com
    enableAWSOIDCProvider: true
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.32.0/19
    name: eu-central-1a
    type: Public
    zone: eu-central-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
ig spec yml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-02-15T16:26:46Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: ktest2.example.com
  name: master-eu-central-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3.large
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-02-15T16:26:46Z"
  labels:
    kops.k8s.io/cluster: ktest2.example.com
  name: nodes-eu-central-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3.large
  maxSize: 2
  minSize: 2
  role: Node
  subnets:
  - eu-central-1a

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

I tried manually adding new permissions to the aws-cloud-controller-manager.kube-system.sa.ktest2.example.com role.

initial (failing) aws-cloud-controller-manager.kube-system.sa.ktest2.example.com role
{
    "Statement": [
        {
            "Action": "ec2:CreateTags",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/KubernetesCluster": "ktest2.example.com",
                    "ec2:CreateAction": [
                        "CreateSecurityGroup"
                    ]
                }
            },
            "Effect": "Allow",
            "Resource": [
                "arn:aws:ec2:*:*:security-group/*"
            ]
        },
        {
            "Action": [
                "ec2:CreateTags",
                "ec2:DeleteTags"
            ],
            "Condition": {
                "Null": {
                    "aws:RequestTag/KubernetesCluster": "true"
                },
                "StringEquals": {
                    "aws:ResourceTag/KubernetesCluster": "ktest2.example.com"
                }
            },
            "Effect": "Allow",
            "Resource": [
                "arn:aws:ec2:*:*:security-group/*"
            ]
        },
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeTags",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeInstances",
                "ec2:DescribeRegions",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "elasticloadbalancing:DescribeListeners",
                "elasticloadbalancing:DescribeLoadBalancerAttributes",
                "elasticloadbalancing:DescribeLoadBalancerPolicies",
                "elasticloadbalancing:DescribeLoadBalancers",
                "elasticloadbalancing:DescribeTargetGroups",
                "elasticloadbalancing:DescribeTargetHealth",
                "kms:DescribeKey"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:DeleteSecurityGroup",
                "ec2:ModifyInstanceAttribute",
                "ec2:RevokeSecurityGroupIngress",
                "elasticloadbalancing:AddTags",
                "elasticloadbalancing:ApplySecurityGroupsToLoadBalancer",
                "elasticloadbalancing:AttachLoadBalancerToSubnets",
                "elasticloadbalancing:ConfigureHealthCheck",
                "elasticloadbalancing:CreateLoadBalancerListeners",
                "elasticloadbalancing:CreateLoadBalancerPolicy",
                "elasticloadbalancing:DeleteListener",
                "elasticloadbalancing:DeleteLoadBalancer",
                "elasticloadbalancing:DeleteLoadBalancerListeners",
                "elasticloadbalancing:DeleteTargetGroup",
                "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
                "elasticloadbalancing:DeregisterTargets",
                "elasticloadbalancing:DetachLoadBalancerFromSubnets",
                "elasticloadbalancing:ModifyListener",
                "elasticloadbalancing:ModifyLoadBalancerAttributes",
                "elasticloadbalancing:ModifyTargetGroup",
                "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
                "elasticloadbalancing:RegisterTargets",
                "elasticloadbalancing:SetLoadBalancerPoliciesForBackendServer",
                "elasticloadbalancing:SetLoadBalancerPoliciesOfListener"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/KubernetesCluster": "ktest2.example.com"
                }
            },
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "ec2:CreateSecurityGroup",
                "elasticloadbalancing:CreateListener",
                "elasticloadbalancing:CreateLoadBalancer",
                "elasticloadbalancing:CreateTargetGroup"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/KubernetesCluster": "ktest2.example.com"
                }
            },
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": "ec2:CreateSecurityGroup",
            "Effect": "Allow",
            "Resource": "arn:aws:ec2:*:*:vpc/*"
        }
    ],
    "Version": "2012-10-17"
}
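For anyone comparing against their own cluster, an inline policy like the one above can be dumped with the AWS CLI. This is a sketch: the role name follows this issue's cluster, and the inline policy name must be looked up first.

```shell
# Role name follows the pattern from this issue; substitute your own cluster.
ROLE=aws-cloud-controller-manager.kube-system.sa.ktest2.example.com

# List the names of the inline policies attached to the role...
aws iam list-role-policies --role-name "$ROLE"

# ...then dump a specific one (substitute a name from the previous output).
aws iam get-role-policy --role-name "$ROLE" --policy-name <POLICY_NAME>
```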

After sequentially adding ec2:DescribeInternetGateways and iam:CreateServiceLinkedRole, the aws-cloud-controller-manager is able to create the LB.

Since it wants iam:CreateServiceLinkedRole, I took a look at the most recent IAM roles and noticed it had created a new AWSServiceRoleForElasticLoadBalancing. Once this role exists, the two permissions I added to aws-cloud-controller-manager can be removed and LBs are still created correctly. So apparently the only missing part was the AWSServiceRoleForElasticLoadBalancing role.
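Given that finding, one way around the problem is for an account administrator to create the service-linked role ahead of time, so the controller never needs iam:CreateServiceLinkedRole itself. A sketch, assuming administrator credentials are configured; the role and command names are standard AWS CLI:

```shell
# Pre-create the ELB service-linked role once per AWS account.
# If the role already exists this fails with InvalidInput, which is harmless.
aws iam create-service-linked-role \
    --aws-service-name elasticloadbalancing.amazonaws.com

# Confirm the role now exists.
aws iam get-role --role-name AWSServiceRoleForElasticLoadBalancing
```

Service-linked roles are account-wide, so this only needs to be done once, which matches the observation below that the role survives cluster deletion.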

What is the correct way to solve this issue? It seems additionalPolicies are not added to the aws-cloud-controller-manager role.

Thanks

related: #10753

piec avatar Feb 15 '23 17:02 piec

After you delete a kops cluster, the AWSServiceRoleForElasticLoadBalancing role stays, so it's hard to reproduce on an existing account.

piec avatar Feb 15 '23 17:02 piec

I use kops Client version: 1.25.3 (git-v1.25.3) and have the same problem. I do not see AWSServiceRoleForElasticLoadBalancing in my list of roles. I created a policy to allow iam:CreateServiceLinkedRole, which did not work for me:


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCreateServiceLinkedRoleForELB",
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "arn:aws:iam::<ID>:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing"
        }
    ]
}

Any ideas what is needed here?

danie-dejager avatar Feb 23 '23 11:02 danie-dejager

I updated the policy to the following and it is still failing:


{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Effect": "Allow",
     "Action": [
       "ec2:Describe*",
       "iam:CreateServiceLinkedRole",
       "tag:GetResources",
       "elasticloadbalancing:*"
     ],
     "Resource": [
       "*"
     ]
   }
 ]
}

Error syncing load balancer: failed to ensure load balancer: AccessDenied: User: arn:aws:sts:::assumed-role/masters.example.com/i-is not authorized to perform: iam:CreateServiceLinkedRole on resource: arn:aws:iam:::role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing because no identity-based policy allows the iam:CreateServiceLinkedRole action

danie-dejager avatar Feb 23 '23 13:02 danie-dejager

I found the problem. There is a role created for the cluster. This role is named masters.<cluster_name> and can be seen by following the ARN in the error: arn:aws:sts::<acc id>:assumed-role/masters.example.com/i-<EC2 instance>

To fix the problem, I added a new policy containing the following to the masters role:


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "arn:aws:iam::<acc ID>:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing",
            "Condition": {
                "StringLike": {
                    "iam:AWSServiceName": "elasticloadbalancing.amazonaws.com"
                }
            }
        }
    ]
}

BTW, this shortcoming is present in the latest 1.26.0-beta.2 (git-v1.26.0-beta.2) version too.

danie-dejager avatar Feb 23 '23 14:02 danie-dejager

@daniejstriata what solved it for me is to modify the aws-cloud-controller-manager.kube-system.sa.<cluster> role and add ec2:DescribeInternetGateways and iam:CreateServiceLinkedRole. I did not need to create a new role.
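In CLI form, that fix amounts to attaching a small inline policy to the controller's role. This is a sketch: the role name follows the pattern from this issue, and the inline policy name is made up here.

```shell
# Attach an inline policy granting the two missing actions to the
# controller's IRSA role. Role name follows the
# aws-cloud-controller-manager.kube-system.sa.<cluster> pattern from this
# issue; the policy name "ccm-extra-permissions" is hypothetical.
aws iam put-role-policy \
    --role-name aws-cloud-controller-manager.kube-system.sa.ktest2.example.com \
    --policy-name ccm-extra-permissions \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeInternetGateways",
            "iam:CreateServiceLinkedRole"
          ],
          "Resource": "*"
        }
      ]
    }'
```

Note that kops may reconcile this role on the next `kops update cluster`, so an out-of-band inline policy is a workaround rather than a durable fix.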

piec avatar Feb 23 '23 14:02 piec

It seems both our solutions do the same thing, just a little differently. The masters role should still be correctly created by kops. This was not an issue for me until I deleted the cluster and recreated it. Maybe kops did not validate the roles correctly when I created new clusters in the same region reusing the cluster name.

EDIT: I deleted the cluster and the masters role was deleted with it. I'll create a new cluster tomorrow morning and post the newly created masters policy here before I update it.

danie-dejager avatar Feb 23 '23 18:02 danie-dejager

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 24 '23 19:05 k8s-triage-robot

/remove-lifecycle stale

danie-dejager avatar May 25 '23 19:05 danie-dejager

Identify the IAM role: identify the IAM role associated with the aws-cloud-controller-manager service account and make a note of its name.

Update the IAM policy: use the AWS CLI to attach the required permissions to the IAM role. Replace YOUR_ROLE_NAME with the actual IAM role name:

aws iam attach-role-policy --role-name YOUR_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole

Verify the changes: check that the policy is attached to the IAM role and that it includes the necessary permissions:

aws iam list-attached-role-policies --role-name YOUR_ROLE_NAME

Then:

aws iam attach-role-policy --role-name YOUR_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess

The above command attaches the AmazonEC2FullAccess policy, which includes the ec2:DescribeInternetGateways permission. Note that this policy grants broad EC2 permissions, so consider narrowing it to match your security requirements.

For verification, run aws iam list-attached-role-policies --role-name YOUR_ROLE_NAME again and check that the newly attached policy is listed.

Now delete the old service and re-create it.

DIVYANSH856 avatar Aug 09 '23 20:08 DIVYANSH856

I have a similar issue on k8s 1.27.5 (1.25 and 1.26 as well) and kops 1.27.1. When creating a service of type LoadBalancer it initially fails a few times:

Error syncing load balancer: failed to ensure load balancer: Unable to update load balancer attributes during attribute sync: "AccessDenied: User: arn:aws:sts::xxx:assumed-role/masters.arku.k8s.local/i-xxx is not authorized to perform: elasticloadbalancing:ModifyLoadBalancerAttributes on resource: arn:aws:elasticloadbalancing:eu-central-1:xxx:loadbalancer/xxx because no identity-based policy allows the elasticloadbalancing:ModifyLoadBalancerAttributes action\n\tstatus code: 403, request id: xxx"

But after a few minutes it succeeds creating the LB. This happens for any later LB as well.

Adding AmazonEC2FullAccess to the IAM role masters.arku.k8s.local fixes the issue.
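A narrower alternative to AmazonEC2FullAccess would be an inline policy granting only the action named in the error. A sketch, using the masters role name from the comment above; the policy name is hypothetical, and other clusters may be missing additional actions:

```shell
# Grant only elasticloadbalancing:ModifyLoadBalancerAttributes (the action
# the AccessDenied error complains about) to the masters role.
# Policy name "elb-modify-attributes" is made up for this example.
aws iam put-role-policy \
    --role-name masters.arku.k8s.local \
    --policy-name elb-modify-attributes \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "elasticloadbalancing:ModifyLoadBalancerAttributes",
          "Resource": "*"
        }
      ]
    }'
```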

This happens for every cluster that I create in my account. I am provisioning tens of clusters per week, and this did not happen with v1.19.

Is there anything I can do to fix this in the kOps code? Is this a bug at all? The issue has been lying here untouched...

Opened https://github.com/kubernetes/kops/issues/15990 for that.

gekart avatar Sep 24 '23 20:09 gekart

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 29 '24 14:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 28 '24 15:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 29 '24 15:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 29 '24 15:03 k8s-ci-robot