`aws-cloud-controller-manager`: SyncLoadBalancerFailed on new aws account
/kind bug
1. What kops version are you running? The command kops version will display this information.
Client version: 1.25.3 (git-v1.25.3)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Client Version: v1.26.1
Kustomize Version: v4.5.7
Server Version: v1.25.6
3. What cloud provider are you using? AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Create a new AWS account; I cannot reproduce this with an account where a kops cluster is already running fine.
Create a cluster with default options on AWS with Route 53, following the Getting Started guide. I'm using a kops IAM user with the correct permissions.
Create a LoadBalancer service with the service.beta.kubernetes.io/aws-load-balancer-type: nlb annotation (doc); please see the collapsed YAML below.
service and deployment to reproduce.yml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
  name: nginx
  namespace: default
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
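A minimal way to apply the manifest above, assuming it is saved as nginx-nlb.yaml (the filename is just an example):
kubectl apply -f nginx-nlb.yaml
kubectl -n default get service nginx   # watch for an EXTERNAL-IP to be assigned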
5. What happened after the commands executed?
$ kubectl -n default get service nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx LoadBalancer 100.69.194.141 <pending> 80:32598/TCP 96s
$ kubectl -n default describe service nginx
[...]
Warning SyncLoadBalancerFailed 5s service-controller Error syncing load balancer: failed to ensure load balancer: error creating load balancer: "AccessDenied: User: arn:aws:sts::xxx:assumed-role/aws-cloud-controller-manager.kube-system.sa.ktest2.example.com/yyy is not authorized to perform: ec2:DescribeInternetGateways\n\tstatus code: 403, request id: a86cd923-7432-4910-8827-1e0abc411854"
$ kubectl -n kube-system logs daemonsets/aws-cloud-controller-manager --tail 10
[...]
I0215 17:12:02.745101 1 controller.go:417] Ensuring load balancer for service default/nginx
I0215 17:12:02.745703 1 event.go:294] "Event occurred" object="default/nginx" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0215 17:12:02.745784 1 aws.go:3984] EnsureLoadBalancer(ktest2.example.com, default, nginx, eu-central-1, , [{ TCP <nil> 80 {0 80 } 32598}], map[kubectl.kubernetes.io/last-applied-configuration:{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"service.beta.kubernetes.io/aws-load-balancer-type":"nlb"},"name":"nginx","namespace":"default"},"spec":{"externalTrafficPolicy":"Local","ports":[{"port":80,"protocol":"TCP","targetPort":80}],"selector":{"app":"nginx"},"type":"LoadBalancer"}}
service.beta.kubernetes.io/aws-load-balancer-type:nlb])
It does not get better even if I let it run for at least an hour.
6. What did you expect to happen? Something like:
$ kubectl -n default get service nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx LoadBalancer 100.71.32.90 [...].amazonaws.com 80:30599/TCP,443:30536/TCP 30s
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
cluster spec yml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-02-15T16:26:45Z"
  name: ktest2.example.com
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://ktest2-bucket/ktest2.example.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-central-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-central-1a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
    useServiceAccountExternalPermissions: true
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.25.6
  masterPublicName: api.ktest2.example.com
  networkCIDR: 172.20.0.0/16
  networking:
    cilium:
      enableNodePort: true
  nonMasqueradeCIDR: 100.64.0.0/10
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://ktest2-bucket/ktest2.example.com/discovery/ktest2.example.com
    enableAWSOIDCProvider: true
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.32.0/19
    name: eu-central-1a
    type: Public
    zone: eu-central-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
ig spec yml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-02-15T16:26:46Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: ktest2.example.com
  name: master-eu-central-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3.large
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-02-15T16:26:46Z"
  labels:
    kops.k8s.io/cluster: ktest2.example.com
  name: nodes-eu-central-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3.large
  maxSize: 2
  minSize: 2
  role: Node
  subnets:
  - eu-central-1a
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know?
I tried manually adding new permissions to the aws-cloud-controller-manager.kube-system.sa.ktest2.example.com role.
initial (failing) aws-cloud-controller-manager.kube-system.sa.ktest2.example.com role
{
  "Statement": [
    {
      "Action": "ec2:CreateTags",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/KubernetesCluster": "ktest2.example.com",
          "ec2:CreateAction": [
            "CreateSecurityGroup"
          ]
        }
      },
      "Effect": "Allow",
      "Resource": [
        "arn:aws:ec2:*:*:security-group/*"
      ]
    },
    {
      "Action": [
        "ec2:CreateTags",
        "ec2:DeleteTags"
      ],
      "Condition": {
        "Null": {
          "aws:RequestTag/KubernetesCluster": "true"
        },
        "StringEquals": {
          "aws:ResourceTag/KubernetesCluster": "ktest2.example.com"
        }
      },
      "Effect": "Allow",
      "Resource": [
        "arn:aws:ec2:*:*:security-group/*"
      ]
    },
    {
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeTags",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeRegions",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcs",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancerAttributes",
        "elasticloadbalancing:DescribeLoadBalancerPolicies",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeTargetGroups",
        "elasticloadbalancing:DescribeTargetHealth",
        "kms:DescribeKey"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:DeleteSecurityGroup",
        "ec2:ModifyInstanceAttribute",
        "ec2:RevokeSecurityGroupIngress",
        "elasticloadbalancing:AddTags",
        "elasticloadbalancing:ApplySecurityGroupsToLoadBalancer",
        "elasticloadbalancing:AttachLoadBalancerToSubnets",
        "elasticloadbalancing:ConfigureHealthCheck",
        "elasticloadbalancing:CreateLoadBalancerListeners",
        "elasticloadbalancing:CreateLoadBalancerPolicy",
        "elasticloadbalancing:DeleteListener",
        "elasticloadbalancing:DeleteLoadBalancer",
        "elasticloadbalancing:DeleteLoadBalancerListeners",
        "elasticloadbalancing:DeleteTargetGroup",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:DetachLoadBalancerFromSubnets",
        "elasticloadbalancing:ModifyListener",
        "elasticloadbalancing:ModifyLoadBalancerAttributes",
        "elasticloadbalancing:ModifyTargetGroup",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:SetLoadBalancerPoliciesForBackendServer",
        "elasticloadbalancing:SetLoadBalancerPoliciesOfListener"
      ],
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/KubernetesCluster": "ktest2.example.com"
        }
      },
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "ec2:CreateSecurityGroup",
        "elasticloadbalancing:CreateListener",
        "elasticloadbalancing:CreateLoadBalancer",
        "elasticloadbalancing:CreateTargetGroup"
      ],
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/KubernetesCluster": "ktest2.example.com"
        }
      },
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": "ec2:CreateSecurityGroup",
      "Effect": "Allow",
      "Resource": "arn:aws:ec2:*:*:vpc/*"
    }
  ],
  "Version": "2012-10-17"
}
After sequentially adding ec2:DescribeInternetGateways and iam:CreateServiceLinkedRole, the aws-cloud-controller-manager is able to create the LB.
Since it wants iam:CreateServiceLinkedRole, I took a look at the most recent IAM roles and noticed it had created a new AWSServiceRoleForElasticLoadBalancing. Once this role exists, the two permissions I added to aws-cloud-controller-manager can be removed and LBs can still be created correctly. So apparently the only missing piece was the AWSServiceRoleForElasticLoadBalancing role.
What is the correct way to solve this issue? I think additionalPolicies are not added to the aws-cloud-controller-manager role.
Thanks
related: #10753
After you delete a kops cluster, the AWSServiceRoleForElasticLoadBalancing role stays, so it's hard to reproduce on an existing account.
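As a workaround sketch (not something kops does for you), you can check whether the account already has the ELB service-linked role and, if needed, create it up front before the first LoadBalancer service; this assumes your CLI credentials have iam:GetRole and iam:CreateServiceLinkedRole:
# Does the service-linked role already exist in this account?
aws iam get-role --role-name AWSServiceRoleForElasticLoadBalancing
# If not, create it once per account; after that the extra
# iam:CreateServiceLinkedRole permission is no longer needed by the controller.
aws iam create-service-linked-role --aws-service-name elasticloadbalancing.amazonaws.com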
I use kops Client version: 1.25.3 (git-v1.25.3) and have the same problem. I do not see AWSServiceRoleForElasticLoadBalancing in my list of roles. I created a policy to allow iam:CreateServiceLinkedRole, which did not work for me:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCreateServiceLinkedRoleForELB",
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::<ID>:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing"
    }
  ]
}
Any ideas what is needed here?
I updated the role to the following and it is still failing:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "iam:CreateServiceLinkedRole",
        "tag:GetResources",
        "elasticloadbalancing:*"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
Error syncing load balancer: failed to ensure load balancer: AccessDenied: User: arn:aws:sts::
I found the problem. There is a role created for the cluster. This role is named masters.<cluster_name> and can be seen by following the ARN in the error:
arn:aws:sts::<acc id>:assumed-role/masters.example.com/i-<EC2 Instance2>
I added a new policy containing the following to the masters role to fix the problem:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::<acc ID>:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing",
      "Condition": {
        "StringLike": {
          "iam:AWSServiceName": "elasticloadbalancing.amazonaws.com"
        }
      }
    }
  ]
}
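If you prefer to attach that statement from the CLI rather than the console, a rough sketch (the policy name and the JSON file name are placeholders I made up, and masters.example.com stands for your cluster's masters role):
# Save the JSON above as elb-service-linked-role.json, then attach it as an inline policy:
aws iam put-role-policy \
  --role-name masters.example.com \
  --policy-name allow-elb-service-linked-role \
  --policy-document file://elb-service-linked-role.json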
BTW, this is a shortcoming in the latest 1.26.0-beta.2 (git-v1.26.0-beta.2) version too.
@daniejstriata what solved it for me is to modify the aws-cloud-controller-manager.kube-system.sa.<cluster> role and add ec2:DescribeInternetGateways and iam:CreateServiceLinkedRole. I did not need to create a new role.
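A sketch of how that could look as an extra inline policy on the service-account role (the policy name is just an example, not something kops generates, and the role name should match your own cluster):
aws iam put-role-policy \
  --role-name aws-cloud-controller-manager.kube-system.sa.ktest2.example.com \
  --policy-name ccm-extra-permissions \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["ec2:DescribeInternetGateways", "iam:CreateServiceLinkedRole"],
        "Resource": "*"
      }
    ]
  }'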
It seems both our solutions do the same thing, just a little differently. The masters role should still be correctly created by kops. This was not an issue for me until I deleted the cluster and recreated it. Maybe kops did not validate the roles correctly when I created new clusters in the same region reusing the cluster name.
EDIT: I deleted the cluster and the masters role was deleted. I'll create a new cluster tomorrow morning and post the newly created masters policy here before I update it.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Identify the IAM role: identify the IAM role associated with the aws-cloud-controller-manager service account and make a note of its name or ARN.
Update the IAM policy: use the AWS CLI to add the required permissions to the IAM role. Replace YOUR_ROLE_NAME_OR_ARN with the actual IAM role name or ARN:
aws iam attach-role-policy --role-name YOUR_ROLE_NAME_OR_ARN --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole
This command attaches the AWSLambdaVPCAccessExecutionRole policy, which includes the ec2:DescribeInternetGateways and iam:CreateServiceLinkedRole permissions.
Verify the changes: check that the policy is attached to the IAM role using the AWS CLI, and confirm in the output that the attached policy includes the necessary permissions:
aws iam list-attached-role-policies --role-name YOUR_ROLE_NAME_OR_ARN
Alternatively:
aws iam attach-role-policy --role-name IAM_ROLE_NAME_OR_ARN --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
The above command attaches the AmazonEC2FullAccess policy, which includes the ec2:DescribeInternetGateways permission. Note that this policy provides broad EC2 permissions, so adjust it to match your security requirements.
For verification, run aws iam list-attached-role-policies --role-name IAM_ROLE_NAME_OR_ARN and check that the newly attached policy is listed.
Then delete the old service and re-create it.
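One way to double-check whether a given role actually has the permissions the controller is asking for, before re-creating the service, is the IAM policy simulator from the CLI (just a verification sketch; the role ARN is a placeholder for your own cluster's role):
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::<acc id>:role/aws-cloud-controller-manager.kube-system.sa.ktest2.example.com \
  --action-names ec2:DescribeInternetGateways iam:CreateServiceLinkedRole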
I have a similar issue on k8s 1.27.5 (1.25 and 1.26 as well) and kops 1.27.1. When creating a service of type LoadBalancer it initially fails a few times:
Error syncing load balancer: failed to ensure load balancer: Unable to update load balancer attributes during attribute sync: "AccessDenied: User: arn:aws:sts::xxx:assumed-role/masters.arku.k8s.local/i-xxx is not authorized to perform: elasticloadbalancing:ModifyLoadBalancerAttributes on resource: arn:aws:elasticloadbalancing:eu-central-1:xxx:loadbalancer/xxx because no identity-based policy allows the elasticloadbalancing:ModifyLoadBalancerAttributes action\n\tstatus code: 403, request id: xxx"
But after a few minutes it succeeds in creating the LB. This happens for any later LB as well.
Adding AmazonEC2FullAccess to the IAM role masters.arku.k8s.local fixes the issue.
This happens for any cluster that I create in my account. I am provisioning tens of clusters per week, and this did not happen with v1.19.
Is there anything I can do to fix this in the kOps code? Is this a bug at all? The issue has been lying around here untouched...
Opened https://github.com/kubernetes/kops/issues/15990 for that.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.