
operation error Elastic Load Balancing v2: DescribeLoadBalancers, get identity: get credentials: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded

Open g-bohncke opened this issue 1 year ago • 8 comments

Describe the bug

When running the latest chart version 1.10.1 (app version v2.10.1), we are encountering the following error:

operation error Elastic Load Balancing v2: DescribeLoadBalancers, get identity: get credentials: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded.

This seems to be related to the migration to AWS SDK Go v2, and it looks like the code ignores the vpcId and region from the Helm chart. The docs say: "Instead of depending on IMDSv2, you can specify the AWS Region and the VPC via the controller flags --aws-region and --aws-vpc-id." Yet the SDK still appears to pull the instance metadata. See cloud.go.
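
For reference, a minimal sketch of how these values can be passed through the Helm chart (value names per the chart; the cluster name and VPC ID are placeholders):

helm repo add eks https://aws.github.io/eks-charts
helm upgrade -i aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=<cluster-name> \
  --set region=us-east-1 \
  --set vpcId=<vpc-id>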

Steps to reproduce: install the latest version on a private cluster.

Expected outcome: that the service works.

Environment

  • AWS Load Balancer controller version v2.10.1
  • Kubernetes version 1.29
  • Using EKS (yes/no), if so version? Yes 1.29

Additional Context:

  • The latest IAM policy has been applied, and we apply the policy via the node (option B according to the docs).
  • We have already verified that all instances have a hop limit of 2.

g-bohncke avatar Nov 26 '24 14:11 g-bohncke

Hey @g-bohncke, if you look here, we always infer the vpc-id and region from the config first, if it's set, before inferring it from ec2metadata. So it should have worked for you. Can you let us know which Helm flags you are using to set these values?

shraddhabang avatar Nov 27 '24 21:11 shraddhabang

Hi, I think I have the same issue and I suspect it's a configuration problem. However, I can't find what it is. Maybe some guidance could help.

This is the error I see: {"level":"error","ts":"2025-01-07T20:39:55Z","msg":"Reconciler error","controller":"service","namespace":"database","name":"yugabyted-ui-service","reconcileID":"4e4d44e6-7394-4fb3-9469-cc5085c13282","error":"operation error Elastic Load Balancing v2: DescribeLoadBalancers, get identity: get credentials: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded"}

Here is what I did. This is a self-installed k8s cluster on AWS EC2 with Rancher. It doesn't have public IP addresses; to access the internet, the nodes sit behind a NAT. I'm currently using chart version 1.9.2 (but also had the issue with 1.11.0). The AWS LBC is started with the following arguments:

Args:
--cluster-name=testjs2
--ingress-class=alb
--aws-region=us-east-1
--aws-vpc-id=vpc-<REDACTED>
--enable-shield=false
--enable-waf=false
--enable-wafv2=false

Shield, WAF, and WAFv2 are disabled as documented here: https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/deploy/installation/#additional-requirements-for-isolated-cluster

I have enabled IMDSv2 and set the hop limit to 2:

aws ec2 describe-instances --instance-id i-<REDACTED> --query 'Reservations[].Instances[].MetadataOptions'
[
    {
        "State": "applied",
        "HttpTokens": "required",
        "HttpPutResponseHopLimit": 2,
        "HttpEndpoint": "enabled",
        "HttpProtocolIpv6": "disabled",
        "InstanceMetadataTags": "disabled"
    }
]
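
For reference, these options can be applied with something like the following (the instance ID is a placeholder):

aws ec2 modify-instance-metadata-options \
  --instance-id i-<REDACTED> \
  --http-tokens required \
  --http-put-response-hop-limit 2 \
  --http-endpoint enabled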

I have attached the policies to the nodes, option B from the following doc: https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/deploy/installation/#option-b-attach-iam-policies-to-nodes The policy applied to the worker nodes is the following: https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.11.0/docs/install/iam_policy.json
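
A rough sketch of how option B is typically wired up (the node role name and account ID are placeholders):

curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.11.0/docs/install/iam_policy.json
aws iam create-policy \
  --policy-name AWSLoadBalancerControllerIAMPolicy \
  --policy-document file://iam_policy.json
aws iam attach-role-policy \
  --role-name <worker-node-role> \
  --policy-arn arn:aws:iam::<account-id>:policy/AWSLoadBalancerControllerIAMPolicy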

I don't use TargetGroupBinding.

I have allowed incoming TCP connections to port 9443 on the worker nodes:

aws ec2 describe-instances --instance-id i-<REDACTED> --query 'Reservations[].Instances[].SecurityGroups[]'
[
    {
        "GroupId": "sg-0e6cdc5a83cea1d18",
        "GroupName": "rancher-nodes"
    }
]

aws ec2 describe-security-groups --group-ids sg-0e6cdc5a83cea1d18
...
        {
          "IpProtocol": "tcp",
          "FromPort": 9443,
          "ToPort": 9443,
          "UserIdGroupPairs": [
            {
              "UserId": "503561456987",
              "GroupId": "sg-0e6cdc5a83cea1d18"
            }
          ],
          "IpRanges": [],
          "Ipv6Ranges": [],
          "PrefixListIds": []
        },
 ...
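
For completeness, a rule like the one above can be created with something along these lines (mirroring the security group shown):

aws ec2 authorize-security-group-ingress \
  --group-id sg-0e6cdc5a83cea1d18 \
  --protocol tcp \
  --port 9443 \
  --source-group sg-0e6cdc5a83cea1d18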

What am I missing?

jsfrerot avatar Jan 07 '25 21:01 jsfrerot

So, I fixed my issue. TL;DR: set HttpPutResponseHopLimit to 3.

I found this documentation that explains how to access metadata from an EC2 host. Then I opened a shell in one of my pods and ran

curl -v http://169.254.169.254/latest/meta-data/ and got, of course, a 401 Unauthorized

You have to get a token to make the request:

TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/

For some reason, I could get a 401 when accessing http://169.254.169.254/latest/meta-data/, but a timeout when trying to get a token from http://169.254.169.254/latest/api/token!

I changed the HttpPutResponseHopLimit to 3:

aws ec2 modify-instance-metadata-options --instance-id i-1234567898abcdef0 --http-put-response-hop-limit 3

Then I could get a token from http://169.254.169.254/latest/api/token.

Hope this helps other folks waste less time on this!

jsfrerot avatar Jan 09 '25 14:01 jsfrerot

This helped orient me^^ thank you.

Some findings:

  1. The default EKS node group now sets IMDSv2 to Required and Http-Put-Response-Hop-Limit to 1 (if you don't specify a launch template). You will need a custom launch template if you want to continue down this IMDSv2 path, or alter the launch template that gets created after the node group is created; the values are buried in the Advanced section (see the sketch after this list). What is also interesting is that the default launch template is IMDSv1/v2 Optional, so there is a mismatch between the EKS node group UX and that.

  2. If you want to skip that and instead rely on the flags via Helm, the chart parameters are region and vpcId.
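
A hypothetical launch-template sketch that keeps IMDSv2 required but controls the hop limit (the template name and values are placeholders):

aws ec2 create-launch-template \
  --launch-template-name lbc-imds-hop-limit \
  --launch-template-data '{"MetadataOptions":{"HttpTokens":"required","HttpPutResponseHopLimit":2,"HttpEndpoint":"enabled"}}'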

gxpd-jjh avatar Jan 16 '25 00:01 gxpd-jjh

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 16 '25 00:04 k8s-triage-robot

Hello, I'm actually getting the same issue here using the pod identity setup, and I'm getting:

failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded

I am explicitly providing --aws-region and --aws-vpc-id via the Helm chart, and I can confirm the args are there. Why does it still try to get the metadata when I have explicitly provided them?

The thing is, I've seen it work with pod identity. I'm trying this in a new AWS account using the same setup, and now I don't understand why it's not leveraging the IAM role properly, or why it's trying to get metadata when I gave it the region and VPC ID (which worked before).

perezjasonr avatar May 13 '25 12:05 perezjasonr

So I got past this error by running

aws ec2 modify-instance-metadata-options --instance-id eks_node_id --http-put-response-hop-limit 3

for every EKS node. I'm confused why this was needed, because it worked in GovCloud without it, but in commercial I had to do it. Not sure where the difference is, but this worked for me.
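
If you need to do this across many nodes, here is a minimal loop sketch (assuming the instances carry the eks:cluster-name tag; the cluster name is a placeholder):

for id in $(aws ec2 describe-instances \
    --filters "Name=tag:eks:cluster-name,Values=<cluster-name>" \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  aws ec2 modify-instance-metadata-options \
    --instance-id "$id" --http-put-response-hop-limit 3
done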

perezjasonr avatar May 13 '25 16:05 perezjasonr

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 12 '25 21:06 k8s-triage-robot

This happened to me when I upgraded an EKS node group from AL2_x86_64 to AL2023_x86_64_STANDARD. The former (older) had a hop limit of 2; the latter has a hop limit of 1. I can't find the documentation about this anywhere, but I stumbled upon this issue: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/3695. Setting the limit back to 2 (the default for AL2 instances) solved the issue.

karunsiri avatar Jul 03 '25 09:07 karunsiri

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Aug 02 '25 10:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 02 '25 10:08 k8s-ci-robot

/reopen /remove-lifecycle rotten

I also provided --aws-region and --aws-vpc-id but it still tries to use IMDS and I get "context deadline exceeded".

dataviruset avatar Aug 13 '25 12:08 dataviruset

@dataviruset: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen /remove-lifecycle rotten

I also provided --aws-region and --aws-vpc-id but it still tries to use IMDS and I get "context deadline exceeded".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 13 '25 12:08 k8s-ci-robot

I don't know why no one has mentioned this, so I will. I had the same issue with an EKS cluster in AWS, and I solved it by creating an IRSA role: https://docs.aws.amazon.com/eks/latest/userguide/lbc-manifest.html After I had created the role, I recreated the target group binding kinds (I have two, for the internal and external LB) and it worked.
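
A sketch of the IRSA setup along the lines of the linked doc (the cluster name, account ID, and policy name are placeholders):

eksctl create iamserviceaccount \
  --cluster=<cluster-name> \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::<account-id>:policy/AWSLoadBalancerControllerIAMPolicy \
  --override-existing-serviceaccounts \
  --approve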

Fokines avatar Nov 13 '25 16:11 Fokines