
error retrieving credentials: context canceled

Open sushsampath opened this issue 6 years ago • 17 comments

We are observing the errors below when KIAM is serving a high volume of requests for a simple IAM role that lists S3 buckets. Error on the kiam server:

{"generation.metadata":0,"level":"error","msg":"error retrieving credentials: context canceled","pod.iam.requestedRole":"kiam-s3-full-access-test2","pod.iam.role":"kiam-s3-full-access-test2","pod.name":"kiam-test-api-7fd8ddcb9c-swhjj","pod.namespace":"default","pod.status.ip":"100.108.192.27","pod.status.phase":"Running","resource.version":"28288891","time":"2019-02-20T09:46:57Z"}
{"level":"error","msg":"error retrieving credentials in cache from future: context canceled. will delete","pod.iam.role":"kiam-s3-full-access-test2","time":"2019-02-20T09:53:21Z"}

Error on the kiam agent:

{"addr":"100.108.192.24:59340","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/kiam-s3-full-access-test2","status":500,"time":"2019-02-20T09:55:23Z"}

No connectivity issues were observed between the server and agent in the logs (we checked the agent and server logs with the log level set to debug). We also tried with GRPC_GO_LOG_SEVERITY_LEVEL=info and GRPC_GO_LOG_VERBOSITY_LEVEL=8.
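For reference, those gRPC variables were set on the kiam agent and server containers roughly like this (illustrative snippet, not the exact manifest):

            - name: GRPC_GO_LOG_SEVERITY_LEVEL
              value: "info"
            - name: GRPC_GO_LOG_VERBOSITY_LEVEL
              value: "8"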

I see a related issue at https://github.com/uswitch/kiam/issues/145. Can someone please suggest which timeout has to be increased, or anything we might be missing on the config side?

Thanks

sushsampath avatar Feb 20 '19 09:02 sushsampath

Same issue here. I also added these vars:

            - name: AWS_METADATA_SERVICE_TIMEOUT
              value: "5"
            - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
              value: "20"

But the issue still persists (from time to time).

Mitrofanov avatar Apr 01 '19 09:04 Mitrofanov

Could you set up the Prometheus dashboard and keep an eye on the metrics? From what you've reported it looks like the agent/server is slow to respond with credentials (having identified the role to be used by the pod), so my guess is that it's either a large influx of sts:AssumeRole requests (we measure around 500ms per call) or potentially some locking within the credentials cache that can't keep up with client expectations. Did you set those env vars for Kiam or within your client app?
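If it helps anyone reading along: assuming your Prometheus honours the common prometheus.io scrape annotations, a minimal sketch for getting the kiam-server metrics picked up is to annotate the server's pod template. The port shown here is an assumption; it must match whatever Prometheus listen address your kiam server is actually started with.

      template:
        metadata:
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "9620"   # assumption: match the server's Prometheus listen address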

pingles avatar Apr 01 '19 16:04 pingles

I had similar issues with contexts being cancelled and keys not being returned successfully. I've solved these context-cancelled errors by updating the resource requests and limits in the DaemonSets of both the server and the agent. It seems like it was simply being starved of resources (in our case the CPU requests were set to 5m).

For example:

          limits:
            cpu: 500m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi

jornargelo avatar Apr 03 '19 07:04 jornargelo

@pingles I've set those vars for all my client apps. I also completely removed the limits for now, with no luck. I also have a Grafana dashboard in place; I observe about 2.5K cache hits and about 50 cache misses on average.

I also just tried to use the KIAM 3.2 local STS endpoint; it doesn't look like it works at all (see my questions in the Slack channel https://kubernetes.slack.com/messages/CBQLKVABH). Do you have other suggestions?

Mitrofanov avatar Apr 03 '19 07:04 Mitrofanov

Any thoughts so far? I'm really stuck at this point and am going to fall back to kube2iam if I can't solve this issue.

Mitrofanov avatar Apr 05 '19 06:04 Mitrofanov

@Mitrofanov, can you post your exact logs here? I used to have different issues myself; maybe I can help you.

axozoid avatar Jun 25 '19 04:06 axozoid

I'm hitting this too. Out of our 200 services, the only one hitting it is a cronjob that does some pretty heavy S3 work. No sts:AssumeRole going on though.

jaygorrell avatar Jul 19 '19 13:07 jaygorrell

Facing the same issue

{"level":"debug","msg":"evicted credentials future had error: InvalidClientTokenId: The security token included in the request is invalid\n\tstatus code: 403, request id: 9ebc686e-b768-11e9-a5a1-d94beed402e5","pod.iam.role":"qa-james-kiam-testrole","time":"2019-08-05T10:05:57Z"}
{"generation.metadata":0,"level":"error","msg":"error retrieving credentials: context canceled","pod.iam.requestedRole":"qa-james-kiam-testrole","pod.iam.role":"qa-james-kiam-testrole","pod.name":"demo-8c4f9ffb-b9dl5","pod.namespace":"default","pod.status.ip":"10.244.0.149","pod.status.phase":"Running","resource.version":"328970","time":"2019-08-05T10:05:57Z"}```

junaid18183 avatar Aug 05 '19 10:08 junaid18183

I am facing a similar issue. I am attempting to scale a deployment from 2 to 40 pods; I have AWS_METADATA_SERVICE_TIMEOUT=5 and AWS_METADATA_SERVICE_NUM_ATTEMPTS=5 set, which did not make any difference.

I see a flood of "error finding role for pod: rpc error: code = DeadlineExceeded desc = context deadline exceeded" in the logs.

stefansedich avatar Aug 10 '19 19:08 stefansedich

Interesting. I run on EKS and was using interface=!eth15, as I found that recommended in a blog somewhere. I just swapped over to interface=!eth0, as the docs recommend for the VPC CNI, scaled up and down from 1 to 50 multiple times, and there is not a single failure in sight.

Update: it turns out using interface=eni+ also works fine for me. I did a few scale-ups to 100 pods with no issues. I may have to increase the max retries for the boto3 client, as I found 5 is perhaps not enough during the scale-up to 100 pods, but everything is behaving right now.
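For anyone else landing here: the interface pattern is an argument to the kiam agent DaemonSet. A minimal sketch of where it goes, assuming the agent exposes it via its host-interface flag (double-check the flag names and values against your kiam version/chart):

          containers:
            - name: kiam-agent
              args:
                - --iptables
                - --host-interface=!eth0   # or eni+ / cali+, depending on your CNI
                - --server-address=kiam-server:443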

stefansedich avatar Aug 10 '19 20:08 stefansedich

+1 I have this issue in a kops cluster on a container that is constantly polling SQS. It only happens once or twice every few hours.

{"stream":"stderr","time":"2019-10-11T03:03:48.24488907Z","kubernetes":{"container_name":"kiam","namespace_name":"kube-system","pod_name":"kiam-server-c5hpk","container_image":"quay.io/uswitch/kiam:v3.0","host":"ixx"},"level":"error","msg":"error requesting credentials: RequestCanceled: request context canceled\ncaused by: context canceled","pod.iam.role":"xx-IAMRole-xx","service":"kiam","pod_name":"kiam-server-c5hpk","environment":"exp","log_id":"eeeb7331-a390-431f-a326-40c5fb787152"}

{"stream":"stderr","time":"2019-10-11T03:03:48.2449395Z","kubernetes":{"container_name":"kiam","namespace_name":"kube-system","pod_name":"kiam-server-c5hpk","container_image":"quay.io/uswitch/kiam:v3.0","host":"ixxrnal"},"generation.metadata":0,"level":"error","msg":"error retrieving credentials: context canceled","pod.iam.requestedRole":"xx-IAMRole-xx","pod.iam.role":"xx-IAMRole-xx","pod.name":"queue-worker-bf8dc8c9c-kptrg","pod.namespace":"lapis","pod.status.ip":"1xx4","pod.status.phase":"Running","resource.version":"415668","service":"kiam","pod_name":"kiam-server-c5hpk","environment":"exp","log_id":"e49556a6-494b-4293-9df0-a61c59fcddb1"}

~I figured I better update to the latest kiam version after posting this, which I've just done. I'll see if it continues to happen.~

Can confirm, still happens on latest version.

Update @ 24 Oct: went back to using kube2iam. Haven't had any issues :( It's really unfortunate, I love the extra features of kiam.

callum-p avatar Oct 11 '19 03:10 callum-p

Any update on this? We're having the same issue on our cluster. Kiam pods are not limited (QoS is BestEffort) and we're running 2 replicas of the kiam-server on dedicated nodes.

michelesr avatar Nov 01 '19 11:11 michelesr

> I had similar issues with contexts being cancelled and keys not being returned successfully. I've solved these context-cancelled errors by updating the resource requests and limits in the DaemonSets of both the server and the agent. It seems like it was simply being starved of resources (in our case the CPU requests were set to 5m).
>
> For example:
>
>           limits:
>             cpu: 500m
>             memory: 200Mi
>           requests:
>             cpu: 100m
>             memory: 100Mi

This actually helped fix it. For our cluster, the kiam server needs ~200m CPU when it starts up.

bhalothia avatar Nov 15 '19 13:11 bhalothia

This is an ongoing problem for us as well. One of our S3-heavy services consistently gets the error shown below. It usually seems to coincide with heavier load, but not always:

kiam-agent-knxvb kiam {"addr":"100.96.3.154:58930","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/service","status":500,"time":"2021-09-30T11:15:36Z"}

We currently have these environment variables set on the server and agent:

            - name: AWS_METADATA_SERVICE_TIMEOUT
              value: "5"
            - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
              value: "20"

kiam-server resources:

    Limits:
      cpu:     800m
      memory:  1000Mi
    Requests:
      cpu:      200m
      memory:   250Mi

kiam-agent resources:

    Limits:
      cpu:     500m
      memory:  200Mi
    Requests:
      cpu:     100m
      memory:  100Mi

How else can we troubleshoot this issue?

BuffaloWill avatar Sep 30 '21 13:09 BuffaloWill

Is there a working solution for this? We are seeing the issue intermittently.

time="2024-02-22T22:43:12Z" level=error msg="error requesting credentials: RequestCanceled: request context canceled\ncaused by: context canceled"

{"addr":"","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"","status":500,"time":"2024-02-22T22:43:12Z"}

Hari-Krish-na avatar Mar 01 '24 07:03 Hari-Krish-na

@Hari-Krish-na this project is abandonware, unfortunately :/

2rs2ts avatar Mar 01 '24 22:03 2rs2ts

Hi,

We'd no longer recommend using this; use AWS's own solution instead. I'll leave it to someone from the team that runs our infra to recommend options, but we switched a while ago.


pingles avatar Mar 01 '24 22:03 pingles