error retrieving credentials: context canceled
We are observing the errors below when KIAM is serving a high volume of requests for a simple IAM role that lists S3 buckets. Error on the kiam server:
{"generation.metadata":0,"level":"error","msg":"error retrieving credentials: context canceled","pod.iam.requestedRole":"kiam-s3-full-access-test2","pod.iam.role":"kiam-s3-full-access-test2","pod.name":"kiam-test-api-7fd8ddcb9c-swhjj","pod.namespace":"default","pod.status.ip":"100.108.192.27","pod.status.phase":"Running","resource.version":"28288891","time":"2019-02-20T09:46:57Z"}
{"level":"error","msg":"error retrieving credentials in cache from future: context canceled. will delete","pod.iam.role":"kiam-s3-full-access-test2","time":"2019-02-20T09:53:21Z"}
Error on the kiam agent:
{"addr":"100.108.192.24:59340","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/kiam-s3-full-access-test2","status":500,"time":"2019-02-20T09:55:23Z"}
No connectivity issues were observed between the server and agent in the logs (we checked both the agent and server logs with the log level set to debug). We also tried with GRPC_GO_LOG_SEVERITY_LEVEL=info and GRPC_GO_LOG_VERBOSITY_LEVEL=8.
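For reference, this is roughly how those gRPC debug vars were applied (a sketch; set on both the kiam agent and server containers):

```yaml
# grpc-go reads these at startup; raising verbosity helps rule out
# transport problems between the agent and server.
env:
  - name: GRPC_GO_LOG_SEVERITY_LEVEL
    value: "info"
  - name: GRPC_GO_LOG_VERBOSITY_LEVEL
    value: "8"
```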
I see a relevant issue posted at https://github.com/uswitch/kiam/issues/145. Can someone please suggest which timeout has to be increased, or point out anything we are missing on the config side?
Thanks
Same issue here. I also added these vars:
```yaml
- name: AWS_METADATA_SERVICE_TIMEOUT
  value: "5"
- name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
  value: "20"
```
But the issue still persists (from time to time).
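They're set in the application container spec, since the AWS SDK inside the app reads them rather than kiam itself. A sketch (container name and image here are hypothetical):

```yaml
containers:
  - name: my-app            # hypothetical
    image: my-app:latest    # hypothetical
    env:
      - name: AWS_METADATA_SERVICE_TIMEOUT
        value: "5"
      - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
        value: "20"
```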
Could you set up the Prometheus dashboard and keep an eye on the metrics? From what you've reported it looks like the agent/server is slow in responding with credentials (having identified the role to be used by the pod), so my guess is that it's either a large influx of sts:AssumeRole requests (we measure around 500ms per call) or potentially some locking within the credentials cache that can't keep up with client expectations. Did you set those env vars for Kiam or within your client app?
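If it helps with setting that up, here's a sketch of a Prometheus scrape config for the server; the flag name and port are from memory, so verify them against your manifests:

```yaml
# Assumes the kiam server was started with something like
# --prometheus-listen-addr=0.0.0.0:9620 (flag name and port to verify).
scrape_configs:
  - job_name: kiam-server
    static_configs:
      - targets: ["kiam-server.kube-system.svc.cluster.local:9620"]
```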
I had similar issues with contexts being cancelled and keys not being returned successfully. I solved these context-cancelled errors by updating the resource limits in the daemonsets of both the server and the agent. It seems like kiam was simply being starved of resources (in our case the CPU requests were set to 5m).
For example:
```yaml
limits:
  cpu: 500m
  memory: 200Mi
requests:
  cpu: 100m
  memory: 100Mi
```
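In context, that block sits under the kiam container spec in each daemonset, e.g. for the agent (image tag illustrative):

```yaml
containers:
  - name: kiam-agent
    image: quay.io/uswitch/kiam:v3.2   # tag illustrative
    resources:
      limits:
        cpu: 500m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 100Mi
```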
@pingles I've set those vars for all my client apps. I also completely removed the limits for now, with no luck. I have a Grafana dashboard in place as well: I observe about 2.5K cache hits and about 50 cache misses on average.
I also just tried the kiam 3.2 local STS endpoint; it looks like it didn't work at all (see my questions in the Slack channel https://kubernetes.slack.com/messages/CBQLKVABH). Do you have any other suggestions?
Any thoughts so far, folks? I'm really stuck at this point and am going to fall back to kube2iam if I can't solve this issue.
@Mitrofanov, can you post your exact logs here? I used to have different issues; maybe I can help you.
I'm hitting this too. Out of our 200 services, the only one hitting it is a cronjob that does some pretty heavy S3 work. There's no sts:AssumeRole activity going on, though.
Facing the same issue
{"level":"debug","msg":"evicted credentials future had error: InvalidClientTokenId: The security token included in the request is invalid\n\tstatus code: 403, request id: 9ebc686e-b768-11e9-a5a1-d94beed402e5","pod.iam.role":"qa-james-kiam-testrole","time":"2019-08-05T10:05:57Z"}
{"generation.metadata":0,"level":"error","msg":"error retrieving credentials: context canceled","pod.iam.requestedRole":"qa-james-kiam-testrole","pod.iam.role":"qa-james-kiam-testrole","pod.name":"demo-8c4f9ffb-b9dl5","pod.namespace":"default","pod.status.ip":"10.244.0.149","pod.status.phase":"Running","resource.version":"328970","time":"2019-08-05T10:05:57Z"}```
I am facing a similar issue. I am attempting to scale a deployment from 2 to 40 pods; I have AWS_METADATA_SERVICE_TIMEOUT=5 and AWS_METADATA_SERVICE_NUM_ATTEMPTS=5, which did not make any difference. I see a flood of "error finding role for pod: rpc error: code = DeadlineExceeded desc = context deadline exceeded" in the logs.
Interesting. I run on EKS and was using interface=!eth15, as I found that recommended in a blog somewhere. I just swapped over to interface=!eth0, as the docs recommend for the VPC CNI, then scaled up and down from 1 to 50 multiple times, and there's not a single failure in sight.
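For anyone else on EKS, a sketch of where that lives in the agent daemonset; the flag is spelled --iface in the kiam manifests I've seen (verify for your version), and !eth0 means "every host interface except eth0":

```yaml
containers:
  - name: kiam-agent
    image: quay.io/uswitch/kiam:v3.2   # tag illustrative
    args:
      - agent
      - "--iface=!eth0"   # EKS / AWS VPC CNI
```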
Update: it turns out that using interface=eni+ also works fine for me. I did a few scale-ups to 100 pods with no issues. I may have to increase the max retries for the boto3 client, as I found 5 is perhaps not enough during the scale-up to 100 pods, but everything is behaving right now.
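On the boto3 side, one code-free way to raise retries is via the SDK's retry env vars on the application container. These are honored by newer botocore releases (older ones need botocore.config.Config(retries={"max_attempts": ...}) in code), so treat this as an assumption to verify for your SDK version:

```yaml
env:
  - name: AWS_RETRY_MODE      # honored by newer botocore releases
    value: "standard"
  - name: AWS_MAX_ATTEMPTS    # total attempts, including the first call
    value: "10"
```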
+1 I have this issue in a kops cluster on a container that is constantly polling SQS. It only happens once or twice every few hours.
{"stream":"stderr","time":"2019-10-11T03:03:48.24488907Z","kubernetes":{"container_name":"kiam","namespace_name":"kube-system","pod_name":"kiam-server-c5hpk","container_image":"quay.io/uswitch/kiam:v3.0","host":"ixx"},"level":"error","msg":"error requesting credentials: RequestCanceled: request context canceled\ncaused by: context canceled","pod.iam.role":"xx-IAMRole-xx","service":"kiam","pod_name":"kiam-server-c5hpk","environment":"exp","log_id":"eeeb7331-a390-431f-a326-40c5fb787152"}
{"stream":"stderr","time":"2019-10-11T03:03:48.2449395Z","kubernetes":{"container_name":"kiam","namespace_name":"kube-system","pod_name":"kiam-server-c5hpk","container_image":"quay.io/uswitch/kiam:v3.0","host":"ixxrnal"},"generation.metadata":0,"level":"error","msg":"error retrieving credentials: context canceled","pod.iam.requestedRole":"xx-IAMRole-xx","pod.iam.role":"xx-IAMRole-xx","pod.name":"queue-worker-bf8dc8c9c-kptrg","pod.namespace":"lapis","pod.status.ip":"1xx4","pod.status.phase":"Running","resource.version":"415668","service":"kiam","pod_name":"kiam-server-c5hpk","environment":"exp","log_id":"e49556a6-494b-4293-9df0-a61c59fcddb1"}
~~I figured I'd better update to the latest kiam version after posting this, which I've just done. I'll see if it continues to happen.~~
Can confirm, still happens on latest version.
Update @ 24 Oct: went back to using kube2iam. Haven't had any issues :( It's really unfortunate, I love the extra features of kiam.
Any update on this? We're having the same issue on our cluster. Kiam pods are not limited (QoS is BestEffort) and we're running 2 replicas of the kiam-server on dedicated nodes.
I had similar issues with contexts being cancelled and keys not being returned successfully. I solved these context-cancelled errors by updating the resource limits in the daemonsets of both the server and the agent. It seems like kiam was simply being starved of resources (in our case the CPU requests were set to 5m).
For example:
```yaml
limits:
  cpu: 500m
  memory: 200Mi
requests:
  cpu: 100m
  memory: 100Mi
```
This actually helped fix it. For our cluster, the kiam server needs ~200m CPU when it starts up.
This is an ongoing problem for us as well. One of our S3-heavy services consistently gets the error shown below. It usually seems to coincide with heavier loads, but not always:
kiam-agent-knxvb kiam {"addr":"100.96.3.154:58930","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/service","status":500,"time":"2021-09-30T11:15:36Z"}
We currently have these environment variables set on both the server and the agent:
```yaml
- name: AWS_METADATA_SERVICE_TIMEOUT
  value: "5"
- name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
  value: "20"
```
kiam-server resources:
```
Limits:
  cpu:     800m
  memory:  1000Mi
Requests:
  cpu:     200m
  memory:  250Mi
```
kiam-agent resources:
```
Limits:
  cpu:     500m
  memory:  200Mi
Requests:
  cpu:     100m
  memory:  100Mi
```
How else can we troubleshoot this issue?
Is there a working solution for this issue? We are seeing it intermittently.
time="2024-02-22T22:43:12Z" level=error msg="error requesting credentials: RequestCanceled: request context canceled\ncaused by: context canceled"
{"addr":"","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"","status":500,"time":"2024-02-22T22:43:12Z"}
@Hari-Krish-na this project is abandonware, unfortunately :/
Hi, we'd no longer recommend using kiam; we'd suggest AWS's own solution instead. I'll leave it to someone from the team that runs our infra to recommend options, but we switched a while ago.