
Ensuring the KIAM Endpoint is ready when a pod is scheduled

nirnanaaa opened this issue 5 years ago · 3 comments

Hey, so we're running KIAM 3 in our cluster. Lately we noticed that the AWS metadata endpoint is not immediately available after pod startup: it takes about 1-2s for the endpoint http://169.254.169.254/latest/meta-data/iam/security-credentials/<role_id> to become ready. For quick-starting pods that try to use this endpoint immediately after startup, this is an issue.

I thought this might be down to the pod cache taking some time (just a few ms) to get notified of the pod creation. What do you think?

/cc: @mseiwald @boynux

nirnanaaa avatar Jul 31 '19 14:07 nirnanaaa

The servers currently won't go healthy until the pod caches have been filled:

https://github.com/uswitch/kiam/blob/master/pkg/k8s/pod_cache.go#L170

Of course, once they're running it's possible that a pod could request credentials before the watcher has delivered the notification. However, Kiam deliberately prefetches credentials and tracks metadata as soon as pods are pending and have an IP, which I'd hope is mostly before the container can execute and request credentials.

Maybe it's worth checking other metrics to ensure your API servers etc. aren't being overwhelmed or that Kiam servers aren't being throttled.

There are also metrics that track whether pods aren't found when requested, so it may be worth watching those to understand what's happening.

pingles avatar Aug 12 '19 12:08 pingles

We've already added metrics and can't really see this issue anymore. It was just really obvious when you initialize a Node.js application and fetch something from AWS immediately after startup.

We're not sure this is related to the server, since at the same time we also used taints to delay pod scheduling until the KIAM agent is ready on the node.

nirnanaaa avatar Aug 13 '19 07:08 nirnanaaa

@pingles sorry for the late response. This got somewhat lost on our end. Since my last comment we've updated to v3.4, but we are still facing problems when rolling out both groups of server nodes and agent nodes simultaneously. Our servers are started as Deployments, so the gRPC server's SIGTERM hook should be respected, right?

We've also found a significant increase in kiam_metadata_find_role_errors_total (going from zero to >50) during exactly the time of the rollout, which is also reflected in pods entering a crash-loop state.

Also, even when not rolling out both groups at the same time, we can see loads of context canceled log messages inside the agents:

{"addr":"xxx.xxx.xxx.xxx:58608","level":"error","method":"GET","msg":"error processing request: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-10-28T07:23:34Z"}

{"addr":"xxx.xxx.xxx.xxx:58608","duration":1001,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-10-28T07:23:34Z"}

Do you have any suggestions on how to debug this further, or what measures could be taken to find the root cause of this problem?

This might also very well be related to https://github.com/uswitch/kiam/issues/217

nirnanaaa avatar Oct 28 '19 06:10 nirnanaaa