kube2iam
Timing issue causing failures in Pods created by Jobs
As per #46, I'm seeing similar issues where requests to kube2iam frequently fail with a 404 soon after a Job's pods are created. I'm able to reliably reproduce the issue with a fairly simple Job definition in a couple of different clusters running Kubernetes v1.8.7.
I've tried a number of different kube2iam versions (v0.8.2, v0.8.4, v0.10.0), and all exhibit similar behaviour.
Adding a delay at the beginning of the job's execution can help, but even a 5-second sleep sometimes isn't enough to get around the issue:
(Real IAM role ARN replaced to protect the innocent.)
Job definition:
apiVersion: batch/v1
kind: Job
metadata:
  name: iam-role-test
spec:
  completions: 50
  parallelism: 5
  backoffLimit: 2
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName
    spec:
      restartPolicy: Never
      containers:
      - name: test
        image: governmentpaas/curl-ssl
        command:
        - sh
        - -c
        - "curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName"
Individual failed Pod log output appears as follows (curl -v is messing with the formatting, but you get the idea):
* Trying 169.254.169.254...
* TCP_NODELAY set
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to 169.254.169.254 (169.254.169.254) port 80 (#0)
> GET /latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName HTTP/1.1
> Host: 169.254.169.254
> User-Agent: curl/7.55.0
> Accept: application/json
>
* The requested URL returned error: 404 Not Found
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Closing connection 0
curl: (22) The requested URL returned error: 404 Not Found
Some of the 50 job executions (the success/failure rate seems to be fairly random) fail with the curl command simply receiving a 404 Not Found as described above. Adding a sleep 5 prior to the curl command fixes the issue most of the time.
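For reference, the workaround currently looks roughly like this (a sketch of the modified container command; the role ARN is still a placeholder):

      containers:
      - name: test
        image: governmentpaas/curl-ssl
        command:
        - sh
        - -c
        # Give kube2iam a few seconds to discover the new pod and its role annotation
        # before hitting the metadata endpoint; 5 seconds is usually, but not always, enough.
        - "sleep 5 && curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName"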
I've enabled the --log-level=debug and --debug options, and the relevant log entries that mention the role in question are listed in this gist: https://gist.github.com/damomurf/30468bfc1bd595720cb3c9e44946bc19
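In case it helps anyone reproduce the logging, the flags are passed as container args on the kube2iam DaemonSet; a rough sketch (the image tag and the other args shown are illustrative and will depend on your setup):

      containers:
      - name: kube2iam
        image: jtblin/kube2iam:0.10.0    # one of the versions tested above
        args:
        - --iptables=true                # illustrative; matches a typical kube2iam setup
        - --host-ip=$(HOST_IP)           # assumed to be set from the downward API
        - --host-interface=cni0          # depends on the CNI plugin in use
        - --log-level=debug              # the verbose logging used for the gist above
        - --debug                        # the --debug option mentioned above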
Hopefully this provides sufficient detail on the issue, as requested in #46.
One other fact that may be helpful: the equivalent deployment (with a while loop to keep invoking the curl command and keep the pods running) does not exhibit the same behaviour:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: iam-role-test-deploy
spec:
  replicas: 50
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName
      labels:
        app: iam-role-test
    spec:
      containers:
      - name: test
        image: governmentpaas/curl-ssl
        command:
        - /bin/sh
        - -c
        - "while [ true ]; do curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName; sleep 60; done"
We are seeing very similar behaviour in a CronJob. Time to delve deep I guess.
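For reference, a minimal sketch of the kind of CronJob involved (name, schedule and role ARN are placeholders); note the annotation sits on the pod template nested under jobTemplate:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: iam-role-test-cron
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: governmentpaas/curl-ssl
            command:
            - sh
            - -c
            - "curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName"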
Observing the same problem here, but with pods in Deployments: every Deployment that starts by downloading something from S3 via the aws-cli fails on its first start; after the first restart the pods come up fine.
Common to all of them is a 404 in the kube2iam logs and the error:
fatal error: Unable to locate credentials
kube2iam 0.10.0, Kubernetes 1.9.6, kops 1.9.0
I've posted more info in https://github.com/jtblin/kube2iam/issues/122
I can reproduce this issue 99% of the time with a workload that generates a lot of pods (20-ish) all at once with kube2iam annotations. If I make the pod sleep for 30 seconds before trying to use my IAM role, it seems to work around the problem.
When I only spin up a few pods at a time, I don't see the problem.
We also see similar problems with all applications that, for example, want to read from S3 at startup. Our workaround is to use an initContainer that tries to access the resource first (roughly sketched below). The nature of a distributed system like Kubernetes is to have race conditions all over the place; the question is how to tackle them in general. Prefetching the credentials from AWS would be one way, but I am not sure what kube2iam is doing.
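Roughly, the initContainer workaround looks like this (a sketch; the polling loop and timeout are illustrative, and the role ARN is whatever the pod is annotated with):

      initContainers:
      - name: wait-for-iam
        image: governmentpaas/curl-ssl
        command:
        - sh
        - -c
        # Poll the kube2iam-intercepted metadata endpoint until credentials for
        # the annotated role are served, or give up after roughly 60 seconds.
        - |
          for i in $(seq 1 30); do
            curl -sf http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName > /dev/null && exit 0
            sleep 2
          done
          echo "timed out waiting for kube2iam credentials" >&2
          exit 1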
We found this article, which does a good job of explaining the problem and talks about the kiam workaround:
https://medium.com/@pingles/kiam-iterating-for-security-and-reliability-5e793ab93ec3
#132 is indicating to our org that maybe we should look for alternatives. We plan on trying out kiam somewhat soon to see if it helps with this problem.
I've found that postStart hooks in k8s make this problem worse with kube2iam, since the pod won't go into a Running state until the postStart hook has executed.
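To illustrate (a sketch; the image and the S3 copy are just placeholders), anything like this runs before the pod is reported Running, so it hits the same race even harder:

      containers:
      - name: app
        image: myorg/app:latest          # placeholder image
        lifecycle:
          postStart:
            exec:
              command:
              - sh
              - -c
              # Runs right after the container starts, before the pod is reported
              # Running; if this needs AWS credentials via kube2iam it tends to hit
              # the same race and fail with a 404 / "Unable to locate credentials".
              - "aws s3 cp s3://my-bucket/config /etc/app/config"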
FYI: a discussion has been started in sig-aws about finding a common solution to this problem. See https://docs.google.com/document/d/1rn-v2TNH9k4Oz-VuaueP77ANE5p-5Ua89obK2JaArfg/edit?disco=AAAAB6DI_qM&ts=5b19085a for comparisons between existing projects.
For clarity, @mikkeloscar is referring to a common solution for implementing and supporting IAM in Kubernetes on AWS (EKS or self-run), not specifically this issue around Jobs and their race condition.
Yup. The point of my comment back then was that it's very hard to fix these sorts of race conditions with the architecture of kube2iam. The discussion was started to approach the problem differently.
I've been working on a proof of concept (https://github.com/mikkeloscar/kube-aws-iam-controller) to eliminate all of these kinds of race conditions. Since the AWS SDKs handle credentials differently, it currently doesn't work for all of them. It works for Python and Java for now, and I'm working with AWS to add support for Go as well (https://github.com/aws/aws-sdk-go/issues/1993).
Any update to this issue?