kube2iam
Interface conversion fails, causing pods not to get credentials.
Sometimes, pods in my cluster fail with errors like "fatal error: Unable to locate credentials". Checking the logs of the kube2iam instance on the same node, I see entries like this:
time="2018-11-20T18:40:26Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=172.19.97.48 res.status=500
This occurs on an EKS cluster running Kubernetes 1.10.3.
I've seen this same issue on Kubernetes 1.10.9 on a kops-provisioned cluster, using the latest 0.10.4 image. It's very intermittent; I haven't been able to create a guaranteed repro case.
There are only 2 cases of the *v1.Pod type assertion: https://github.com/jtblin/kube2iam/blob/master/k8s/k8s.go#L100-L106
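For context, here is a minimal sketch of that pattern, assuming pods are read back out of a client-go indexer keyed by IP (the index name and helper are illustrative assumptions, not the actual kube2iam source). A bare obj.(*v1.Pod) assertion panics when the stored item is nil or of another type; the comma-ok form below would log the bad entry and keep going.

```go
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// findPodByIP is a hypothetical helper illustrating the pattern at the linked
// lines: items come back from a client-go indexer as interface{} and must be
// asserted to *v1.Pod before use.
func findPodByIP(indexer cache.Indexer, ip string) (*v1.Pod, error) {
	objs, err := indexer.ByIndex("byIP", ip) // "byIP" is an assumed index name
	if err != nil {
		return nil, err
	}
	for i, obj := range objs {
		pod, ok := obj.(*v1.Pod)
		if !ok {
			// Comma-ok assertion: log and skip instead of panicking.
			fmt.Printf("pods[%d] is not *v1.Pod! %v\n", i, obj)
			continue
		}
		if pod.Status.PodIP == ip {
			return pod, nil
		}
	}
	return nil, fmt.Errorf("no pod found for IP %s", ip)
}
```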
The k8s.io/client-go library is pinned to a version over 20 months old, and there have been many major releases since then. Perhaps it's worth updating?
We have seen this error as well and have upvoted to have this pulled in.
Getting this on EKS 1.11.5.
I will look at updating tomorrow.
@jrnt30 Can we make sure to get a new build too? We will start experimenting whenever you release.
Rolled back to 0.10.4 and the error disappeared. "I won't be using latest, I won't be using latest" (repeat 10 times) :)
You only started seeing this in 0.10.6? Logs would help. Are the failures for running or deleted pods?
@jrnt30 Yes, I used the "latest" tag for the Docker image. This only affected pods on nodes that were created after the 0.10.6 release. Then I set the tag to 0.10.4 and the error disappeared.
I was getting the following when running aws sts get-caller-identity:
Unable to locate credentials. You can configure credentials by running "aws configure".
We were on 0.10.4 and were having the error where a pod could not get creds after termination. After the update to 0.10.6 that went away for the application that has very variable scaling, and it has been very stable since. Another application, which uses Kubernetes jobs on static non-autoscaling boxes, was having intermittent
"unable to sign request without credentials set"
errors, with this error in kube2iam (we had seen this in 0.10.4 as well):
{"level":"error","msg":"PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod","req.method":"GET","req.path":"/latest/meta-data/iam/security-credentials/","req.remote":"100.96.192.4","res.status":500,"time":"2019-01-26T21:30:33Z"}.
In our case this would not be related to the deleted-pods fix in 0.10.6. Running with a kops deployment on v1.10.11.
Rolled back to 0.10.4 and the error disappeared. "I won't be using latest, I won't be using latest" (repeat 10 times) :)
Found out the same today, thanks to your comment I realized that I did the same.
Saw this as well in our EKS.
Ditto. EKS 1.10.11
Also seeing this on EKS Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-eks-6bad6d", GitCommit:"6bad6d9c768dc0864dab48a11653aa53b5a47043", GitTreeState:"clean", BuildDate:"2018-12-06T23:13:14Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}; rolling back to 0.10.4 fixes it.
The same on AWS EKS Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.11-eks", GitCommit:"6bf27214b7e3e1e47dce27dcbd73ee1b27adadd0", GitTreeState:"clean", BuildDate:"2018-12-04T13:33:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Did we fix this in the latest image? We are seeing this in EKS.
Apologies for the delay; I have not provisioned, nor do I currently have, a cluster where I have reproduced this. Is this behavior emerging JUST from the 0.10.6 upgrade, or was there also a cluster upgrade to 1.11 from a previous version?
Can anyone send a set of debug-level logs that have this issue captured? What is not totally clear is whether this interface conversion is occurring for newly deleted pods, new pods coming into the cluster, etc.
The PR that may be causing some of this, #173, changed the semantics of what an active pod is. There was some work done to:
- Update the K8s Go client library
- Add debugging statements to potentially identify where/why the conversion is failing
Simply adjusting these and "hoping" that this fixes it seems somewhat lazy and dangerous to me, so any additional information we could get about the state of the cluster from people actually running into the issue would be extremely helpful.
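For reference, the debugging statements mentioned above would presumably live in the pod informer's event handlers. A rough sketch (assumed wiring, not the exact kube2iam code) that would emit lines shaped like the Pod OnAdd / OnUpdate / OnDelete entries quoted later in this thread:

```go
package sketch

import (
	log "github.com/sirupsen/logrus"
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// podEventLogger returns debug-level handlers to register on the pod
// informer; the log fields mirror the "Pod OnUpdate" style entries seen in
// the logs below, but the exact field set here is an assumption.
func podEventLogger() cache.ResourceEventHandlerFuncs {
	logPod := func(event string, obj interface{}) {
		pod, ok := obj.(*v1.Pod)
		if !ok {
			log.Errorf("%s received a non-pod object: %T", event, obj)
			return
		}
		log.WithFields(log.Fields{
			"pod.name":         pod.Name,
			"pod.namespace":    pod.Namespace,
			"pod.status.ip":    pod.Status.PodIP,
			"pod.status.phase": pod.Status.Phase,
		}).Debugf("Pod %s", event)
	}
	return cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { logPod("OnAdd", obj) },
		UpdateFunc: func(oldObj, newObj interface{}) { logPod("OnUpdate", newObj) },
		DeleteFunc: func(obj interface{}) { logPod("OnDelete", obj) },
	}
}
```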
I released an RC tag to Docker Hub as 0.10.7-rc2 with the changes referenced in #189 (a merge of several different efforts). As noted in my previous comment, I am unsure if this will address our interface conversion errors; they seem to be happening intermittently on 0.10.4 and more frequently for EKS users on 0.10.6. If we could get a few people to run the RC version to determine if it alleviates the issue, that would help determine the best path forward.
I've just run the RC version and am still seeing the intermittent error:
botocore.exceptions.NoCredentialsError: Unable to locate credentials
time="2019-02-05T15:46:25Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:25Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:26Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:26Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-hxvqg pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:26Z" level=debug msg="Pod OnDelete" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-hxvqg pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:26Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:27Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:27Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:28Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:28Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2173819697.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.173819697e+09 res.status=404
time="2019-02-05T15:46:29Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:29Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:30Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:30Z" level=debug msg="Pod OnUpdate" pod.iam.role= pod.name=kube-downscaler-1549307100-rnj4x pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Pending
time="2019-02-05T15:46:31Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:31Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:31Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:31Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2207612285.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.207612285e+09 res.status=404
time="2019-02-05T15:46:44Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:46Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:46Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:46Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:47Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:47Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:48Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:48Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2023309635.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.023309635e+09 res.status=404```
I'm at your disposal for further information.
Thanks for taking the time to collect those logs! If you have found a way to replicate this, sharing it would be very much appreciated.
A few questions:
- What platform and OS are you running into this on?
- Would you mind masking the pod IPs in a way that we can see a relationship between the log entries?
- Are the pods that are failing "stable" or are they undergoing lifecycle changes?
- Are all of these GETs for the same pod?
- Are the Pod OnUpdate statements related to that pod?
Also, if you have the /debug endpoint enabled, it might help us cross-reference the issue.
I will try and dig into the pod indexer logic to see if I can come up with anything, but realistically it will be a few days.
Thanks again!
- What platform and OS are you running into this on? Kubernetes v1.10.12, Debian GNU/Linux 9 (stretch), 4.9.0-8-amd64, docker://17.3.2, provisioned with kops 1.11.
- Would you mind masking the pod IPs in a way that we can see a relationship between the log entries?
kube2iam's log when not working:
time="2019-02-05T21:34:43Z" level=debug msg="Pod OnAdd" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T21:34:43Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T21:34:45Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T21:34:51Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T21:34:53Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2653312418.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=2.653312418e+09 res.status=404
my pod's log when not working:
kubectl logs -n default -l run=refresh-creds -p
2019-02-05 21:34:45 INFO Loading configuration from cluster: kubernetes.default
2019-02-05 21:34:51 INFO doing something in a loop ...
2019-02-05 21:34:51 INFO Test logging #1
2019-02-05 21:34:52 INFO Client STS #1: <botocore.client.STS object at 0x7f1c1c36efd0>
2019-02-05 21:34:52 INFO ending loop ...
Traceback (most recent call last):
File "/app/main.py", line 246, in <module>
loop_controller()
File "/app/main.py", line 238, in loop_controller
raise e
File "/app/main.py", line 220, in loop_controller
docker, expire_at = ecr_login()
File "/app/main.py", line 85, in ecr_login
session = assume_role(ROLE_ARN, "default") # type: Session
File "/app/main.py", line 67, in assume_role
account_id = clt.get_caller_identity()["Account"]
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 648, in _make_api_call
operation_model, request_dict, request_context)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 667, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 132, in _send_request
request = self.create_request(request_dict, operation_model)
File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in create_request
operation_name=operation_model.name)
File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 90, in handler
return self.sign(operation_name, request)
File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 157, in sign
auth.add_auth(request)
File "/usr/local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
debug/store endpoint when not working:
curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
"namespaceByIP": {},
"rolesByIP": {},
"rolesByNamespace": {
"222-iop-correction-qual-ksl": null,
.....
"weave": null
}
}
kube2iam log when working:
time="2019-02-05T18:00:13Z" level=debug msg="Pod OnAdd" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T18:00:13Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T18:00:16Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:00:22Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 36243.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=36243 res.status=200
time="2019-02-05T18:00:22Z" level=debug msg="retrieved credentials from sts endpoint: sts.amazonaws.com" ns.name=default pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1
time="2019-02-05T18:00:22Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-role (200) took 427731917.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1 res.duration=4.27731917e+08 res.status=200
time="2019-02-05T18:00:36Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-bmmzw pod.namespace=default pod.status.ip=XXX.YYY.0.3 pod.status.phase=Running
time="2019-02-05T18:00:36Z" level=debug msg="Pod OnDelete" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-bmmzw pod.namespace=default pod.status.ip=XXX.YYY.0.3 pod.status.phase=Running
time="2019-02-05T18:01:50Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:01:52Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:01:58Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 32940.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=32940 res.status=200
time="2019-02-05T18:01:58Z" level=debug msg="retrieved credentials from sts endpoint: sts.amazonaws.com" ns.name=default pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1
time="2019-02-05T18:01:58Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-role (200) took 75346.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1 res.duration=75346 res.status=200
my pod log when working:
kubectl logs -n default -l run=refresh-creds -p
2019-02-05 18:05:42 INFO Loading configuration from cluster: kubernetes.default
2019-02-05 18:05:48 INFO doing something in a loop ...
2019-02-05 18:05:48 INFO Test logging #1
2019-02-05 18:05:48 INFO Client STS #1: <botocore.client.STS object at 0x7fd671f182b0>
2019-02-05 18:06:13 INFO Account ID #1: ACCOUNT_ID
2019-02-05 18:06:39 INFO Account ID #2: ACCOUNT_ID_2
2019-02-05 18:06:39 INFO ECR CLIENT: <botocore.client.ECR object at 0x7fd671d7e240>
debug/store endpoint when working:
curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
"namespaceByIP": {
"XXX.YYY.0.1": "default"
},
"rolesByIP": {
"XXX.YYY.0.1": "arn:aws:iam::ACCOUNT_ID:role/my-role"
},
"rolesByNamespace": {
"222-iop-correction-qual-ksl": null,
....
"weave": null
}
}
- Are the pods that are failing "stable" or are they undergoing lifecycle changes? They are transitioning from Pending to Running, but it also seems to fail when they are running.
- Are all of these GETs for the same pod? The logs in this comment are captured from a tainted node. There are only a few daemonsets running on this node, and only my app's pod is running with a kube2iam role annotation.
- Are the Pod OnUpdate statements related to that pod? Yes, for this comment.
I'm getting the same error on 2 different EKS clusters. I had this error on k8s nodes 1.10.3 and now 1.10.11. This seems to work fine for some period of time and then all newly created pods fail, i.e. after a node has been running for 24 hours or so, newly fetched credentials fail.
We had one incident today with 0.10.6, so we are downgrading back to 0.10.4. Running EKS with 1.10.3.
I have an entrypoint script in the pod which polls the kube2iam metadata API until the credentials are associated and only then executes the actual pod command. In our single case this bootstrap script never finished, so from my point of view the bug was there right from pod startup.
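For reference, a minimal sketch of that kind of wait-for-credentials loop, written in Go only to match the other snippets here (the commenter's actual entrypoint is a shell script, and the timeout and retry interval below are assumptions):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// waitForRole polls the kube2iam metadata endpoint until it returns a role
// name for this pod, roughly what the entrypoint script described above does
// before exec'ing the real command.
func waitForRole(timeout time.Duration) (string, error) {
	const url = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK && len(body) > 0 {
				return string(body), nil // role associated with this pod
			}
		}
		time.Sleep(2 * time.Second)
	}
	return "", fmt.Errorf("no role associated within %s", timeout)
}

func main() {
	role, err := waitForRole(2 * time.Minute)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("credentials available for role:", role)
	// a real entrypoint would exec the actual pod command here
}
```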
I've implemented test self-destruction logic in my application pod (for testing): if it's not able to get credentials, it self-destructs. Sometimes this works... but most of the time it only works right after the kube2iam pod itself. Unfortunately, I have to consider switching to kiam.
We've begun noticing issues with 0.10.6 as well. Downgrading to 0.10.4 seems to work.
Can also confirm seeing this issue in EKS; going from tag:latest to tag:0.10.4 100% fixed this for us.
Just experienced this on kube2iam 0.10.6, NOT running in EKS but with a kops cluster using weave.
Most of our deployments were okay, but two of them were getting 500s from kube2iam, and the kube2iam logs had a similar message for these 2 pods. What I noticed is that these are our deployments that use the Recreate strategy with only 1 replica, which happens to keep the same pod IP address on the new pods. Is there a chance the pod IP lookup is broken? We ran for almost 10 minutes this way, and I was able to recover by deleting the kube2iam pods on the node of my affected pods (which would refresh whatever pod cache exists). Then I downgraded kube2iam to 0.10.4 and did another release without issues.
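That Recreate observation is a plausible lead if kube2iam keys its pod cache by IP. Purely as an illustration (an assumption about the indexing, not the actual kube2iam code), an IP-keyed index function looks roughly like the sketch below; a delete event for the old pod that is processed after the add for its same-IP replacement could leave whatever is stored under that IP empty or stale:

```go
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// podIPIndexFunc is a sketch of a cache.IndexFunc keyed on the pod IP. With a
// Recreate deployment the replacement pod can come up with the same IP as the
// pod it replaced, so the ordering of the old pod's delete event relative to
// the new pod's add event matters for what ends up stored under that key.
func podIPIndexFunc(obj interface{}) ([]string, error) {
	pod, ok := obj.(*v1.Pod)
	if !ok {
		return nil, fmt.Errorf("expected *v1.Pod, got %T", obj)
	}
	if pod.Status.PodIP == "" {
		return nil, nil // no IP assigned yet; nothing to index
	}
	return []string{pod.Status.PodIP}, nil
}
```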
I released a 0.10.7-rc3 tag that reverts the logic introduced in #173, which I believe is the main culprit.
The observation about the Recreate strategy may be a good lead and something to try; however, I am at a wedding this weekend and unable to invest enough time to test those scenarios (forgot my laptop charger). I'm going to push the next RC release to see if that resolves some people's issues.
I have the same error with 0.10.7-rc3. I can't even get it to work intermittently.
I've noticed this error:
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
pod with specificed IP not found
When enabling verbose DEBUG logging in my application, I noticed a socket timeout:
2019-02-09 10:51:52 DEBUG Looking for credentials via: boto-config
2019-02-09 10:51:52 DEBUG Looking for credentials via: container-role
2019-02-09 10:51:52 DEBUG Looking for credentials via: iam-role
2019-02-09 10:51:53 DEBUG Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/meta-data/iam/security-credentials/: Read timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.7/http/client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/botocore/httpsession.py", line 258, in send
decode_content=False,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 343, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: AWSHTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=1)
During handling of the above exception, another exception occurred:
The response time is pretty high:
time curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
pod with specificed IP not found
real 0m2.278s
user 0m0.004s
sys 0m0.000s
And the debug/store:
curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
"namespaceByIP": {},
"rolesByIP": {},
"rolesByNamespace": {
But, with kube2iam version 0.10.4:
The debug/store is correctly populated:
{
"namespaceByIP": {
"100.114.0.1": "default",
"100.114.0.2": "default"
},
"rolesByIP": {
"100.114.0.1": "my-role",
"100.114.0.2": ""
},
"rolesByNamespace": {
My app retrieves the role's creds:
2019-02-09 10:56:52 DEBUG Found credentials from IAM Role: my-role
2019-02-09 10:56:52 DEBUG Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/endpoints.json
2019-02-09 10:56:52 DEBUG Event choose-service-name: calling handler <function handle_service_name_alias at 0x7f537d039598>
2019-02-09 10:56:52 DEBUG Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/sts/2011-06-15/service-2.json
2019-02-09 10:56:52 DEBUG Event creating-client-class.sts: calling handler <function add_generate_presigned_url at 0x7f537d072e18>
2019-02-09 10:56:52 DEBUG The s3 config key is not a dictionary type, ignoring its value of: None
2019-02-09 10:56:52 DEBUG Setting sts timeout as (60, 60)
2019-02-09 10:56:52 DEBUG Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/_retry.json
2019-02-09 10:56:52 DEBUG Registering retry handlers for service: sts
2019-02-09 10:56:52 INFO Client STS #1: <botocore.client.STS object at 0x7f537b11c198>
2019-02-09 10:56:52 DEBUG Event before-parameter-build.sts.GetCallerIdentity: calling handler <function generate_idempotent_uuid at 0x7f537d055b70>
2019-02-09 10:56:52 DEBUG Event before-call.sts.GetCallerIdentity: calling handler <function inject_api_version_header_if_needed at 0x7f537cfd9a60>
2019-02-09 10:56:52 DEBUG Making request for OperationModel(name=GetCallerIdentity) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.9.86 Python/3.7.2 Linux/4.9.0-8-amd64 Botocore/1.12.86'}, 'body': {'Action': 'GetCallerIdentity', 'Version': '2011-06-15'}, 'url': 'https://sts.amazonaws.com/', 'context': {'client_region': 'aws-global', 'client_config': <botocore.config.Config object at 0x7f537b11c470>, 'has_streaming_input': False, 'auth_type': None}}
2019-02-09 10:56:52 DEBUG Event request-created.sts.GetCallerIdentity: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7f537b11c438>>
2019-02-09 10:56:52 DEBUG Event choose-signer.sts.GetCallerIdentity: calling handler <function set_operation_specific_signer at 0x7f537d055a60>
2019-02-09 10:56:52 DEBUG Calculating signature using v4 auth.
2019-02-09 10:56:52 DEBUG CanonicalRequest:
POST
Saw this today on EKS GitVersion:"v1.10.11-eks". The kube2iam pod was running 0.10.4. I killed the pod, and the daemonset brought another back, this time with 0.10.6, and it is doing what it is supposed to. All the other worker nodes running kube2iam are "older" than the one that had the issue. It almost feels like AWS is ripping something out from under kube2iam and it gets into this state?
We are experiencing this issue as well: EKS 1.10.11, aws-vpc-cni-k8s 1.30. 0.10.4 seems to resolve the issue; it occurred on 0.10.6.