Interface conversion fails, causing pods not to get credentials.

Open · kinghajj opened this issue 6 years ago • 49 comments

Sometimes, pods in my cluster fail with errors like "fatal error: Unable to locate credentials". Checking the logs of the kube2iam instance on the same node, I see entries like this:

time="2018-11-20T18:40:26Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=172.19.97.48 res.status=500

This occurs on an EKS cluster running Kubernetes 1.10.3.

kinghajj avatar Nov 20 '18 19:11 kinghajj

I've seen this same issue on Kubernetes 1.10.9 on a Kops-provisioned cluster, using the latest 0.10.4 image. It's very intermittent; I haven't been able to create a reliable repro case.

There are only two *v1.Pod type assertions in the codebase: https://github.com/jtblin/kube2iam/blob/master/k8s/k8s.go#L100-L106
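
For context, those assertions cast whatever the pod indexer returns directly to *v1.Pod, which panics if the stored value is nil. A hedged sketch of the pattern, including the comma-ok form that would turn the panic into an error (the podByIP helper and the "byIP" index name are illustrative assumptions, not the actual kube2iam source):

```go
package k8s

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// podByIP is a hypothetical helper, not the real kube2iam function.
func podByIP(indexer cache.Indexer, ip string) (*v1.Pod, error) {
	objs, err := indexer.ByIndex("byIP", ip) // index name is an assumption
	if err != nil {
		return nil, err
	}
	for _, obj := range objs {
		// A bare obj.(*v1.Pod) here is what panics with
		// "interface conversion: interface {} is nil, not *v1.Pod".
		pod, ok := obj.(*v1.Pod)
		if !ok {
			return nil, fmt.Errorf("indexer returned %T, not *v1.Pod", obj)
		}
		return pod, nil
	}
	return nil, fmt.Errorf("no pod found for IP %s", ip)
}
```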

The k8s.io/client-go library is pinned to a version over 20 months old, and there have been many major releases since then. Perhaps it's worth updating?

rifelpet avatar Nov 20 '18 19:11 rifelpet

We have seen this error as well, and have upvoted to get a fix pulled in.

tmedford avatar Jan 26 '19 19:01 tmedford

Getting this on EKS 1.11.5.

admssa avatar Jan 26 '19 20:01 admssa

I will look at updating tomorrow

jrnt30 avatar Jan 26 '19 21:01 jrnt30

@jrnt30 Can we make sure to get a new build too? We will start experimenting whenever you release.

tmedford avatar Jan 26 '19 21:01 tmedford

Rolled back to 0.10.4 and the error disappeared. "I won't be using latest, I won't be using latest" (repeat 10 times) :)

admssa avatar Jan 26 '19 21:01 admssa

You only started seeing this in 0.10.6? Logs would help. Are the failures for running pods or deleted pods?

jrnt30 avatar Jan 26 '19 21:01 jrnt30

@jrnt30 Yes, I was using the "latest" tag for the Docker image. This only affected pods on nodes created after the 0.10.6 release. Once I set the tag to 0.10.4, the error disappeared.
I was getting:

aws sts get-caller-identity

Unable to locate credentials. You can configure credentials by running "aws configure".

admssa avatar Jan 26 '19 22:01 admssa

We were on 0.10.4 and were hitting the error where a pod could not get creds after termination. After the update to 0.10.6 that went away for the application with very variable scaling, and it has been very stable since. Another application, which uses Kubernetes jobs on static non-autoscaling boxes, was having intermittent "unable to sign request without credentials set" errors together with the error in kube2iam. We had seen that on 0.10.4 as well, though.

{"level":"error","msg":"PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod","req.method":"GET","req.path":"/latest/meta-data/iam/security-credentials/","req.remote":"100.96.192.4","res.status":500,"time":"2019-01-26T21:30:33Z"}.

In our case this would not be related to deleted pods, given the fix in 0.10.6. Running a kops deployment on v1.10.11.

tmedford avatar Jan 26 '19 22:01 tmedford

Rolled back to 0.10.4 and the error disappeared. "I won't be using latest, I won't be using latest" (repeat 10 times) :)

Found out the same today; thanks to your comment I realized I had done the same thing.

hgranillo avatar Jan 30 '19 00:01 hgranillo

Saw this as well in our EKS cluster.

lumixen avatar Jan 30 '19 20:01 lumixen

Ditto. EKS 1.10.11

thealmightygrant avatar Jan 31 '19 01:01 thealmightygrant

Also seeing this on EKS, Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-eks-6bad6d", GitCommit:"6bad6d9c768dc0864dab48a11653aa53b5a47043", GitTreeState:"clean", BuildDate:"2018-12-06T23:13:14Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}; rolling back to 0.10.4 fixes it.

cnelson avatar Feb 01 '19 17:02 cnelson

The same on AWS EKS Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.11-eks", GitCommit:"6bf27214b7e3e1e47dce27dcbd73ee1b27adadd0", GitTreeState:"clean", BuildDate:"2018-12-04T13:33:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

yevhenvolchenko avatar Feb 04 '19 12:02 yevhenvolchenko

Has this been fixed in the latest image? We are seeing this in EKS.

tprakash17 avatar Feb 04 '19 12:02 tprakash17

Apologies for the delay; I don't currently have a cluster provisioned where I've been able to reproduce this. Is this behavior emerging JUST from the 0.10.6 upgrade, or was there also a cluster upgrade to 1.11 from a previous version?

Can anyone send a set of debug-level logs with this issue captured? What is not totally clear is whether this interface conversion is occurring for newly deleted pods, new pods coming into the cluster, etc.

The PR that may be causing some of this, #173, changed the semantics of what an active pod is. There has been some work done to:

  • Update the k8s Go client library
  • Add debugging statements to help identify where/why the conversion is failing (a sketch of that kind of logging is included below)

Simply adjusting these and "hoping" that this fixes it seems somewhat lazy and dangerous to me, so any additional information we can get about the state of the cluster from people actually running into the issue would be extremely helpful.
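
For anyone reading the debug output further down the thread, the "Pod OnAdd/OnUpdate/OnDelete" lines come from informer event handlers. A hedged sketch of what debug logging of that shape could look like (field names and logger wiring are illustrative, not the project's actual handler code):

```go
package k8s

import (
	log "github.com/sirupsen/logrus"
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// newPodHandler returns handlers that log each pod event at debug level and
// surface bad cache entries instead of panicking on a type assertion.
func newPodHandler() cache.ResourceEventHandlerFuncs {
	logPod := func(event string, obj interface{}) {
		pod, ok := obj.(*v1.Pod) // comma-ok: a nil or foreign entry becomes a log line, not a panic
		if !ok {
			log.Errorf("Pod %s: expected *v1.Pod, got %T", event, obj)
			return
		}
		log.WithFields(log.Fields{
			"pod.name":         pod.GetName(),
			"pod.namespace":    pod.GetNamespace(),
			"pod.status.ip":    pod.Status.PodIP,
			"pod.status.phase": string(pod.Status.Phase),
		}).Debugf("Pod %s", event)
	}
	return cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { logPod("OnAdd", obj) },
		UpdateFunc: func(_, newObj interface{}) { logPod("OnUpdate", newObj) },
		DeleteFunc: func(obj interface{}) { logPod("OnDelete", obj) },
	}
}
```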

jrnt30 avatar Feb 05 '19 03:02 jrnt30

I released an RC tag to Docker Hub as 0.10.7-rc2 with the changes referenced in #189 (a merge of several different efforts).

As noted in my previous comment, I am unsure whether this will address our interface conversion errors. They seem to be happening intermittently on 0.10.4 and more frequently for EKS users on 0.10.6. If a few people could run the RC version to see whether it alleviates the issue, that would help determine the best path forward.

jrnt30 avatar Feb 05 '19 05:02 jrnt30

I've just run the RC version and am still seeing the intermittent error:

botocore.exceptions.NoCredentialsError: Unable to locate credentials
time="2019-02-05T15:46:25Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:25Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:26Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:26Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-hxvqg pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:26Z" level=debug msg="Pod OnDelete" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-hxvqg pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:26Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:27Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:27Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:28Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:28Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2173819697.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.173819697e+09 res.status=404
time="2019-02-05T15:46:29Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:29Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:30Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:30Z" level=debug msg="Pod OnUpdate" pod.iam.role= pod.name=kube-downscaler-1549307100-rnj4x pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Pending
time="2019-02-05T15:46:31Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:31Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:31Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:31Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2207612285.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.207612285e+09 res.status=404
time="2019-02-05T15:46:44Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:46Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:46Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:46Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:47Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:47Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:48Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:48Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2023309635.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.023309635e+09 res.status=404```

I'm at your disposal for further information.

JBOClara avatar Feb 05 '19 16:02 JBOClara

Thanks for taking the time to collect those logs! If you have found a way to reliably replicate this, sharing it would be very much appreciated.

A few questions:

  • What platform and OS are you running into this on?
  • Would you mind masking the pod IPs in a way that we can see a relationship between the log entries?
  • Are the pods that are failing "stable" or are they undergoing lifecycle changes?
  • Are all of these GETs for the same pod?
  • Are the Pod OnUpdate statements related to that pod?

Also, if you have the /debug endpoint enabled, its output might help us cross-reference the issue.

I will try to dig into the pod indexer logic to see if I can come up with anything, but realistically it will be a few days.

Thanks again!

jrnt30 avatar Feb 05 '19 16:02 jrnt30

  • What platform and OS are you running into this on? Kubernetes v1.10.12, Debian GNU/Linux 9 (stretch), 4.9.0-8-amd64, docker://17.3.2, provisioned with kops 1.11

  • Would you mind masking the pod IPs in a way that we can see a relationship between the log entries?

kube2iam's log when not working:

time="2019-02-05T21:34:43Z" level=debug msg="Pod OnAdd" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T21:34:43Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T21:34:45Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T21:34:51Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T21:34:53Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2653312418.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=2.653312418e+09 res.status=404

my pod's log when not working:

kubectl logs -n default -l run=refresh-creds -p
2019-02-05 21:34:45 INFO     Loading configuration from cluster: kubernetes.default
2019-02-05 21:34:51 INFO     doing something in a loop ...
2019-02-05 21:34:51 INFO     Test logging #1
2019-02-05 21:34:52 INFO     Client STS #1: <botocore.client.STS object at 0x7f1c1c36efd0>
2019-02-05 21:34:52 INFO     ending loop ...
Traceback (most recent call last):
  File "/app/main.py", line 246, in <module>
    loop_controller()
  File "/app/main.py", line 238, in loop_controller
    raise e
  File "/app/main.py", line 220, in loop_controller
    docker, expire_at = ecr_login()
  File "/app/main.py", line 85, in ecr_login
    session = assume_role(ROLE_ARN, "default") # type: Session
  File "/app/main.py", line 67, in assume_role
    account_id = clt.get_caller_identity()["Account"]
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 648, in _make_api_call
    operation_model, request_dict, request_context)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 667, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 132, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in create_request
    operation_name=operation_model.name)
  File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 90, in handler
    return self.sign(operation_name, request)
  File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 157, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
    raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials

debug/store endpoint when not working:

curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
  "namespaceByIP": {},
  "rolesByIP": {},
  "rolesByNamespace": {
    "222-iop-correction-qual-ksl": null,
 	.....
    "weave": null
  }
}

kube2iam log when working:

time="2019-02-05T18:00:13Z" level=debug msg="Pod OnAdd" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T18:00:13Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T18:00:16Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:00:22Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 36243.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=36243 res.status=200
time="2019-02-05T18:00:22Z" level=debug msg="retrieved credentials from sts endpoint: sts.amazonaws.com" ns.name=default pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1
time="2019-02-05T18:00:22Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-role (200) took 427731917.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1 res.duration=4.27731917e+08 res.status=200
time="2019-02-05T18:00:36Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-bmmzw pod.namespace=default pod.status.ip=XXX.YYY.0.3 pod.status.phase=Running
time="2019-02-05T18:00:36Z" level=debug msg="Pod OnDelete" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-bmmzw pod.namespace=default pod.status.ip=XXX.YYY.0.3 pod.status.phase=Running
time="2019-02-05T18:01:50Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:01:52Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:01:58Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 32940.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=32940 res.status=200
time="2019-02-05T18:01:58Z" level=debug msg="retrieved credentials from sts endpoint: sts.amazonaws.com" ns.name=default pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1
time="2019-02-05T18:01:58Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-role (200) took 75346.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1 res.duration=75346 res.status=200

my pod log when working:

kubectl logs -n default -l run=refresh-creds -p
2019-02-05 18:05:42 INFO     Loading configuration from cluster: kubernetes.default
2019-02-05 18:05:48 INFO     doing something in a loop ...
2019-02-05 18:05:48 INFO     Test logging #1
2019-02-05 18:05:48 INFO     Client STS #1: <botocore.client.STS object at 0x7fd671f182b0>
2019-02-05 18:06:13 INFO     Account ID #1: ACCOUNT_ID
2019-02-05 18:06:39 INFO     Account ID #2: ACCOUNT_ID_2
2019-02-05 18:06:39 INFO     ECR CLIENT: <botocore.client.ECR object at 0x7fd671d7e240>

debug/store endpoint when working:

curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
  "namespaceByIP": {
    "XXX.YYY.0.1": "default"
  },
  "rolesByIP": {
    "XXX.YYY.0.1": "arn:aws:iam::ACCOUNT_ID:role/my-role"
  },
  "rolesByNamespace": {
    "222-iop-correction-qual-ksl": null,
  	....
    "weave": null
  }
}
  • Are the pods that are failing "stable" or are they undergoing lifecycle changes? They are transitioning from Pending to Running, but it also seems to fail when they are already Running.

  • Are all of these GETs for the same pod? The logs in this comment were captured from a tainted node. Only a few daemonsets run on this node, and my app's pod is the only one running with a kube2iam role annotation.

  • Are the Pod OnUpdate statements related to that pod? Yes, for the logs in this comment.

JBOClara avatar Feb 05 '19 22:02 JBOClara

I'm getting the same error on 2 different EKS clusters. I had this error on k8s nodes 1.10.3 and now 1.10.11. Things seem to work fine for some period of time and then all newly created pods fail, i.e. after a node has been running for 24 hours or so, newly fetched credentials fail.

JessieAMorris avatar Feb 07 '19 16:02 JessieAMorris

We had one incident today with 0.10.6, so we are downgrading back to 0.10.4. Running EKS with 1.10.3.

I have an entrypoint script in the pod which polls the kube2iam metadata API until the credentials are associated and only then executes the actual pod command. In our single incident this bootstrap script never finished, so from my point of view the bug was there right from pod startup.
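
For illustration, a minimal sketch of that kind of credentials readiness check, written here in Go rather than as a shell script; the metadata path is the standard one kube2iam intercepts, while the two-minute deadline and the CLI shape are assumptions, not our actual entrypoint:

```go
// wait-for-creds: block until the kube2iam proxy returns a role for this pod,
// then run the real command.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"os/exec"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: wait-for-creds <command> [args...]")
		os.Exit(2)
	}
	const url = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
	client := &http.Client{Timeout: 2 * time.Second}
	deadline := time.Now().Add(2 * time.Minute) // hypothetical upper bound

	for time.Now().Before(deadline) {
		resp, err := client.Get(url)
		if err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// kube2iam answers this path with the role name once the pod IP is associated.
			if resp.StatusCode == http.StatusOK && len(body) > 0 {
				cmd := exec.Command(os.Args[1], os.Args[2:]...)
				cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
				if err := cmd.Run(); err != nil {
					os.Exit(1)
				}
				return
			}
		}
		time.Sleep(2 * time.Second)
	}
	fmt.Fprintln(os.Stderr, "timed out waiting for kube2iam to associate credentials")
	os.Exit(1)
}
```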

garo avatar Feb 07 '19 17:02 garo

As a test, I've implemented self-destruction logic in my application pod: if it's not able to get credentials, it self-destructs.

And sometimes, this works...

But most of the time, it only works right after the kube2iam pod itself has come up.

Unfortunately, I have to consider switching to kiam.

JBOClara avatar Feb 07 '19 18:02 JBOClara

We've begun noticing issues with 0.10.6 as well. Downgrading to 0.10.4 seems to work.

wmorgan6796 avatar Feb 07 '19 19:02 wmorgan6796

Can also confirm seeing this issue in EKS; going from tag:latest to tag:0.10.4 fixed this 100% for us.

lanmalkieri avatar Feb 08 '19 04:02 lanmalkieri

Just experienced this on kube2iam 0.10.6, NOT running in EKS but with a kops cluster using weave.

Most of our deployments were okay, but two of them were getting 500s from kube2iam, and the kube2iam logs had a similar message for those 2 pods. What I noticed is that these are our deployments that use the Recreate strategy with only 1 replica, which happens to result in the new pod keeping the same pod IP address. Is there a chance the pod IP lookup is broken? We ran for almost 10 minutes this way, and I was able to recover by deleting the kube2iam pods on the node of my affected pods (which would refresh whatever pod cache exists). Then I downgraded kube2iam to 0.10.4 and did another release without issues.
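
To make that hypothesis concrete, here is a tiny illustration (not kube2iam's actual cache code, which uses a client-go indexer) of how an IP-keyed cache could lose a running pod's entry when a Recreate deployment reuses the old pod's IP and the old pod's delete event is processed after the new pod's add event:

```go
package main

import "fmt"

// ipCache is a stand-in for an IP-keyed pod cache, used only to illustrate
// the suspected event-ordering race.
type ipCache map[string]string // pod IP -> pod name

func (c ipCache) onAdd(ip, name string) { c[ip] = name }
func (c ipCache) onDelete(ip string)    { delete(c, ip) }

func main() {
	c := ipCache{}
	c.onAdd("100.96.1.10", "app-old") // original pod
	c.onAdd("100.96.1.10", "app-new") // Recreate replacement reuses the IP
	c.onDelete("100.96.1.10")         // late delete event for app-old wipes the entry
	name, ok := c["100.96.1.10"]
	fmt.Println(name, ok) // "" false: credential lookups for the running pod now fail
}
```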

krjackso avatar Feb 08 '19 21:02 krjackso

I released a 0.10.7-rc3 release that reverts the logic introduced in #173, which I believe is the main culprit.

The observation about Recreate may be a good lead and something to try; however, I am at a wedding this weekend and unable to invest enough time to test those scenarios (forgot my laptop charger). I'm pushing this next RC release to see if it resolves some people's issues.

jrnt30 avatar Feb 09 '19 06:02 jrnt30

I have the same error with 0.10.7-rc3. I can't even get it to work intermittently.

I've noticed this error:

curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
pod with specificed IP not found

When enabling verbose DEBUG logging in my application, I noticed socket timeouts:

2019-02-09 10:51:52 DEBUG    Looking for credentials via: boto-config
2019-02-09 10:51:52 DEBUG    Looking for credentials via: container-role
2019-02-09 10:51:52 DEBUG    Looking for credentials via: iam-role
2019-02-09 10:51:53 DEBUG    Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/meta-data/iam/security-credentials/: Read timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/botocore/httpsession.py", line 258, in send
    decode_content=False,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 343, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: AWSHTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=1)

During handling of the above exception, another exception occurred:

The response time is pretty high:

time curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
pod with specificed IP not found

real	0m2.278s
user	0m0.004s
sys	0m0.000s

And the debug/store:

curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
  "namespaceByIP": {},
  "rolesByIP": {},
  "rolesByNamespace": {

But, with kube2iam version 0.10.4:

debug/store is correctly populated:

{
  "namespaceByIP": {
    "100.114.0.1": "default",
    "100.114.0.2": "default"
  },
  "rolesByIP": {
    "100.114.0.1": "my-role",
    "100.114.0.2": ""
  },
  "rolesByNamespace": {

My app retrieves the role's creds:

2019-02-09 10:56:52 DEBUG    Found credentials from IAM Role: my-role
2019-02-09 10:56:52 DEBUG    Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/endpoints.json
2019-02-09 10:56:52 DEBUG    Event choose-service-name: calling handler <function handle_service_name_alias at 0x7f537d039598>
2019-02-09 10:56:52 DEBUG    Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/sts/2011-06-15/service-2.json
2019-02-09 10:56:52 DEBUG    Event creating-client-class.sts: calling handler <function add_generate_presigned_url at 0x7f537d072e18>
2019-02-09 10:56:52 DEBUG    The s3 config key is not a dictionary type, ignoring its value of: None
2019-02-09 10:56:52 DEBUG    Setting sts timeout as (60, 60)
2019-02-09 10:56:52 DEBUG    Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/_retry.json
2019-02-09 10:56:52 DEBUG    Registering retry handlers for service: sts
2019-02-09 10:56:52 INFO     Client STS #1: <botocore.client.STS object at 0x7f537b11c198>
2019-02-09 10:56:52 DEBUG    Event before-parameter-build.sts.GetCallerIdentity: calling handler <function generate_idempotent_uuid at 0x7f537d055b70>
2019-02-09 10:56:52 DEBUG    Event before-call.sts.GetCallerIdentity: calling handler <function inject_api_version_header_if_needed at 0x7f537cfd9a60>
2019-02-09 10:56:52 DEBUG    Making request for OperationModel(name=GetCallerIdentity) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.9.86 Python/3.7.2 Linux/4.9.0-8-amd64 Botocore/1.12.86'}, 'body': {'Action': 'GetCallerIdentity', 'Version': '2011-06-15'}, 'url': 'https://sts.amazonaws.com/', 'context': {'client_region': 'aws-global', 'client_config': <botocore.config.Config object at 0x7f537b11c470>, 'has_streaming_input': False, 'auth_type': None}}
2019-02-09 10:56:52 DEBUG    Event request-created.sts.GetCallerIdentity: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7f537b11c438>>
2019-02-09 10:56:52 DEBUG    Event choose-signer.sts.GetCallerIdentity: calling handler <function set_operation_specific_signer at 0x7f537d055a60>
2019-02-09 10:56:52 DEBUG    Calculating signature using v4 auth.
2019-02-09 10:56:52 DEBUG    CanonicalRequest:
POST

JBOClara avatar Feb 09 '19 10:02 JBOClara

Saw this today on EKS GitVersion:"v1.10.11-eks". The kube2iam pod was running 0.10.4. I killed the pod, and the daemonset brought another back, this time with 0.10.6 and it is doing what it is supposed to.

All the other worker nodes running kube2iam are "older" than the one that had the issue. It almost feels like AWS is ripping something out from under kube2iam and it gets into this state?

metacyclic avatar Feb 11 '19 20:02 metacyclic

We are experiencing this issue as well: EKS 1.10.11, aws-vpc-cni-k8s 1.30. 0.10.4 seems to resolve the issue; it occurred on 0.10.6.

conversicachrisr avatar Feb 15 '19 18:02 conversicachrisr