kube2iam
Role not being picked up by a container in an annotated pod
A GitLab Runner pod is running on a Kubernetes cluster that runs kube2iam; the runner spins up build pods with two containers, "build" and "helper".
The "build" container's AWS calls are intercepted correctly and the right role is assumed, but the "helper" container's calls are either not intercepted or the annotation is not recognized, and kube2iam seems to fall back to the default role. Eventually this causes GitLab's cache functionality to return a 403 because the wrong role is assumed.
Has anyone experienced this issue before?
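For anyone trying to reproduce this, the quickest check is to ask the metadata endpoint from inside each container which role kube2iam serves it (a minimal sketch; the pod and container names come from the describe output below, and `wget` is assumed to exist in both images):

```sh
# Ask kube2iam (which intercepts the metadata API on the node) which role
# each container is being served. The endpoint returns the role name.
POD=runner-xqul42y4-project-149-concurrent-0b27w7

kubectl exec -n gitlab "$POD" -c build -- \
  wget -qO- http://169.254.169.254/latest/meta-data/iam/security-credentials/

kubectl exec -n gitlab "$POD" -c helper -- \
  wget -qO- http://169.254.169.254/latest/meta-data/iam/security-credentials/
```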
kube2iam logs for the build container:

```
time="2019-08-07T13:36:02Z" level=debug msg="retrieved credentials from sts endpoint: https://sts.eu-west-1.amazonaws.com" ns.name=gitlab pod.iam.role="arn:aws:iam::XXXXXXXXXX:role/gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV" req.method=GET req.path=/latest/meta-data/iam/security-credentials/gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV req.remote=10.137.4.97
```
kube2iam logs for the helper container:

```
time="2019-08-07T13:37:29Z" level=warning msg="Using fallback role for IP 10.137.4.156"
time="2019-08-07T13:37:29Z" level=debug msg="retrieved credentials from sts endpoint: https://sts.eu-west-1.amazonaws.com" ns.name=gitlab pod.iam.role="arn:aws:iam::XXXXXXXXXX:role/kube2iam-default" req.method=GET req.path=/latest/meta-data/iam/security-credentials/kube2iam-default req.remote=10.137.4.156
time="2019-08-07T13:37:29Z" level=info msg="GET /latest/meta-data/iam/security-credentials/kube2iam-default (200) took 68986.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/kube2iam-default req.remote=10.137.4.156 res.duration=68986 res.status=200
```
The build pod configuration:
```
Name: runner-xqul42y4-project-149-concurrent-0b27w7
Namespace: gitlab
Priority: 0
Node: ip-10-137-4-189.eu-west-1.compute.internal/10.137.4.189
Start Time: Wed, 07 Aug 2019 15:35:14 +0200
Labels: pod=runner-xqul42y4-project-149-concurrent-0
Annotations: iam.amazonaws.com/role: arn:aws:iam::296095062504:role/gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV
kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.137.4.97
Containers:
build:
Container ID: docker://13a1ef0798db732550e39035b62e77b866ffb4338696b2d08b696fa2c3344122
Image: XXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/platform/frontend-builder:7f2369e9-59903
Image ID: docker-pullable://XXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/platform/frontend-builder@sha256:086fbd257efba5e7fcbe432ced3ef36d8dc2424ae3e897d3f47f84572d382df9
Port: <none>
Host Port: <none>
Command:
sh
-c
if [ -x /usr/local/bin/bash ]; then
exec /usr/local/bin/bash
elif [ -x /usr/bin/bash ]; then
exec /usr/bin/bash
elif [ -x /bin/bash ]; then
exec /bin/bash
elif [ -x /usr/local/bin/sh ]; then
exec /usr/local/bin/sh
elif [ -x /usr/bin/sh ]; then
exec /usr/bin/sh
elif [ -x /bin/sh ]; then
exec /bin/sh
elif [ -x /busybox/sh ]; then
exec /busybox/sh
else
echo shell not found
exit 1
fi
State: Running
Started: Wed, 07 Aug 2019 15:35:15 +0200
Ready: True
Restart Count: 0
Environment: REMOVED
Mounts:
/builds from repo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-lrtn9 (ro)
helper:
Container ID: docker://a2ac54aa03bccab1010d0595b0ad6958a2a448f44040663020f0466a7172674a
Image: gitlab/gitlab-runner-helper:x86_64-de7731dd
Image ID: docker-pullable://gitlab/gitlab-runner-helper@sha256:a68dc1b0468d5d01b2b70b85aa90acfbb13434e0ae84b1fea5bedccaa9847301
Port: <none>
Host Port: <none>
Command:
sh
-c
if [ -x /usr/local/bin/bash ]; then
exec /usr/local/bin/bash
elif [ -x /usr/bin/bash ]; then
exec /usr/bin/bash
elif [ -x /bin/bash ]; then
exec /bin/bash
elif [ -x /usr/local/bin/sh ]; then
exec /usr/local/bin/sh
elif [ -x /usr/bin/sh ]; then
exec /usr/bin/sh
elif [ -x /bin/sh ]; then
exec /bin/sh
elif [ -x /busybox/sh ]; then
exec /busybox/sh
else
echo shell not found
exit 1
fi
State: Running
Started: Wed, 07 Aug 2019 15:35:15 +0200
Ready: True
Restart Count: 0
Environment: REMOVED
Mounts:
/builds from repo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-lrtn9 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
repo:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
default-token-lrtn9:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-lrtn9
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 100s default-scheduler Successfully assigned gitlab/runner-xqul42y4-project-149-concurrent-0b27w7 to ip-10-137-4-189.eu-west-1.compute.internal
Normal Pulled 100s kubelet, ip-10-137-4-189.eu-west-1.compute.internal Container image "296095062504.dkr.ecr.eu-west-1.amazonaws.com/platform/frontend-builder:7f2369e9-59903" already present on machine
Normal Created 100s kubelet, ip-10-137-4-189.eu-west-1.compute.internal Created container
Normal Started 99s kubelet, ip-10-137-4-189.eu-west-1.compute.internal Started container
Normal Pulled 99s kubelet, ip-10-137-4-189.eu-west-1.compute.internal Container image "gitlab/gitlab-runner-helper:x86_64-de7731dd" already present on machine
Normal Created 99s kubelet, ip-10-137-4-189.eu-west-1.compute.internal Created container
Normal Started 99s kubelet, ip-10-137-4-189.eu-west-1.compute.internal Started container
```
As you can see, the annotation is there and the pod has two containers.
The IP 10.137.4.156 in the logs corresponds to the parent runner pod (the one that launches the child pod with the two containers); see the check below.
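A quick way to map the `req.remote` IP from the kube2iam logs back to a pod (a sketch; the `status.podIP` field selector is assumed to be supported by the cluster's Kubernetes version):

```sh
# Find which pod owns the IP kube2iam reported in the fallback warning.
kubectl get pods --all-namespaces -o wide \
  --field-selector status.podIP=10.137.4.156
```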
The runner pod configuration:

```
Name: gitlab-runner-shared-86ddbfdd59-fv9q9
Namespace: gitlab
Priority: 0
Node: ip-10-137-4-189.eu-west-1.compute.internal/10.137.4.189
Start Time: Wed, 07 Aug 2019 14:38:40 +0200
Labels: app.kubernetes.io/app=gitlab-runner-shared
pod-template-hash=86ddbfdd59
Annotations: kubernetes.io/psp: eks.privileged
prometheus.io/port: 9252
prometheus.io/scrape: true
Status: Running
IP: 10.137.4.156
Controlled By: ReplicaSet/gitlab-runner-shared-86ddbfdd59
Init Containers:
configure:
Container ID: docker://53b9ee3e1ff8a5209025912441a0157586548aa39f6249945a6677c1f91500fa
Image: gitlab/gitlab-runner:alpine
Image ID: docker-pullable://gitlab/gitlab-runner@sha256:efdf04d68586fa6a203b25354f7eafab37c2ef2ae7df2fe22a944fe6d0662085
Port: <none>
Host Port: <none>
Command:
sh
/config/configure
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 07 Aug 2019 14:38:41 +0200
Finished: Wed, 07 Aug 2019 14:38:41 +0200
Ready: True
Restart Count: 0
Environment:
CI_SERVER_URL: https://gitlab.leaseplan.io
CLONE_URL:
RUNNER_EXECUTOR: kubernetes
REGISTER_LOCKED: false
RUNNER_TAG_LIST: k8s, shared
KUBERNETES_IMAGE: alpine:latest
KUBERNETES_NAMESPACE: gitlab
KUBERNETES_CPU_LIMIT:
KUBERNETES_MEMORY_LIMIT:
KUBERNETES_CPU_REQUEST:
KUBERNETES_MEMORY_REQUEST:
KUBERNETES_SERVICE_ACCOUNT:
KUBERNETES_SERVICE_CPU_LIMIT:
KUBERNETES_SERVICE_MEMORY_LIMIT:
KUBERNETES_SERVICE_CPU_REQUEST:
KUBERNETES_SERVICE_MEMORY_REQUEST:
KUBERNETES_HELPER_CPU_LIMIT:
KUBERNETES_HELPER_MEMORY_LIMIT:
KUBERNETES_HELPER_CPU_REQUEST:
KUBERNETES_HELPER_MEMORY_REQUEST:
KUBERNETES_HELPER_IMAGE:
KUBERNETES_PULL_POLICY:
CACHE_TYPE: s3
CACHE_PATH:
CACHE_SHARED: true
CACHE_S3_SERVER_ADDRESS: s3.amazonaws.com
CACHE_S3_BUCKET_NAME: gitlab-gitlabrunnerstack-5ixlxs1i4isf-cachebucket-1fod5sdo77cht
CACHE_S3_BUCKET_LOCATION: eu-west-1
GITLAB_RUNNER_ROLE: gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV
AWS_ACCOUNT: 296095062504
RUN_UNTAGGED: true
Mounts:
/config from scripts (ro)
/init-secrets from init-runner-secrets (ro)
/secrets from runner-secrets (rw)
/var/run/secrets/kubernetes.io/serviceaccount from gitlab-runner-shared-token-hwnnz (ro)
Containers:
gitlab-runner:
Container ID: docker://b5f47b3805e91ae20d9a3661750d1f7a7619e2315237d9fa8de792e33ff4b530
Image: gitlab/gitlab-runner:alpine
Image ID: docker-pullable://gitlab/gitlab-runner@sha256:efdf04d68586fa6a203b25354f7eafab37c2ef2ae7df2fe22a944fe6d0662085
Port: 9252/TCP
Host Port: 0/TCP
Command:
/bin/bash
/scripts/entrypoint
State: Running
Started: Wed, 07 Aug 2019 14:38:43 +0200
Ready: True
Restart Count: 0
Liveness: exec [/bin/bash /scripts/check-live] delay=60s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [/usr/bin/pgrep gitlab.*runner] delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
CI_SERVER_URL: https://gitlab.leaseplan.io
CLONE_URL:
RUNNER_EXECUTOR: kubernetes
REGISTER_LOCKED: false
RUNNER_TAG_LIST: k8s, shared
KUBERNETES_IMAGE: alpine:latest
KUBERNETES_NAMESPACE: gitlab
KUBERNETES_CPU_LIMIT:
KUBERNETES_MEMORY_LIMIT:
KUBERNETES_CPU_REQUEST:
KUBERNETES_MEMORY_REQUEST:
KUBERNETES_SERVICE_ACCOUNT:
KUBERNETES_SERVICE_CPU_LIMIT:
KUBERNETES_SERVICE_MEMORY_LIMIT:
KUBERNETES_SERVICE_CPU_REQUEST:
KUBERNETES_SERVICE_MEMORY_REQUEST:
KUBERNETES_HELPER_CPU_LIMIT:
KUBERNETES_HELPER_MEMORY_LIMIT:
KUBERNETES_HELPER_CPU_REQUEST:
KUBERNETES_HELPER_MEMORY_REQUEST:
KUBERNETES_HELPER_IMAGE:
KUBERNETES_PULL_POLICY:
CACHE_TYPE: s3
CACHE_PATH:
CACHE_SHARED: true
CACHE_S3_SERVER_ADDRESS: s3.amazonaws.com
CACHE_S3_BUCKET_NAME: gitlab-gitlabrunnerstack-5ixlxs1i4isf-cachebucket-1fod5sdo77cht
CACHE_S3_BUCKET_LOCATION: eu-west-1
GITLAB_RUNNER_ROLE: gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV
AWS_ACCOUNT: 296095062504
RUN_UNTAGGED: true
Mounts:
/home/gitlab-runner/.gitlab-runner from etc-gitlab-runner (rw)
/scripts from scripts (rw)
/secrets from runner-secrets (rw)
/var/run/secrets/kubernetes.io/serviceaccount from gitlab-runner-shared-token-hwnnz (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
init-runner-secrets:
Type: Projected (a volume that contains injected data from multiple sources)
SecretName: gitlab-runner-registration-token-shared-8c95cbthk6
SecretOptionalName: <nil>
runner-secrets:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
etc-gitlab-runner:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: gitlab-runner-shared-42fg8df727
Optional: false
gitlab-runner-shared-token-hwnnz:
Type: Secret (a volume populated by a Secret)
SecretName: gitlab-runner-shared-token-hwnnz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
```
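Since the fallback requests come from the runner pod's IP, and the output above shows the runner pod only carries the eks.privileged and prometheus annotations (no iam.amazonaws.com/role), one possible workaround (a sketch, untested; names taken from the describe output above) is to add the annotation to the runner deployment's pod template as well:

```sh
# Add the kube2iam annotation to the runner's pod template, so requests
# originating from the runner pod assume the same role as the build pods.
kubectl patch deployment gitlab-runner-shared -n gitlab --type merge -p '
spec:
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: arn:aws:iam::296095062504:role/gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV
'
```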
Yeah, I'm having a similar situation with Jenkins + k8s agents. Sometimes the pod gets the correct assumed role, sometimes it doesn't. Not sure why.
Also experiencing this issue. Most of the time the correct role gets assumed, but it intermittently falls back to the worker group role. If anyone needs more info, please feel free to reach out.
Based on our findings so far, the root cause is mostly tied to using a random name for the pod label. When we let Jenkins generate the label randomly, the error rate (the pod failing to get the correct credentials) was very high, up to 50%. Once we switched to a static label in each Jenkins pod spec per job, the error rate dropped dramatically.
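If kube2iam is running with `--debug` enabled, its internal store of IP-to-role mappings can be dumped over HTTP, which helps distinguish a missing annotation from a stale cache entry after pods churn (a sketch; 8181 is kube2iam's default port, and the node IP is a placeholder from this thread):

```sh
# Dump kube2iam's view of pod IPs and their resolved roles on one node.
NODE_IP=10.137.4.189  # node from the describe output above
curl -s "http://${NODE_IP}:8181/debug/store"
```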
It is not a complete solution, since we still hit the problem, just at a much smaller rate. Normal workloads (like application deployments) usually work fine because they can tolerate a few seconds of failure when fetching IAM credentials before retrying. I suspect there is nothing we can do about this on our side, since it depends on kube2iam. For now we just retry the job when it fails.
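For the retry approach, the wait can also happen inside the job itself, before any real AWS call is made (a minimal sketch, assuming the AWS CLI is present in the job image; the role name is this thread's runner role):

```sh
# Poll until kube2iam hands out the expected role, then continue the job.
EXPECTED_ROLE="gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV"
for i in $(seq 1 30); do
  ARN=$(aws sts get-caller-identity --query Arn --output text 2>/dev/null)
  case "$ARN" in
    *"$EXPECTED_ROLE"*) echo "correct role assumed"; break ;;
    *) echo "got ${ARN:-nothing}, retrying ($i)"; sleep 2 ;;
  esac
done
```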