
What does exit code 2 mean?

Open Magicloud opened this issue 4 years ago • 6 comments

I have a few kube2iam 0.10.0 containers running, and they have all been restarted with exit code 2.

While I troubleshoot, I'd like to know: what does exit code 2 mean?

Magicloud avatar Jan 01 '20 10:01 Magicloud

I've noticed the same problem in my installation. Using kubectl get events -n <namespace> I can see that in my case the exit code is 2 as well, and the containers are restarted because of a failing liveness probe:

kubectl get events --namespace=infra-system --sort-by='{.lastTimestamp}'
LAST SEEN   TYPE      REASON          OBJECT                                     MESSAGE
60m         Normal    Started         pod/kube2iam-tpw4v                         Started container kube2iam
60m         Normal    Created         pod/kube2iam-tpw4v                         Created container kube2iam
60m         Normal    Pulled          pod/kube2iam-tpw4v                         Container image "jtblin/kube2iam:0.10.9" already present on machine
60m         Normal    Killing         pod/kube2iam-tpw4v                         Container kube2iam failed liveness probe, will be restarted

I believe the current liveness probe logic is wrong. A liveness probe should not check a dependency the way kube2iam's does: https://github.com/jtblin/kube2iam/blob/42d4453e7650859d14deedbe42ed5cf6b60ba020/server/server.go#L224-L260
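For illustration, here's a minimal sketch in Go (handler names and the metadata URL are mine, not kube2iam's actual code) of the difference between a liveness handler that checks an external dependency and one that only confirms the process is serving; port 8181 matches the chart's probe config:

```go
// Illustrative sketch only, not kube2iam's actual code.
package main

import (
	"net/http"
	"time"
)

// dependencyHealthz fails whenever the EC2 metadata API is unreachable,
// so a transient AWS hiccup can get the whole container restarted.
func dependencyHealthz(w http.ResponseWriter, r *http.Request) {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://169.254.169.254/latest/meta-data/instance-id")
	if err != nil {
		http.Error(w, "metadata API unreachable", http.StatusInternalServerError)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		http.Error(w, "metadata API unhealthy", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// plainHealthz only answers "the process is alive and serving HTTP",
// which is all a liveness probe strictly needs to know.
func plainHealthz(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", plainHealthz)
	http.HandleFunc("/healthz-deep", dependencyHealthz)
	http.ListenAndServe(":8181", nil)
}
```

A check like dependencyHealthz is arguably better suited to a readiness probe, where failure just removes the pod from service instead of killing it.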

@Jacobious52 @jtblin @mwhittington21 what do you think?

ltagliamonte-dd avatar Jun 08 '20 20:06 ltagliamonte-dd

I've also noticed that the default healthcheck period is 30s (https://github.com/jtblin/kube2iam/blob/42d4453e7650859d14deedbe42ed5cf6b60ba020/server/server.go#L42), while the helm chart I'm using (from the stable chart repo) uses the following probe config (which can't be customized atm):

Liveness: http-get http://:8181/healthz delay=30s timeout=1s period=5s #success=1 #failure=3

The liveness policy in the chart checks the endpoint every 5s and considers the pod unhealthy after 3 consecutive failures, i.e. after 15s. But since kube2iam only refreshes its internal health state every 30s, a single failed internal check is served back to the kubelet for up to 30s, so the probe fails 3 times in a row and the pod is restarted before the state can recover.
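To make that interaction concrete, here's a rough sketch (assumed structure and names, not the actual kube2iam implementation) of a cached health state refreshed on a 30s ticker, where the probe only ever reads the cache:

```go
// Assumed structure and names; not the actual kube2iam implementation.
package main

import (
	"net"
	"net/http"
	"sync/atomic"
	"time"
)

var healthy atomic.Bool // cached result of the last dependency check

// checkDependency stands in for kube2iam's real check against the
// EC2 metadata API; details elided.
func checkDependency() bool {
	conn, err := net.DialTimeout("tcp", "169.254.169.254:80", 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	healthy.Store(true)
	// Refresh the cached state every 30s -- the default interval under discussion.
	go func() {
		for range time.Tick(30 * time.Second) {
			healthy.Store(checkDependency())
		}
	}()

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// The probe never triggers a fresh check; it only reads the cache,
		// so one bad refresh answers "unhealthy" for up to 30 seconds --
		// long enough for a period=5s, failure=3 probe to kill the pod.
		if healthy.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "unhealthy", http.StatusInternalServerError)
	})
	http.ListenAndServe(":8181", nil)
}
```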

ltagliamonte-dd avatar Jun 08 '20 20:06 ltagliamonte-dd

I believe the current liveness probe logic is wrong. A liveness probe should not check a dependency the way kube2iam's does:

While a liveness probe ideally shouldn't check dependencies, when those dependencies cause the application to completely cease to function (and not in a transparent way), I would argue it's OK. Restarting the container may fix the issue; if it keeps restarting, then you are alerted to a problem. There may be a better way to surface the problem than restarting, though.

I've also noticed that the default healthcheck period is 30s, while the helm chart I'm using (from the stable chart repo) uses the following probe config (which can't be customized atm)

This seems like the real issue here. One failed healthcheck shouldn't nuke your container. The discrepancy between the stable chart's probe settings and the default healthcheck period of 30s is a misconfiguration that should be fixed in the helm chart.

mwhittington21 avatar Jun 09 '20 04:06 mwhittington21

@mwhittington21 I've already submitted a PR to the helm repo that allows customizing the liveness probe parameters: https://github.com/helm/charts/pull/22717

ltagliamonte-dd avatar Jun 09 '20 05:06 ltagliamonte-dd

From my Splunk logs it is evident that the liveness probe fails: [screenshot: Splunk logs showing liveness probe failures, Jun 08 2020]

@mwhittington21 what do you think about adding retry logic to the check, like retrying 3 times before setting the state to failed?
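A hypothetical sketch of what that retry logic could look like (checkDependency is a stand-in for the real metadata check, and none of this is committed code): keep a failure counter in the refresh loop and only flip the cached state after N consecutive misses:

```go
// Hypothetical retry sketch; healthy and checkDependency mirror the
// cached-state example above.
package health

import (
	"net"
	"sync/atomic"
	"time"
)

var healthy atomic.Bool

// checkDependency is a stand-in for the real EC2 metadata API check.
func checkDependency() bool {
	conn, err := net.DialTimeout("tcp", "169.254.169.254:80", 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// RefreshWithRetries flips the cached state to failed only after
// maxFailures consecutive bad checks, so a single transient blip does
// not make /healthz fail for a whole refresh interval.
func RefreshWithRetries(interval time.Duration, maxFailures int) {
	failures := 0
	for range time.Tick(interval) {
		if checkDependency() {
			failures = 0
			healthy.Store(true)
			continue
		}
		failures++
		if failures >= maxFailures {
			healthy.Store(false)
		}
	}
}
```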

ltagliamonte-dd avatar Jun 09 '20 05:06 ltagliamonte-dd

The check itself probably shouldn't change, but we should add the ability to interpret its result under a wider variety of situations. I think the following would address the problem while allowing different use cases:

  • Allow customising how often the healthcheck runs inside kube2iam, as 30s isn't suitable for all use cases (a rough flag sketch follows below)
  • Fix the default helm chart to have saner defaults, and allow customisation of the liveness probe so it doesn't terminate the pod after a single failure
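For the first point, the wiring could look roughly like this (flag names are illustrative only, not kube2iam's actual CLI):

```go
// Hypothetical flag wiring; kube2iam's real flag names may differ.
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	interval := flag.Duration("healthcheck-interval", 30*time.Second,
		"how often kube2iam refreshes its cached health state")
	threshold := flag.Int("healthcheck-failures", 3,
		"consecutive failed checks before /healthz reports unhealthy")
	flag.Parse()

	fmt.Printf("refreshing health every %s, failing after %d bad checks\n",
		*interval, *threshold)
	// go RefreshWithRetries(*interval, *threshold) // see the earlier sketch
}
```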

mwhittington21 avatar Jun 09 '20 05:06 mwhittington21