concourse-chart
PreStop Hook exited with 137 blocking clean `kubectl delete pod`
The following command gets stuck for a long time:
smoke@rkirilov-work-pc ~ $ kubectl delete pod -n ci concourse-ci-worker-0
pod "concourse-ci-worker-0" deleted
When I describe the pod it is clear that the PreStop hook did not exit cleanly:
smoke@rkirilov-work-pc ~ $ kubectl describe pod -n ci concourse-ci-worker-0 | cat | tail -n 12
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 79s default-scheduler Successfully assigned ci/concourse-ci-worker-0 to ip-10-200-3-38.ec2.internal
Normal Pulled 78s kubelet, ip-10-200-3-38.ec2.internal Container image "concourse/concourse:5.8.0" already present on machine
Normal Created 78s kubelet, ip-10-200-3-38.ec2.internal Created container concourse-ci-worker-init-rm
Normal Started 78s kubelet, ip-10-200-3-38.ec2.internal Started container concourse-ci-worker-init-rm
Normal Pulled 72s kubelet, ip-10-200-3-38.ec2.internal Container image "concourse/concourse:5.8.0" already present on machine
Normal Created 72s kubelet, ip-10-200-3-38.ec2.internal Created container concourse-ci-worker
Normal Started 72s kubelet, ip-10-200-3-38.ec2.internal Started container concourse-ci-worker
Normal Killing 54s kubelet, ip-10-200-3-38.ec2.internal Stopping container concourse-ci-worker
Warning FailedPreStopHook 11s kubelet, ip-10-200-3-38.ec2.internal Exec lifecycle hook ([/bin/bash /pre-stop-hook.sh]) for Container "concourse-ci-worker" in Pod "concourse-ci-worker-0_ci(8688f7aa-6444-11ea-9917-0ad140727ba9)" failed - error: command '/bin/bash /pre-stop-hook.sh' exited with 137: , message: ""
So the only workaround is to force delete the pod:
smoke@rkirilov-work-pc ~ $ kubectl delete pod --force --grace-period=0 -n ci concourse-ci-worker-0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "concourse-ci-worker-0" force deleted
~~Maybe /pre-stop-hook.sh should be patched to handle (trap) the relevant signals (e.g. SIGTERM, SIGINT, SIGHUP) and exit cleanly. I assume that when dumb-init is signaled, it tries on its own to cleanly terminate /pre-stop-hook.sh, and since the script does not terminate cleanly, it gets killed with exit code 137, which then blocks K8S.~~

~~I will give it a try and will update the ticket, hopefully with a PR.~~
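For reference, the trap variant (the struck-out idea above) would have looked roughly like this; just a sketch, never tested, keeping the chart's existing shutdownSignal line:

```bash
#!/bin/bash
# Sketch of the (abandoned) trap idea: exit 0 on the usual termination
# signals instead of letting the script get killed with 137.
trap 'exit 0' SIGTERM SIGINT SIGHUP
kill -s {{ .Values.concourse.worker.shutdownSignal }} 1
while [ -e /proc/1 ]; do sleep 1; done
```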
Actually, K8S only waits for the PreStop hook for terminationGracePeriodSeconds, then sends SIGTERM to the containers, and then SIGKILLs all remaining processes after 2 more seconds, as per https://github.com/kubernetes/kubernetes/issues/39170#issuecomment-448195287 and https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
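For example, you can check which grace period actually applies to the worker pod and, if needed, bump it on the StatefulSet (the names below match my namespace/release and may differ in yours; a later helm upgrade will overwrite a manual patch):

```bash
# What grace period is the worker pod currently running with?
kubectl get pod -n ci concourse-ci-worker-0 \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}'

# Give the PreStop hook more time before the kubelet falls back to SIGKILL
kubectl patch statefulset -n ci concourse-ci-worker --type merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":600}}}}'
```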
The strange thing, however, is that the pod is left in Terminating state for many more minutes and doesn't seem to restart.
So maybe the best course of action would be to use timeout -k {{ .Values.worker.terminationGracePeriodSeconds }} bash -c 'while [ -e /proc/1 ]; do sleep 1; done' or something similar, I guess. This way at least the delete command will not be blocked.
Also, it is important to increase .Values.worker.terminationGracePeriodSeconds to something that makes sense for your own pipelines.
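For example, something along these lines (the release name and chart reference are just how an install might look; adjust to yours):

```bash
# Give the worker e.g. 10 minutes of drain time before Kubernetes gives up
helm upgrade concourse-ci concourse/concourse \
  --namespace ci \
  --reuse-values \
  --set worker.terminationGracePeriodSeconds=600
```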
I tried a quick patch with your suggestion:
diff --git a/templates/worker-prestop-configmap.yaml b/templates/worker-prestop-configmap.yaml
index 9d5dd31..9f43a76 100644
--- a/templates/worker-prestop-configmap.yaml
+++ b/templates/worker-prestop-configmap.yaml
@@ -11,5 +11,5 @@ data:
   pre-stop-hook.sh: |
     #!/bin/bash
     kill -s {{ .Values.concourse.worker.shutdownSignal }} 1
-    while [ -e /proc/1 ]; do sleep 1; done
+    timeout -k {{ .Values.worker.terminationGracePeriodSeconds }} {{ .Values.worker.terminationGracePeriodSeconds }} /bin/bash -c 'while [ -e /proc/1 ]; do sleep 1; done'
The script still exits with a non-zero exit code, 124 in this case (124 is timeout's exit status when the command it runs times out):
Warning FailedPreStopHook 1s kubelet, gke-topgun-topgun-worker-2c49df4e-qwh6 Exec lifecycle hook ([/bin/bash /pre-stop-hook.sh]) for Container "issue81-worker" in Pod "issue81-worker-0_issue81(4ad690c9-d362-48d8-9e5a-c5e873b5571e)" failed - error: command '/bin/bash /pre-stop-hook.sh' exited with 124: , message: ""
Not sure what a good solution for this one is 🤔
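One thing that might at least stop the 124 from surfacing (untested sketch, and it does mask real failures; if the grace period expires first, the kubelet will still kill the hook) is to swallow timeout's exit status inside the hook:

```bash
#!/bin/bash
kill -s {{ .Values.concourse.worker.shutdownSignal }} 1
# timeout exits 124 when the wait expires; treat that as success so the hook
# itself reports 0 instead of 124.
timeout -k 5 {{ .Values.worker.terminationGracePeriodSeconds }} \
  /bin/bash -c 'while [ -e /proc/1 ]; do sleep 1; done' || true
```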
To reproduce this, I installed the Helm chart with default settings and started this long-running job:
---
jobs:
- name: simple-job
  plan:
  - task: simple-task
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      run:
        path: /bin/sh
        args:
        - -c
        - |
          #!/bin/sh
          sleep 1h
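Setting and triggering that pipeline looks roughly like this with fly (the target and pipeline names are just placeholders):

```bash
fly -t ci set-pipeline -p issue81 -c pipeline.yml
fly -t ci unpause-pipeline -p issue81
fly -t ci trigger-job -j issue81/simple-job
```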
I then deleted the pod
$ kubectl delete pod -n issue81 issue81-worker-0
and kept describing the pod until I saw the relevant error:
$ k describe pod -n issue81 issue81-worker-0 | tail -n 10
@taylorsilva I confirm your findings, and I don't have a better workaround than increasing the timeout and manually intervening when such things happen :(
Having the same issue on Concourse v5.7.1.
Hi, I have the same error. I attached a pre-stop hook script containing a 10-second sleep and deleted the pod. The pre-stop hook script ran, but I still got a FailedPreStopHook event with the same exit code 137. This is on EKS with Kubernetes 1.25.