cloud-on-k8s
fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty
The main problem
When upgrading to an image that includes some other plugin, the operator terminates each pod and tries to remove it from the ES cluster.
This assignment can leave NODE_ID empty:
NODE_ID=$(grep "$POD_NAME" "$resp_body" | cut -f 1 -d ' ')
Result
- There is no NODE_ID, so the request URL is broken between _nodes and shutdown:
{"@timestamp": "2024-06-11T09:37:05+00:00", "message": "400 http://<cluster>-es-internal-http.<namespace>.svc:9200/_nodes//shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
- The restarting pod fails each time with error_exit "failed to call node shutdown API", so the shutdown is never requested. The same pod is then recreated and the cycle starts all over again.
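One way to avoid the malformed _nodes//shutdown URL is to fail early when the lookup returns nothing. The following is a minimal sketch, not the script's actual code: extract_node_id is a hypothetical helper, and it assumes the response file holds "<id> <name>" lines as returned by _cat/nodes?full_id=true&h=id,name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): print the node ID
# for a pod name from a file containing "<id> <name>" lines.
extract_node_id() {
  local pod_name=$1 resp_body=$2
  local node_id
  # Compare the name column exactly, so "es-default-1" does not also match
  # "es-default-10" the way a plain grep would.
  node_id=$(awk -v name="$pod_name" '$2 == name { print $1; exit }' "$resp_body")
  if [ -z "$node_id" ]; then
    # Signal failure instead of returning an empty ID to the caller.
    return 1
  fi
  printf '%s\n' "$node_id"
}
```

In the hook this could be used as NODE_ID=$(extract_node_id "$POD_NAME" "$resp_body") || error_exit "failed to retrieve node ID", so that an empty result stops the script before the shutdown request is built.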
What I still want to know
Is the node removed before calling _cat/nodes?
When the node is terminated and pre-stop-hook-script.sh is called, is it possible that the node has already been removed from the _cat/nodes result? Or is it possible that the query ends up on the terminating node and returns no result?
This request returns the list of nodes, and I wonder whether the node of a terminating pod is already missing from this list of active nodes. I have no evidence for this yet: I have not confirmed that NODE_ID is empty because the other nodes in the cluster no longer see the terminating node.
request -X GET "${ES_URL}/_cat/nodes?full_id=true&h=id,name"
Why is terminationGracePeriodSeconds much lower than the possible script run time?
The default terminationGracePeriodSeconds is 180 seconds.
The script also has 2 retry 10 calls, which wait 2 ** count seconds between attempts.
This can result in a lot of wait time:
round 1: 1 second
round 2: 1 second of previous round + 1 + 2 = 4 seconds
round 3: 4 seconds of previous rounds + 1 + 2 + 4 = 11 seconds
...
round 9: 502 seconds of the previous rounds + 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 seconds = 1013 seconds, about 17 minutes
- Should terminationGracePeriodSeconds be set to 30 minutes?
- Should retry 10 be much lower, something like retry 8, so that the script ends with "retry 8/8 exited 1, no more retries left"?
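To compare retry counts against the 180-second grace period, the worst-case sleep time can be computed directly. This is a sketch under two assumptions that are not confirmed by the text above: the wait doubles after every failed attempt (1, 2, 4, ... seconds), and no sleep follows the final attempt. Each round's wait is counted once, and the run time of the retried command itself is ignored. total_retry_wait is a hypothetical name.

```shell
#!/usr/bin/env bash
# Worst-case total sleep of one "retry N" call.
# Assumptions: the wait starts at 1 second, doubles after each failed
# attempt, and the last attempt is not followed by a sleep.
total_retry_wait() {
  local retries=$1
  local attempt=1 delay=1 total=0
  while [ "$attempt" -lt "$retries" ]; do
    total=$((total + delay))
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
  echo "$total"
}

for n in 8 10; do
  echo "retry $n: $(total_retry_wait "$n") seconds per call"
done
# retry 8: 127 seconds per call
# retry 10: 511 seconds per call
```

Under these assumptions, two retry 10 calls could sleep for up to 1022 seconds (about 17 minutes) and two retry 8 calls for up to 254 seconds, both above the 180-second default.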
What has been done
- After some debugging and trying to understand the code, I ended up cleaning it up a little and running shellcheck on it. I tried not to rewrite all of it.
- WONT: build a retry loop to get the NODE_ID.
  I wanted to know whether this should use retry 3 or just error_exit "failed to retrieve node ID". After the cleanup, it looks like this was not needed.
- Use spaces instead of \t to keep the script readable inside the configmap.
- Lower retry from 10 to 8.
PoC Result
Added some debug information to prove that the script is working. I will add that debugging the bash script without 'set -x' is not fun.
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving nodes", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "resp_body: /tmp/tmp.k6cZwbNtph", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "NODE_ID: h3WUy....aTV9qjl7w", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "success to retrieve node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving shutdown request", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "check shutdown response", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "initiating node shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "waiting for node shutdown to complete", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "delaying termination for 50 seconds", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
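Debug lines in the format above could be produced by a small helper along these lines. This is an illustrative sketch only: the script's real log function is not shown here, and PRE_STOP_DEBUG is a hypothetical variable introduced for this example. The message is not JSON-escaped, so it must not contain double quotes.

```shell
#!/usr/bin/env bash
# Sketch of a JSON line logger matching the output format shown above.
log() {
  local timestamp
  timestamp=$(date -u +%Y-%m-%dT%H:%M:%S+00:00)
  printf '{"@timestamp": "%s", "message": "%s", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}\n' \
    "$timestamp" "$1"
}

# Emit a message only when the hypothetical PRE_STOP_DEBUG variable is set,
# so the extra lines can stay in the script without adding noise by default.
debug() {
  if [ -n "${PRE_STOP_DEBUG:-}" ]; then
    log "$1"
  fi
}
```

A call such as debug "NODE_ID: ${NODE_ID}" then prints nothing in normal operation and one JSON line when the variable is set.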
What has not been done
...
- prepare-fs.sh
- readiness-probe-script.sh
- suspend.sh