fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty

Open BobVanB opened this issue 8 months ago • 8 comments

The main problem

When upgrading an image with some other plugin, the operator terminates each pod and tries to remove it from the ES cluster.

This assignment can leave NODE_ID empty:

NODE_ID=$(grep "$POD_NAME" "$resp_body" | cut -f 1 -d ' ')

Result

NODE_ID is empty, so the request URL is malformed, with nothing between _nodes and shutdown:

{"@timestamp": "2024-06-11T09:37:05+00:00", "message": "400 http://<cluster>-es-internal-http.<namespace>.svc:9200/_nodes//shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}

The restarting pod fails each time with error_exit "failed to call node shutdown API", and the shutdown is never requested. As a result the same pod is recreated and the whole process starts over from the top.
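A minimal sketch of a guard against this failure mode: check that the lookup returned something before building the shutdown URL. The helper names lookup_node_id and shutdown_url, the sample node IDs and the pod names are hypothetical and only illustrate the idea; they are not taken from the actual script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and sample data are assumptions for illustration.

# Same pipeline as in the issue: prints nothing when the pod name is not found.
lookup_node_id() {
  local pod_name=$1 resp_body=$2
  grep "$pod_name" "$resp_body" | cut -f 1 -d ' '
}

# Print the shutdown URL, or fail instead of producing "_nodes//shutdown".
shutdown_url() {
  local es_url=$1 node_id=$2
  if [ -z "$node_id" ]; then
    echo "failed to retrieve node ID" >&2
    return 1
  fi
  printf '%s/_nodes/%s/shutdown\n' "$es_url" "$node_id"
}

# Demo with a made-up response body in the "id name" layout.
demo_body=$(mktemp)
printf 'abc123 es-default-0\ndef456 es-default-1\n' > "$demo_body"

node_id=$(lookup_node_id "es-default-7" "$demo_body")   # pod not in the list
shutdown_url "http://es.example:9200" "$node_id" || echo "guard triggered"

rm -f "$demo_body"
```

With a guard like this the script stops with a clear message instead of sending a request to a URL that can never succeed.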

What I still want to know

Is the node removed before calling _cat/nodes?

When the node is terminated and pre-stop-hook-script.sh is called, is it possible that the node has already been removed from the _cat/nodes result? Or is it possible that the query ends up on the terminating node and returns no result?

The following request returns the list of nodes, and I wonder whether the node of a terminating pod is already missing from this list of active nodes. I have no basis for this claim yet: I have not confirmed whether NODE_ID is empty because the other nodes in the cluster no longer see the terminated node.

request -X GET "${ES_URL}/_cat/nodes?full_id=true&h=id,name"
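As an illustration of how the two-column response (id, then name) can be searched, here is a small sketch that matches the name column exactly using awk. One reason to consider it: grep "$POD_NAME" is a substring match, so a pod name that is a prefix of another one would return several IDs. The helper name node_id_for_pod and the sample data are hypothetical, and whether such name collisions occur in a given cluster is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the id whose name column equals the pod name exactly.
# Assumes whitespace-separated "id name" lines, as requested with h=id,name.
node_id_for_pod() {
  local pod_name=$1 resp_body=$2
  awk -v name="$pod_name" '$2 == name { print $1; exit }' "$resp_body"
}

# Demo with made-up IDs and pod names.
sample_body=$(mktemp)
printf 'aaa111 es-default-1\nbbb222 es-default-10\n' > "$sample_body"

node_id_for_pod "es-default-1" "$sample_body"    # prints aaa111
node_id_for_pod "es-default-10" "$sample_body"   # prints bbb222

rm -f "$sample_body"
```

An empty result still has to be handled by the caller; the exact match only avoids picking up more than one node.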

Why is terminationGracePeriodSeconds way less than the possible script run time?

The default terminationGracePeriodSeconds is 180 seconds. The script also has 2 retry 10 calls, which wait 2 ** count seconds between attempts. This can add up to a lot of wait time:
round 1: 1 second
round 2: 1 second of previous round + 1 + 2 = 4 seconds
round 3: 4 seconds of previous rounds + 1 + 2 + 4 = 11 seconds
...
round 9: 502 seconds of the previous rounds + 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 seconds = about 17 minutes

Should terminationGracePeriodSeconds be set to 30 minutes?
Should retry 10 be much lower, something like retry 8, ending with "retry 8/8 exited 1, no more retries left"?
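To put numbers on the retry budget, here is a rough sketch that sums the sleeps of a single retry call. It assumes the wait doubles each attempt starting at 1 second and that there is no sleep after the final failed attempt; both are assumptions about the retry helper, and the time spent in the requests themselves is ignored. The function name total_retry_wait is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: worst-case seconds slept by one "retry N" call,
# assuming waits of 1, 2, 4, ... seconds and no sleep after the last attempt.
total_retry_wait() {
  local retries=$1 count=1 wait=1 total=0
  while [ "$count" -lt "$retries" ]; do
    total=$((total + wait))
    wait=$((wait * 2))
    count=$((count + 1))
  done
  echo "$total"
}

total_retry_wait 10   # prints 511
total_retry_wait 8    # prints 127
```

Under these assumptions two retry 10 calls can sleep for up to 1022 seconds, about 17 minutes, which is far beyond the 180 second default. Two retry 8 calls come to 254 seconds, so even the lower retry count does not fit the default grace period if both calls exhaust their retries.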

What has been done

After some debugging and trying to understand the code, I ended up cleaning it up a little and ran shellcheck on it. I tried not to rewrite it all.
WONT: build a retry loop to get the NODE_ID. I wanted to know whether this should use retry 3 or just error_exit "failed to retrieve node ID". After the cleanup, it looks like this was not needed.
Use spaces instead of \t to keep the script readable inside the configmap.
Lower retry from 10 to 8.

PoC Result

Added some debug information to prove that the script is working. I will add that it is not fun to debug the bash script without 'set -x'.

{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving nodes", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "resp_body: /tmp/tmp.k6cZwbNtph", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "NODE_ID: h3WUy....aTV9qjl7w", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "success to retrieve node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving shutdown request", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "check shutdown response", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "initiating node shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "waiting for node shutdown to complete", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "delaying termination for 50 seconds", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}

What has not been done

...

prepare-fs.sh
readiness-probe-script.sh
suspend.sh

Jun 11 '24 11:06 BobVanB

cloud-on-k8s