terraform-aws-gitlab-runner

Option to gracefully terminate runner

Open long-wan-ep opened this issue 2 years ago • 10 comments

Describe the solution you'd like

When the terminate-agent-hook runs, workers are terminated and running jobs are interrupted. We would like an option to gracefully terminate runners, so that running jobs are given a chance to complete.

Describe alternatives you've considered

We previously disabled the creation of terminate-agent-hook and used our own hook + lambda to handle graceful termination, but terminate-agent-hook was made mandatory, so we can no longer do this.

Suggest a solution

We suggest adding an option to gracefully terminate runners in the terminate-agent-hook lambda. We can contribute our graceful termination logic to terminate-agent-hook if it works for you. Here is a brief summary of our solution (a sketch of the Lambda side follows the list):

  1. Configure the gitlab-runner service to stop gracefully, e.g.:
cat <<EOF > /etc/systemd/system/gitlab-runner.service.d/kill.conf
[Service]
# Time to wait before stopping the service in seconds
TimeoutStopSec=600
KillSignal=SIGQUIT
EOF
  2. Send a message to an SQS queue when a runner's terminate lifecycle hook triggers
  3. Lambda triggers from the SQS message
  4. Lambda sends a command to the runner EC2 instance to stop the gitlab-runner service, via an AWS SSM document
  5. Lambda waits for the SSM document to finish executing
     a. If the gitlab-runner service stopped successfully, the lambda completes the lifecycle hook
     b. If the gitlab-runner service has not stopped, an error is thrown and the SQS message goes back to the queue to be retried in the next run
  6. Lambda terminates any workers still running
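
Below is a rough, illustrative sketch of the Lambda side of steps 3–6 (an SQS-triggered handler using boto3 and the AWS-RunShellScript SSM document). The names and structure here are assumptions for illustration, not our actual implementation:

import json
import time

import boto3

ssm = boto3.client("ssm")
autoscaling = boto3.client("autoscaling")

def handler(event, context):
    # One SQS record per ASG lifecycle notification
    for record in event["Records"]:
        message = json.loads(record["body"])
        # Ignore test notifications and anything that isn't a terminating hook
        if message.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
            continue

        instance_id = message["EC2InstanceId"]

        # Ask the instance to stop the gitlab-runner service; the systemd drop-in above
        # (SIGQUIT + TimeoutStopSec) lets running jobs finish before the service exits.
        command_id = ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName="AWS-RunShellScript",
            Parameters={"commands": ["systemctl stop gitlab-runner"]},
        )["Command"]["CommandId"]

        # Poll until the command finishes. Raising here leaves the message on the queue,
        # so the next run retries (step 5b). The loop is bounded by the Lambda timeout.
        while True:
            time.sleep(10)
            invocation = ssm.get_command_invocation(
                CommandId=command_id, InstanceId=instance_id
            )
            status = invocation["Status"]
            if status == "Success":
                break
            if status in ("Cancelled", "TimedOut", "Failed"):
                raise RuntimeError(
                    f"gitlab-runner did not stop cleanly on {instance_id}: {status}"
                )

        # Service stopped cleanly: complete the lifecycle hook so the instance terminates
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=message["LifecycleHookName"],
            AutoScalingGroupName=message["AutoScalingGroupName"],
            LifecycleActionResult="CONTINUE",
            InstanceId=instance_id,
        )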

long-wan-ep avatar Nov 06 '23 20:11 long-wan-ep

Yes and no, I think. The Lambda is executed in case the GitLab Runner (which started the worker) dies. In this case the workers can continue with the current job, but they are not able to upload the logs, artifacts, ... to GitLab, as this needs the GitLab Runner, which is no longer there.

As the job might access external resources, ... it makes sense to wait until it is finished and only then kill the worker.

kayman-mk avatar Nov 09 '23 09:11 kayman-mk

Some thoughts that popped up while checking the procedure you described above:

  • I think the shutdown timeout of 10 minutes doesn't change anything, because the GitLab Runner has already been shut down and can no longer be contacted by the workers.
  • there is no lifecycle hook for the Runner. But I guess you mean the worker instance, right? ("2. Send a message to an SQS queue when a runner's terminate lifecycle hook triggers")
  • having some doubts, as it complicates the whole setup and might introduce problems. But at the moment I can't think of an easier solution, to be honest.

If you could share your implementation, it would be wonderful.

kayman-mk avatar Nov 09 '23 09:11 kayman-mk

You're right, this wouldn't help the situation where the runner dies. We were intending this for situations where the runner is modified and requires a refresh.

Here is a slimmed-down version of our implementation; I added it to the examples folder: https://github.com/long-wan-ep/terraform-aws-gitlab-runner/tree/graceful-terminate-example/examples/graceful-terminate.

long-wan-ep avatar Nov 09 '23 20:11 long-wan-ep

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Jan 09 '24 02:01 github-actions[bot]

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot] avatar Jan 24 '24 02:01 github-actions[bot]

Hi @kayman-mk, noticed this issue was auto-closed, could we re-open it?

Does our solution look ok? Or any other ideas we could try?

long-wan-ep avatar Feb 16 '24 22:02 long-wan-ep

Re-read everything ;-) Let's give it a try.

The terminate-agent-hook is used to kill the Workers in case the Runner (named parent in that module) dies. This should of course never happen. Better to wait until all Executors are finished, then shut down the Runner.

Could you please propose a PR? It would be a good idea to make the TimeoutStopSec configurable. GitLab uses 7200s; we typically use 3600s for the job timeout.

kayman-mk avatar Feb 22 '24 10:02 kayman-mk

Graceful shutdown of the Runner: https://gitlab.com/gitlab-com/runbooks/-/blob/258b29f088b2ad2d0ae955488958080f909d6a32/docs/ci-runners/linux/graceful-shutdown.md#:~:text=When%20Graceful%20Shutdown%20is%20initiated,GitLab%20side%2C%20the%20process%20exits.

Runner upgrade: https://gitlab.com/gitlab-cookbooks/cookbook-wrapper-gitlab-runner/-/blob/master/files/default/runner_upgrade.sh

kayman-mk avatar Feb 22 '24 10:02 kayman-mk

It seems that the terminate-agent-hook is good for removing dangling SSH keys, spot requests, ... but not for stopping the Runner itself. I guess ec2_client.terminate_instance(...) simply kills the instance, which is unwanted: the Executors are killed outright and we do not wait until they have finished processing their jobs.
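
To make the contrast concrete, a purely illustrative sketch (not the module's actual code; the SSM-based stop and the function names are assumptions):

import boto3

ec2 = boto3.client("ec2")
ssm = boto3.client("ssm")

def kill_workers(worker_instance_ids):
    # What the hook effectively does today: the workers are terminated immediately,
    # so any jobs still running on them are lost.
    ec2.terminate_instances(InstanceIds=worker_instance_ids)

def drain_then_terminate(runner_instance_id, worker_instance_ids):
    # Graceful variant: stop the gitlab-runner service on the Runner first, so the
    # SIGQUIT/TimeoutStopSec drop-in lets running jobs finish, then remove the workers.
    ssm.send_command(
        InstanceIds=[runner_instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["systemctl stop gitlab-runner"]},
    )
    # ...wait for the command to report Success (see the polling loop sketched earlier)...
    ec2.terminate_instances(InstanceIds=worker_instance_ids)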

kayman-mk avatar Feb 22 '24 10:02 kayman-mk

Sounds good, we'll start working on a PR soon.

long-wan-ep avatar Feb 22 '24 19:02 long-wan-ep

#1117 will resolve this issue.

long-wan-ep avatar Apr 24 '24 23:04 long-wan-ep

My bad, I meant to comment here, but it somehow got lost. Original intent:

Hey @kayman-mk, @long-wan-ep I actually started implementing the proposal discussed in #1067: #1117. This MR still needs some polish, but based on my initial testing it seems to work. It basically makes the Runner Manager a bit smarter and aware of its own desired state, and it acts accordingly.

@long-wan-ep definitely not meant to steal your thunder, but I do think #1117, if working, makes the implementation a bit simpler. Hope you do not mind! ❤️

tmeijn avatar Apr 25 '24 06:04 tmeijn

@tmeijn I don't mind at all, your implementation looks great, thanks for opening the PR.

long-wan-ep avatar Apr 25 '24 16:04 long-wan-ep