terraform-aws-gitlab-runner

Old runners not cleaned up after ASG update

martinhanzik opened this issue on Apr 14, 2020 · 20 comments

If any change triggers a user_data update and an ASG recreation, the docker-machine instances are not terminated; only the manager is.

Configuration:

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.14.0"

  aws_region  = var.default_region
  environment = "runner"

  vpc_id                   = data.aws_vpc.base-vpc.id
  subnet_ids_gitlab_runner = [data.aws_subnet.base-vpc-private-eu-west-1a.id, data.aws_subnet.base-vpc-private-eu-west-1b.id]
  subnet_id_runners        = data.aws_subnet.base-vpc-private-eu-west-1a.id

  runners_name       = "runner"
  runners_gitlab_url = "https://gitlab.com"
  runners_request_spot_instance = false

  enable_runner_ssm_access = true

  runners_idle_count = 1

  runners_off_peak_periods = "[\"* * 0-7,19-23 * * mon-fri *\", \"* * * * * sat,sun *\"]"
  runners_off_peak_idle_count = 1
  runners_off_peak_idle_time  = 60

  cache_shared = true

  cache_bucket = {
    create = false
    policy = aws_iam_policy.runner-cache.arn
    bucket = aws_s3_bucket.runner-cache.arn
  }

  overrides = {
    name_sg = ""
    name_runner_agent_instance  = "runner-manager"
    name_docker_machine_runners = "runner"
  }

  gitlab_runner_registration_config = {
    registration_token = var.gitlab_runner_registration_token
    tag_list           = "docker"
    description        = "Runner"
    locked_to_project  = "false"
    run_untagged       = "false"
    maximum_timeout    = "3600"
  }

  ami_filter = {
    name = ["amzn2-ami-hvm-2.0.20200304.0-x86_64-ebs"]
  }

  runner_ami_filter = {
    name = ["ubuntu/images/hvm-ssd/ubuntu-eoan-19.10-amd64-server-20200325"]
  }

  userdata_pre_install     = "sudo yum install amazon-ecr-credential-helper -y"
}

martinhanzik avatar Apr 14 '20 16:04 martinhanzik

@martinhanzik I think you are right. Personally I do not use enable_forced_updates, but since that stops the scaling group / terminates the instances in the scaling group, only the agent is terminated. The docker-machine instances will not be cleaned up. Any suggestion for a fix?

npalm avatar Apr 14 '20 20:04 npalm

Maybe ASG events like https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html could be used - have a process trying to read the termination notification and perform a clean shutdown of all the docker-machine instances.
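
For illustration only, a minimal Terraform sketch of such a hook; the resource and the ASG name are illustrative and not part of the module:

resource "aws_autoscaling_lifecycle_hook" "runner_agent_terminating" {
  name                   = "runner-agent-terminating"
  autoscaling_group_name = "runner-asg" # illustrative; would have to be the agent's ASG created by the module
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result         = "CONTINUE"
  heartbeat_timeout      = 300
}

# A listener (a process on the agent itself, or a Lambda subscribed to the hook's
# notification) would then remove the docker-machine workers before the agent
# instance goes away.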

martinhanzik avatar Apr 16 '20 16:04 martinhanzik

@npalm Did updating the version of Terraform close this issue?

patramsey avatar May 27 '20 17:05 patramsey

The problem is that the docker machines are managed by the agent running in the ASG. Simply terminating this instance by replacing the ASG leaves those instances orphaned, and they will never be removed. Did you notice what happens to active builds?

This means there will be no events to listen for, but a lifecycle hook + Lambda could potentially clean up the docker-machine leftovers. Still, the problem remains that the machines could be busy processing CI workloads.

npalm avatar May 27 '20 20:05 npalm

I had this problem some days ago while updating the CI infrastructure. My terraform job suddenly died when the ASG was recreated. The lock on the state file was never released so I assume that the build was cancelled.

kayman-mk avatar Aug 13 '20 08:08 kayman-mk

Let me detail @martinhanzik's and @npalm's solution:

Add an aws_autoscaling_lifecycle_hook (event: autoscaling:EC2_INSTANCE_TERMINATING). The event metadata is forwarded into an SQS queue, which triggers a Lambda function that finds the associated runners (via tags, which have to be present on the Auto Scaling group as well) and kills them.
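
A rough Terraform sketch of that wiring, assuming a cleanup Lambda (called terminate_workers here, purely hypothetical) is defined elsewhere and terminates the instances carrying the shared tag; the queue, role, and ASG names are illustrative as well:

resource "aws_sqs_queue" "runner_termination" {
  name = "runner-agent-termination"
}

resource "aws_autoscaling_lifecycle_hook" "terminating" {
  name                    = "runner-agent-terminating"
  autoscaling_group_name  = "runner-asg" # illustrative; must be the agent's ASG
  lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result          = "CONTINUE"
  heartbeat_timeout       = 300
  notification_target_arn = aws_sqs_queue.runner_termination.arn
  role_arn                = aws_iam_role.asg_notify.arn # role allowing the ASG to publish to the queue
}

# The queue invokes the (hypothetical) cleanup Lambda, which describes instances
# by the shared tag and calls ec2:TerminateInstances on them.
resource "aws_lambda_event_source_mapping" "terminate_workers" {
  event_source_arn = aws_sqs_queue.runner_termination.arn
  function_name    = aws_lambda_function.terminate_workers.arn
}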

Without the runner agent the runners shouldn't be able to proceed. Maybe they finish the build, but they are not able to upload anything to the GitLab instance, are they?

Should we do this?

kayman-mk avatar Aug 13 '20 09:08 kayman-mk

Still, the problem remains that the machines could be busy processing CI workloads.

@npalm Just tried it and killed the runner agent while a Java build was running. The jobs on the runner are immediately cancelled and GitLab shows the job as "runner system failure". The Docker container on the runner disappeared. So all jobs on the runner are cancelled.

Thus the only problem is to remove the runner instances as described above.

kayman-mk avatar Aug 13 '20 11:08 kayman-mk

Yep, the runners are not managed by Terraform. The default example contains some scripting to cancel spot requests and instances during destroy. I do not have a good idea for a fix.

npalm avatar Aug 13 '20 16:08 npalm

We are also facing this issue (though we haven't put any time into fixing it yet; everything else is working very well).

The additional piece of information is that the bundled script won't work for us because we have a multi-account strategy and the box the script runs from is not in the same account as the target nodes. The script doesn't take a profile or a role to assume as an argument, so it seems to be intended only for same-account use.

DavidGamba avatar Aug 13 '20 18:08 DavidGamba

@DavidGamba Sorry, I didn't get it 100%. Which machines are running in different accounts?

We also have a multi-account strategy: the CI system runs in a different account than the product it builds, but the complete CI system runs in one account. So it should be possible to kill the runners when the runner agent is killed.

kayman-mk avatar Aug 13 '20 20:08 kayman-mk

@kayman-mk The machine we run our terraform plan/apply from is in a different account than the one the runners run in. So without a profile or an assumed role we can't kill the runners when the ASG gets rebuilt.

DavidGamba avatar Aug 13 '20 22:08 DavidGamba

I'm only just starting with scaling GitLab in AWS/Terraform, but wouldn't it be possible to add a version tag to the docker-machine runners and delete the instances by tag?

Similar to:

date "aws_instance" "previous" {

  filter {
    tag   = "asg"
    values = ["asg-name"]
  }

resource "aws_instance" "previous_runners" {
  ...
  instance_state = "Terminated"
}

strowi avatar Sep 29 '20 12:09 strowi

@strowi This is not possible, as instance_state is read-only. But using tags to identify those machines is a good idea; it just needs an additional script, Lambda, or something else.

kayman-mk avatar Sep 29 '20 17:09 kayman-mk

@kayman-mk ah right, thx! I misread the aws_instance-docs about instance_state.

strowi avatar Oct 01 '20 14:10 strowi

Hello guys,

I also encountered the bug, even when using the example script https://github.com/npalm/terraform-aws-gitlab-runner/blob/develop/bin/cancel-spot-instances.sh

I saw it during destroy, but it can occur while updating the ASG too. There is a small chance that, before the main runner agent gets removed/replaced, it spawns a spot instance; since the cancel-spot.sh script is called before that, this is hard to handle. My current fix would be to make the ASG destruction depend on the null resource in the example: https://github.com/npalm/terraform-aws-gitlab-runner/blob/8b15241b34c87d7b21c7a14fddf7d0b84f750f1b/examples/runner-default/main.tf#L106

But Terraform unfortunately does not support referring to a resource inside a module with depends_on. In the destroy case, we get some unwanted behaviour, like the CloudWatch log group being created again and not fully removed, since an instance still exists.

@npalm I don't know if there is another way to make the cancel-spot.sh script trigger only after the deletion of the main instance or its ASG, but it would solve many issues while waiting for a clean solution.

Regards

florian0410 avatar Mar 30 '21 18:03 florian0410

I am trying to implement a fully reproducible CI with N runners where each runner is recreated on a daily basis. This is currently the biggest issue preventing me from achieving complete reproducibility, as every time I recreate a runner, many workers are left orphaned.

dsalaza4 avatar Aug 18 '21 18:08 dsalaza4

@npalm What do you think about implementing a uuid tag for a runner and all its associated workers, so we can easily identify all workers belonging to a runner? This would allow us to execute a cleanup job every time a runner is either replaced or destroyed.
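
For illustration, a sketch of how such a tag could be used at cleanup time; the tag name runner-uuid and the variable are purely hypothetical, and the module would have to propagate the tag to the docker-machine workers:

data "aws_instances" "workers_of_runner" {
  instance_state_names = ["pending", "running"]

  filter {
    name   = "tag:runner-uuid"  # hypothetical tag written on each worker
    values = [var.runner_uuid]  # hypothetical variable holding the runner's uuid
  }
}

# data.aws_instances.workers_of_runner.ids could then be handed to a cleanup
# step (script or Lambda) that terminates the orphaned workers.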

The only downside I can see with this approach is that jobs inevitably fail once the runner shuts down, although this is already happening.

The next step could be figuring out a way to stop the runner from grabbing new jobs, waiting for current jobs to finish/fail/whatever and then turn it off.

dsalaza4 avatar Aug 23 '21 17:08 dsalaza4

If it can help, I pushed #359, which is the result of multiple tests.

It does not fully solve the issue, but note that you also have to think about removing the associated key pairs created for each spot instance.

florian0410 avatar Aug 26 '21 15:08 florian0410

My apologies for the repeated messages on this thread - I didn't realize that every time I rebased and pushed my branch to my fork with a commit message linking to this issue, it would show up here. :)

I've submitted https://github.com/npalm/terraform-aws-gitlab-runner/pull/392, which uses a Lambda function to terminate 'orphaned' instances when the instance in the ASG is terminated/refreshed. I've been testing this for a bit with a couple of different runner deployments, with success, and welcome any feedback. Like @dsalaza4, I'd like my GitLab runner deployments to be ephemeral and refreshed regularly, and this change, along with the auto-refresh setting (https://github.com/npalm/terraform-aws-gitlab-runner/pull/385), seems to be doing that. However, the caveat still exists that a runner instance can be terminated while it's actively running a job.

joshbeard avatar Oct 15 '21 20:10 joshbeard

This ☝️ feature (#392) was merged and included in the 4.40.0 release: https://github.com/npalm/terraform-aws-gitlab-runner/releases/tag/4.40.0

joshbeard avatar Mar 02 '22 03:03 joshbeard

@martinhanzik This should be fixed and the issue can be closed, can't it?

kayman-mk avatar Sep 01 '22 08:09 kayman-mk

I'm unable to say so myself, as I'm not using the module at this time, but I will ask my colleagues if it's working OK for them.

martinhanzik avatar Sep 01 '22 09:09 martinhanzik

Seems to work fine for them, thanks for the new feature! Closing.

martinhanzik avatar Sep 01 '22 11:09 martinhanzik