terraform-aws-gitlab-runner
Old runners not cleaned up after ASG update
If any change triggers a user_data update and an ASG recreation, the docker-machine instances are not terminated; only the manager is.
Configuration:
module "runner" {
source = "npalm/gitlab-runner/aws"
version = "4.14.0"
aws_region = var.default_region
environment = "runner"
vpc_id = data.aws_vpc.base-vpc.id
subnet_ids_gitlab_runner = [data.aws_subnet.base-vpc-private-eu-west-1a.id, data.aws_subnet.base-vpc-private-eu-west-1b.id]
subnet_id_runners = data.aws_subnet.base-vpc-private-eu-west-1a.id
runners_name = "runner"
runners_gitlab_url = "https://gitlab.com"
runners_request_spot_instance = false
enable_runner_ssm_access = true
runners_idle_count = 1
runners_off_peak_periods = "[\"* * 0-7,19-23 * * mon-fri *\", \"* * * * * sat,sun *\"]"
runners_off_peak_idle_count = 1
runners_off_peak_idle_time = 60
cache_shared = true
cache_bucket = {
create = false
policy = aws_iam_policy.runner-cache.arn
bucket = aws_s3_bucket.runner-cache.arn
}
overrides = {
name_sg = ""
name_runner_agent_instance = "runner-manager"
name_docker_machine_runners = "runner"
}
gitlab_runner_registration_config = {
registration_token = var.gitlab_runner_registration_token
tag_list = "docker"
description = "Runner"
locked_to_project = "false"
run_untagged = "false"
maximum_timeout = "3600"
}
ami_filter = {
name = ["amzn2-ami-hvm-2.0.20200304.0-x86_64-ebs"]
}
runner_ami_filter = {
name = ["ubuntu/images/hvm-ssd/ubuntu-eoan-19.10-amd64-server-20200325"]
}
userdata_pre_install = "sudo yum install amazon-ecr-credential-helper -y"
}
@martinhanzik I think you are right; personally I do not use enable_forced_updates. But since stopping the scaling group / terminating the instances in the scaling group only affects the agent, the docker-machine instances will not be cleared. Any suggestion for a fix?
Maybe ASG lifecycle hooks (https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html) could be used: have a process read the termination notification and perform a clean shutdown of all the docker-machine instances.
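A minimal sketch of what such a hook could look like in Terraform, assuming a dedicated SQS queue for the termination notifications; all names below (queue, role, ASG) are made up for illustration, none of them is created or exported by the module:

resource "aws_sqs_queue" "runner_cleanup" {
  name = "gitlab-runner-cleanup" # hypothetical queue name
}

# Role that allows the Auto Scaling service to publish lifecycle
# notifications to the queue above.
resource "aws_iam_role" "asg_lifecycle" {
  name = "gitlab-runner-asg-lifecycle" # hypothetical role name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "autoscaling.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "asg_lifecycle_sqs" {
  name = "publish-lifecycle-events"
  role = aws_iam_role.asg_lifecycle.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["sqs:SendMessage", "sqs:GetQueueUrl"]
      Resource = aws_sqs_queue.runner_cleanup.arn
    }]
  })
}

# Fire a notification whenever an instance in the runner-manager ASG is about
# to be terminated, e.g. because user_data changed and the ASG was replaced.
resource "aws_autoscaling_lifecycle_hook" "runner_terminating" {
  name                    = "gitlab-runner-terminating"
  autoscaling_group_name  = "runner-manager-asg" # hypothetical ASG name
  lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result          = "CONTINUE"
  heartbeat_timeout       = 300
  notification_target_arn = aws_sqs_queue.runner_cleanup.arn
  role_arn                = aws_iam_role.asg_lifecycle.arn
}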
@npalm updating the version of Terraform closed this issue?
The problem is that the docker machines are managed by the agent running in the ASG. Simply terminating this instance by replacing the ASG leaves those instances unregistered, and they will never be removed. Did you notice what happens to active builds?
This means there will be no events to listen for, but a lifecycle hook + Lambda could potentially clean up the docker-machine leftovers. Still, the problem remains that the machines could be busy processing CI workloads.
I had this problem some days ago while updating the CI infrastructure. My terraform job suddenly died when the ASG was recreated. The lock on the state file was never released so I assume that the build was cancelled.
Let me detail @martinhanzik's and @npalm's solution:
Add an aws_autoscaling_lifecycle_hook (event: autoscaling:EC2_INSTANCE_TERMINATING). The event metadata is forwarded to an SQS queue, which triggers a Lambda function that finds the associated runners (via tags, which have to be present on the auto-scaling group as well) and kills them.
Without the runner agent the runners shouldn't be able to proceed. Maybe they finish the build, but they are not able to upload anything to the GitLab instance, are they?
Should we do this?
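To make the consuming side concrete, here is a hedged sketch assuming the queue and hook from the earlier snippet, plus a cleanup Lambda named gitlab-runner-cleanup that is packaged and deployed separately (its code, which would read the instance ID from the lifecycle event, look up docker-machine instances carrying a matching tag, and terminate them, is not part of this sketch):

# Look up the hypothetical cleanup function; it is assumed to exist already.
data "aws_lambda_function" "runner_cleanup" {
  function_name = "gitlab-runner-cleanup"
}

# Deliver the EC2_INSTANCE_TERMINATING notifications from the queue to the
# Lambda, which can then correlate the terminated agent with its
# docker-machine workers via tags.
resource "aws_lambda_event_source_mapping" "runner_cleanup" {
  event_source_arn = aws_sqs_queue.runner_cleanup.arn # queue from the sketch above
  function_name    = data.aws_lambda_function.runner_cleanup.arn
  batch_size       = 1
}

The correlation tag would also have to be set on the agent's ASG and on the docker-machine instances themselves, as discussed further down in this thread.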
Still, the problem remains that the machines could be busy processing CI workloads
@npalm Just tried it and killed the runner agent while a Java build was running. The jobs on the runner are immediately cancelled, and GitLab shows the job as "runner system failure". The docker container on the runner disappeared, so all jobs on the runner are cancelled.
Thus the only problem is to remove the runner instances as described above.
Yep, the runners are not managed by Terraform. The default examples contain some scripting to cancel spot requests and instances during destroy. I do not have a good idea for a fix.
We are also facing this issue (though we haven't put any time into fixing it yet; everything else is working very well).
The additional piece of information is that the bundled script won't work for us because we have a multi-account strategy, and the box the script runs from is not in the same account as the target nodes. The script doesn't take a profile or a role to assume as an argument, so it seems to be intended only for same-account use.
@DavidGamba Sorry, I didn't get it 100%. Which machines are running in different accounts?
We also have a multi-account strategy, so the CI system runs in a different account than the product it builds. But the complete CI system runs in one account, so it should be possible to kill the runners when the runner agent is killed.
@kayman-mk the machine we run our terraform plan/apply is in a different account from the one the runners are running from. So without a profile or assuming a role we can't kill the runners when the ASG gets rebuilt.
I'm only starting with scaling GitLab in AWS/Terraform, but wouldn't it be possible to add a version tag to the docker-machine runners and delete the instances by tag?
Similar to:
date "aws_instance" "previous" {
filter {
tag = "asg"
values = ["asg-name"]
}
resource "aws_instance" "previous_runners" {
...
instance_state = "Terminated"
}
@strowi This is not possible, as instance_state is read-only. But using tags to identify those machines is a good idea; it needs an additional script, Lambda, or something else, though.
@kayman-mk ah right, thx! I misread the aws_instance-docs about instance_state.
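For the tagging half of that idea, one possible (untested) way to put an identifying tag on the docker-machine workers is the amazonec2-tags option of the docker-machine AWS driver, assuming the module version in use exposes the docker_machine_options variable; the tag key and value below are arbitrary examples:

module "runner" {
  # ... all other arguments as in the configuration at the top of this issue ...

  # Tag every docker-machine worker so a cleanup script or Lambda can find
  # the orphans after the agent is gone. Key/value are hypothetical.
  docker_machine_options = [
    "amazonec2-tags=gitlab-runner-parent,runner",
  ]
}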
Hello guys,
I encountered the bug as well, even when using the example script https://github.com/npalm/terraform-aws-gitlab-runner/blob/develop/bin/cancel-spot-instances.sh
I saw it during destroy, but it can occur while updating the ASG too. There is a small chance that, before your main runner agent gets removed/replaced, it spawns a spot instance; since the cancel-spot.sh script is called before that, it is hard to handle. My current fix would be to add a depends_on for the ASG destruction on the null resource in the example https://github.com/npalm/terraform-aws-gitlab-runner/blob/8b15241b34c87d7b21c7a14fddf7d0b84f750f1b/examples/runner-default/main.tf#L106
But Terraform does not support referring to a module resource with depends_on, unfortunately. In the destroy case, we get some unwanted behavior, like the CloudWatch log group being created and not fully removed since an instance still exists.
@npalm I don't know if there is another way to make your cancel-spot.sh script trigger only after the deletion of the main instance or its ASG, but it would solve many issues while waiting for a clean solution.
Regards
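For what it's worth, since Terraform 0.13 a module block accepts depends_on, so one untested workaround along those lines is to make the whole runner module depend on a null_resource whose destroy-time provisioner does the cleanup: on destroy, the module (including the agent ASG) is removed before the null_resource, so the script only runs after the agent is gone. The tag key/value and the use of the ambient AWS CLI credentials below are assumptions:

resource "null_resource" "cleanup_orphans" {
  # Destroy-time cleanup: terminate any docker-machine workers that still
  # carry the (hypothetical) gitlab-runner-parent tag.
  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      ids=$(aws ec2 describe-instances \
        --filters "Name=tag:gitlab-runner-parent,Values=runner" \
                  "Name=instance-state-name,Values=running,pending" \
        --query 'Reservations[].Instances[].InstanceId' --output text)
      if [ -n "$ids" ]; then
        aws ec2 terminate-instances --instance-ids $ids
      fi
    EOT
  }
}

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.14.0"

  # ... arguments as in the configuration at the top of this issue ...

  # Terraform 0.13+: the module is destroyed before null_resource.cleanup_orphans,
  # so the destroy provisioner above runs only after the agent ASG is gone.
  depends_on = [null_resource.cleanup_orphans]
}

This only covers terraform destroy; the ASG-replacement case during an apply would still need something like the lifecycle-hook/Lambda approach discussed above.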
I am trying to implement a fully reproducible CI setup with N runners, where each runner is recreated on a daily basis. This is currently the biggest issue blocking complete reproducibility, as every time I recreate a runner, many workers are left orphaned.
@npalm What do you think about implementing a UUID tag for a runner and all its associated workers, so we can easily identify all workers belonging to a runner? This would allow us to execute a cleanup job every time a runner is replaced or destroyed.
The only downside I can see with this approach is that jobs inevitably fail once the runner shuts down, although this is already happening.
The next step could be figuring out a way to stop the runner from grabbing new jobs, waiting for current jobs to finish/fail/whatever, and then turning it off.
If it can help, I pushed #359, which is the result of multiple tests.
It does not fully solve the issue, but note that you have to think about removing the associated key pairs created for each spot instance.
My apologies for the repeated messages on this thread - didn't realize every time I rebased and pushed my branch to my fork with a commit message that linked to this issue that it'd link up here. :)
I've submitted https://github.com/npalm/terraform-aws-gitlab-runner/pull/392, which uses a Lambda function to terminate 'orphaned' instances when the instance in the ASG is terminated/refreshed. I've been testing this for a bit with a couple of different runner deployments with success and welcome any feedback. Like @dsalaza4, I'd like my GitLab runner deployments to be ephemeral and refreshed regularly, and this change, along with the auto-refresh setting (https://github.com/npalm/terraform-aws-gitlab-runner/pull/385), seems to be doing that. However, the caveat still exists that a runner instance can be terminated while it's actively running a job.
This ☝️ feature (#392) was merged and included in the 4.40.0 release: https://github.com/npalm/terraform-aws-gitlab-runner/releases/tag/4.40.0
@martinhanzik Should be fixed and the issue can be closed, can't it?
I'm unable to say so myself, as I'm not using the module at this time, but I will ask my colleagues if it's working OK for them.
Seems to work fine for them, thanks for the new feature! Closing.