terraform-aws-gitlab-runner
Duplicate runners are being created
current version ="4.39.0"
Whatever number I put in runners_idle_count
, number of runners created are double of that. Take a look at this screenshot, runners_idle_count
is 5
here but total 10
runners are being created:

And these extra machines don't have any names; as you can see, their names are blank. Their tags are also empty.
Here's my code:
main.tf
locals {
  name = "${var.prefix}-${terraform.workspace}"
  common_tags = {
    Terraform = "True"
    env       = "${var.environment}-${terraform.workspace}"
    Owner     = var.contact
  }
}

data "aws_security_group" "default" {
  name   = "default"
  vpc_id = module.vpc.vpc_id
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.12.0"

  name = "${local.name}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a"]
  private_subnets = ["10.0.1.0/24"]
  public_subnets  = ["10.0.101.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false

  tags = local.common_tags
}

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.39.0"

  aws_region  = var.region
  environment = var.environment

  vpc_id                   = module.vpc.vpc_id
  subnet_ids_gitlab_runner = module.vpc.private_subnets
  subnet_id_runners        = element(module.vpc.private_subnets, 0)

  overrides = {
    name_sg                     = "Gitlab-runner-autoscale-sg"
    name_runner_agent_instance  = "Gitlab-Runner-Agent"
    name_docker_machine_runners = "Gitlab-docker-machine-runner"
  }

  enable_runner_ssm_access         = true
  gitlab_runner_security_group_ids = [data.aws_security_group.default.id]

  # docker_machine_download_url = "https://gitlab-docker-machine-downloads.s3.amazonaws.com/v0.16.2-gitlab.2/docker-machine"
  docker_machine_spot_price_bid = "0.04700"
  docker_machine_instance_type  = "m5.large"

  # runners_executor = "docker"
  instance_type      = "t3.micro"
  runners_name       = "not-prefix"
  runners_gitlab_url = "https://gitlab.com"

  # runner_ami_filter = {
  #   name = ["amzn2-ami-hvm-2.*-x86_64-ebs"]
  # }
  # runner_ami_owners = ["amazon"]
  # runners_limit = 0

  runners_idle_time  = 90
  runners_idle_count = 5

  gitlab_runner_registration_config = {
    registration_token = var.registration_token
    tag_list           = "docker, autoscale"
    description        = "gitlab runner autoscale fleet"
    locked_to_project  = "false"
    run_untagged       = "true"
    maximum_timeout    = "3600"
  }

  tags = merge(
    local.common_tags,
    tomap({
      "tf-aws-gitlab-runner:example"           = "runner-default"
      "tf-aws-gitlab-runner:instancelifecycle" = "spot:yes"
    })
  )

  runners_privileged         = "true"
  runners_additional_volumes = ["/certs/client"]

  runners_volumes_tmpfs = [
    {
      volume  = "/var/opt/cache",
      options = "rw,noexec"
    }
  ]

  runners_services_volumes_tmpfs = [
    {
      volume  = "/var/lib/mysql",
      options = "rw,noexec"
    }
  ]

  cache_bucket_prefix = local.name
}

resource "null_resource" "cancel_spot_requests" {
  # Cancel active and open spot requests, terminate instances
  triggers = {
    environment = var.environment
  }

  provisioner "local-exec" {
    when    = destroy
    command = "bin/cancel-spot-instances.sh ${self.triggers.environment}"
  }
}
Thank You.
I have seen these machines without a name in earlier versions of the module too, but had no time to dig into it.
I did not have the issue, but most likely it is related to the GitLab Runner agent or docker-machine. In the latest release I have also updated the default docker-machine version to the latest patch level.
When you enable SSM you can access the agent EC2 instance and run docker-machine ls. I have only seen this problem years ago, when we were running on dedicated (on-prem) VMs and ran out of IP addresses.
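For reference, a minimal sketch of that check, assuming the AWS CLI with the Session Manager plugin is available and using a placeholder instance ID for the agent:
# Open an SSM session to the runner agent instance (placeholder ID).
aws ssm start-session --target i-0123456789abcdef0
# Inside the session: the agent usually runs gitlab-runner as root, so the
# docker-machine state lives under root. Switch user before listing machines.
sudo su -
docker-machine ls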
I have updated my code from 4.39.0 to 4.41.0. But now the problem is that no runners are being created, and no runner is visible under available runners in the CI/CD settings of the project whose token I am using.

See the image. I tried three times (destroying everything and running terraform apply again) with runners_idle_count = 4, but no instances are created.
I have also added enable_runner_ssm_access = true, and now if I SSM into the runner agent and run docker-machine ls, it shows:
sh-4.2$ docker-machine ls
sh: docker-machine: command not found
Also, after terraform apply a builds folder containing some Lambda-related zip file is now generated in the folder where the Terraform files are.
If I change back to 4.39.0 it works, but duplicate runners are being created.
Here's my code:
main.tf
locals {
  name = "${var.prefix}-${terraform.workspace}"
  common_tags = {
    Terraform = "True"
    env       = "${var.environment}-${terraform.workspace}"
    Owner     = var.contact
  }
}

data "aws_security_group" "default" {
  name   = "default"
  vpc_id = module.vpc.vpc_id
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.12.0"

  name = "${local.name}-vpc"
  cidr = "10.97.224.0/20"

  azs             = ["eu-west-1a"]
  private_subnets = ["10.97.224.0/21"]
  public_subnets  = ["10.97.232.0/21"]

  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false

  tags = local.common_tags
}

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.41.0"

  aws_region  = var.region
  environment = var.environment

  vpc_id                   = module.vpc.vpc_id
  subnet_ids_gitlab_runner = module.vpc.private_subnets
  subnet_id_runners        = element(module.vpc.private_subnets, 0)

  overrides = {
    name_sg                     = "Gitlab-runner-autoscale-sg"
    name_runner_agent_instance  = "Gitlab-Runner-Agent"
    name_docker_machine_runners = "Gitlab-docker-machine-runner"
  }

  enable_runner_ssm_access         = true
  gitlab_runner_security_group_ids = [data.aws_security_group.default.id]

  # docker_machine_download_url = "https://gitlab-docker-machine-downloads.s3.amazonaws.com/v0.16.2-gitlab.2/docker-machine"
  docker_machine_spot_price_bid    = "0.04700"
  docker_machine_instance_type     = "m5.large"
  enable_docker_machine_ssm_access = true

  # runners_executor = "docker"
  instance_type      = "t3.micro"
  runners_name       = var.runner_name
  runners_gitlab_url = "https://gitlab.com"

  # runner_ami_filter = {
  #   name = ["amzn2-ami-hvm-2.*-x86_64-ebs"]
  # }
  # runner_ami_owners = ["amazon"]
  # runners_limit = 0

  runners_idle_time  = 2700
  runners_idle_count = 4

  gitlab_runner_registration_config = {
    registration_token = var.registration_token
    tag_list           = "docker, autoscale"
    description        = "gitlab runner autoscale fleet"
    locked_to_project  = "false"
    run_untagged       = "true"
    maximum_timeout    = "3600"
  }

  tags = merge(
    local.common_tags,
    tomap({
      "tf-aws-gitlab-runner:example"           = "runner-default"
      "tf-aws-gitlab-runner:instancelifecycle" = "spot:yes"
    })
  )

  runners_privileged         = "true"
  runners_additional_volumes = ["/certs/client"]

  runners_volumes_tmpfs = [
    {
      volume  = "/var/opt/cache",
      options = "rw,noexec"
    }
  ]

  runners_services_volumes_tmpfs = [
    {
      volume  = "/var/lib/mysql",
      options = "rw,noexec"
    }
  ]

  cache_bucket_prefix = local.name
}

resource "null_resource" "cancel_spot_requests" {
  # Cancel active and open spot requests, terminate instances
  triggers = {
    environment = var.environment
  }

  provisioner "local-exec" {
    when    = destroy
    command = "bin/cancel-spot-instances.sh ${self.triggers.environment}"
  }
}
Thank you.
@imhardikj I had the same issue. For me it works after upgrading to version 4.41.1 and deleting tribe-gitlab-runners-runner-token from AWS Systems Manager.
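For anyone trying the same clean-up, a minimal sketch with the AWS CLI, assuming the parameter name mentioned above (adjust it to your own environment prefix):
# Remove the cached registration token so the agent registers again on its next start.
aws ssm delete-parameter --name "tribe-gitlab-runners-runner-token"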
Hey, everyone! I'm facing the same issue, was anyone able to solve it? Thank you very much!
I have seen these no-name machines too and haven't found the reason. All tags are missing.
By the way: idle_count = 5 means that if 3 machines are processing jobs, there will be 5 idle instances in addition, summing up to 8 in total. So in your case it could be that 5 executors are processing jobs; as there are then no idle instances, the Runner creates 5 more as requested by your configuration.
Running on version 4.41.1.
I've always seen these machines without tags, but now that we've switched to scheduled auto-scaling, these machines (spot instances) stayed around even on the weekend. The idle count during the weekend should be 0, and no jobs were being processed during that time.
{
  idle_count = 0
  idle_time  = 60
  periods    = ["* * * * * sat,sun *"]
  timezone   = "Europe/Berlin"
}
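For context, a minimal sketch of how such a schedule is passed to the module, assuming the 4.x variable name runners_machine_autoscaling (check the exact variable name for your module version):
module "runner" {
  # ... other arguments as above ...

  # Scale the idle fleet down to 0 on weekends.
  runners_machine_autoscaling = [
    {
      periods    = ["* * * * * sat,sun *"]
      idle_count = 0
      idle_time  = 60
      timezone   = "Europe/Berlin"
    }
  ]
}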
Even after removing the schedule and just setting runners_idle_count = 1 and runners_idle_time = 600, I still have one machine without tags running.
If the tags are missing, the instances run forever, I suppose. Even our Lambda function does not kill them, as the tags are needed to identify them.
We should really tackle this problem.
Yes, that's what we're seeing as well. I tried to kill those machines manually (both by terminating them and by cancelling the spot request), but that resulted in some strange behaviour: sometimes I even had to kill the agent, or otherwise none of the machines would get a new job.
Killing the no-name instances always works for me.
Maybe we can track it down to a specific version of the module?
Killing the no-name instances always works for me.
Maybe it's something different I'm experiencing when killing them.
Maybe we can track it down to a specific version of the module?
Unfortunately, I can't help there currently. I've seen those instances for a while, but I only know the version we're using now (4.41.1) and the 4.39.0 the OP mentioned.
But since I'm working on something regarding the runners tomorrow, I could try some older versions as well, if you can give me a rough idea which version to start with.
Today everything was fine. I have just installed the versions 4.31.0, 4.35.0 and 4.41.1. I guess it is a version in between.
I just tried all 3 versions and in every version I get this issue.
All 3 versions have been running in parallel for a week now. No problems so far. The runners are configured to run in all 3 AZs (eu-central-1) with up to 13 machines per AZ.
I had the problems before, running 4.41.1 in 3 AZs with up to 20 instances per AZ.
With autoscaling using different schedules that should remove all machines during the weekend or at night, they accumulated. Just using the global idle_count variable, it's 1 instance that doesn't go away. But I couldn't see any difference between the versions. Sometimes I had the tagless machine immediately, sometimes a bit later. It seems kind of random.
Still not fixed. I just killed 20 machines. My AWS console looked like the thread creator's.
I checked the logs on the Runners and the Executors but didn't find anything suspicious.
I believe this isn't a problem with this Terraform module but rather with docker-machine. Docker-machine has officially been discontinued and GitLab created a fork that they maintain themselves. I feel like opening an issue there would help more... We're currently not experiencing it, but we don't have many machines running.
Yeah, I think so too. This issue should be opened with the GitLab Docker Machine project.
Looks like the chances are good that they will deal with it. According to the project page:
"How does this change help reduce cost of usage?"
https://gitlab.com/gitlab-org/ci-cd/docker-machine
@kayman-mk do you know if this issue was ever opened with the GitLab Docker Machine project?
@npalm when I run docker-machine ls, here's what I get:
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
runner-replac-cicd-review-1665510525-681c176c - amazonec2 Error Unknown MissingParameter: The request must contain the parameter InstanceId
status code: 400, request id: 62087932-85ec-4775-930e-b8db6af9504a
runner-xue8tdx1-cicd-review-1665510528-7e206979 - amazonec2 Running tcp://192.168.92.155:2376 v20.10.18
Like @AlexEndris, killing these machines manually results in very weird behavior - it pretty much borks everything and I have to destroy and re-apply.
No clue, I have to check but time is currently very limited. Are those machines also created after you create a fresh deployment?
Yeah, I have also been seeing these *replac* machines for some weeks now. It looks like they appear when the agent is started. As this does not happen very often, I kill the machines manually.
I think it has something to do with the initialization of docker-machine on the agent. replac is part of the token that is stored in config.toml (__REPLACED_BY_USER_DATA__) before the initialization script inserts the correct token from SSM. Maybe it would be helpful to
- replace the token first and then
- initialize the docker machine
At the moment it is just the other way around.
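Roughly, the suggested order would look like this (a sketch only, not the module's actual user-data template; the SSM parameter name is a placeholder):
# 1. Fetch the real registration token from SSM first.
TOKEN=$(aws ssm get-parameter --name "<runner-token-parameter>" --with-decryption \
  --query Parameter.Value --output text)
# 2. Insert it into config.toml, replacing the placeholder.
sed -i "s/__REPLACED_BY_USER_DATA__/${TOKEN}/" /etc/gitlab-runner/config.toml
# 3. Only then start the runner, so docker-machine is initialized with a valid token.
gitlab-runner start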
If the tags are missing, the instances run forever, I suppose. Even our Lambda function does not kill them, as the tags are needed to identify them.
We should really tackle this problem.
@npalm Yes, they are still getting created, so I ended up creating a Lambda on a schedule to kill these using the KeyName, which fortunately is consistently runner-replac-cicd*.
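For anyone without a Lambda handy, a rough equivalent with the AWS CLI, assuming the same key-name pattern (review the matched instances before terminating anything):
# Find running instances whose key pair matches the dangling-machine pattern
# and terminate them; xargs -r skips the call if nothing matches.
aws ec2 describe-instances \
  --filters "Name=key-name,Values=runner-replac-cicd*" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" --output text \
  | xargs -r aws ec2 terminate-instances --instance-ids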
@kayman-mk Is it possible that your idea to switch the initialization order could break my workaround? If you choose to go down this route, maybe you could tag the instances with one identifier pre-initialization and another after a successful initialization so that a lambda like mine would still have something to use to determine which instances to kill.
Fetching the GitLab token too late in the initialization process creates dangling docker machines as described here. It only happens if idle_count is greater than 0.
@mbuotidem #574 fixes the problem, so no special Lambdas etc. are needed to kill the machines. They are simply no longer created.
:tada: This issue has been resolved in version 5.5.0 :tada:
The release is available on GitHub release
Your semantic-release bot :package::rocket: