terraform-aws-gitlab-runner
docker+machine not working, pipelines won't run
Describe the bug
Upgraded from 6.5.2 to 9.2.2 (in fact, I deleted all the resources and created all instances again several times). The runner is connected to GitLab and spot instances start, but the pipeline won't run.
Pipeline: (screenshot)
Instances: (screenshot)
Cloudwatch: (screenshot)
To Reproduce
Steps to reproduce the behavior:
- Install a fresh deployment of the GitLab runners with `docker+machine`
Expected behavior
Spot instances stay up and the pipeline runs.
Additional context
runner.tf:

```hcl
module "gitlab-runner" {
  source  = "cattle-ops/gitlab-runner/aws"
  version = "9.2.2"

  environment = lower(var.environment)

  vpc_id    = module.vpc.vpc_id
  subnet_id = element(module.vpc.private_subnets, 0)

  runner_instance = {
    name                        = var.runner_name
    collect_autoscaling_metrics = ["GroupDesiredCapacity", "GroupInServiceCapacity"]
    ssm_access                  = true
    docker_machine_type         = "t3.xlarge"
  }

  runner_worker = {
    type = "docker+machine"
  }

  runner_networking = {
    allow_incoming_ping_security_group_ids = [data.aws_security_group.default.id]
  }

  runner_gitlab = {
    url                                           = var.gitlab_url
    preregistered_runner_token_ssm_parameter_name = "name"
  }

  runner_cloudwatch = {
    enable         = true
    retention_days = 7
  }

  runner_worker_docker_machine_autoscaling_options = [
    # working 9 to 5 :)
    {
      periods    = ["* * 0-9,17-23 * * mon-fri *", "* * * * * sat,sun *"]
      idle_count = 0
      idle_time  = 3600
      timezone   = var.timezone
    }
  ]

  # runner_worker_docker_services = [
  #   {
  #     name       = "docker:dind"
  #     alias      = "docker"
  #     command    = ["--registry-mirror", "https://mirror.gcr.io"]
  #     entrypoint = ["dockerd-entrypoint.sh"]
  #   }
  # ]

  runner_worker_docker_machine_instance = {
    monitoring = true
  }

  runner_worker_docker_machine_instance_spot = {
    max_price = "on-demand-price"
  }
}
```
vpc.tf:

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = ">= 5.16.0"

  name = "vpc-${var.runner_name}"
  cidr = "10.0.0.0/16"

  azs             = [data.aws_availability_zones.available.names[0]]
  private_subnets = ["10.0.1.0/24"]
  public_subnets  = ["10.0.101.0/24"]

  map_public_ip_on_launch = false
  enable_nat_gateway      = true
  single_nat_gateway      = true

  tags = {
    Environment = var.environment
  }
}

module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = ">= 5.16.0"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    s3 = {
      service = "s3"
      tags    = { Name = "s3-vpc-endpoint" }
    }
  }

  tags = {
    Environment = var.environment
  }
}
```
Use an AMI from this list: https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/blob/main/docs/drivers/aws.md

```hcl
runner_worker_docker_machine_ami_id = "ami-00e7df8df28dfa791"
```
> Use an AMI from this list: https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/blob/main/docs/drivers/aws.md
> `runner_worker_docker_machine_ami_id = "ami-00e7df8df28dfa791"`

Did this fix the problem?
Not for me
9.2.0 worked fine
9.2.2 is working fine here. Let's check what has been changed.
@wouter-toppy Did you resolve it?
Got the same problem on June 23.
I changed the AMI to an "older" Ubuntu image (101643774237/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20250603), but I hit the same problem again this morning, after two weeks without an issue :(
Ran into this issue today. We discovered that the AMI version we were using was EOL and no longer supported by Ubuntu. We changed `runner_worker_docker_machine_ami_filter` to use a supported version (24.04 instead of 20.04), which resolved our issue.
```hcl
module "runner" {
  # ... other configuration ...

  runner_worker_docker_machine_ami_filter = {
    name = ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server*"]
  }
}
```
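As a rough illustration (not the module's or docker-machine's actual lookup code), an AMI name filter like the one above behaves as a glob match over image names, with the newest matching image winning. The image records below are invented for the example:

```python
# Sketch: select the newest AMI whose Name matches a glob pattern,
# roughly mimicking how an AMI name filter resolves to a single image.
import fnmatch

def newest_matching_ami(images, pattern):
    """Return the image dict with the latest CreationDate whose Name matches."""
    matching = [img for img in images if fnmatch.fnmatch(img["Name"], pattern)]
    return max(matching, key=lambda img: img["CreationDate"], default=None)

# Invented image records for illustration only.
images = [
    {"Name": "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20250603",
     "CreationDate": "2025-06-03"},
    {"Name": "ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20250601",
     "CreationDate": "2025-06-01"},
    {"Name": "ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20250701",
     "CreationDate": "2025-07-01"},
]
pattern = "ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server*"
print(newest_matching_ami(images, pattern)["Name"])
# The EOL focal image no longer matches, so only supported noble builds are candidates.
```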
Key log lines from the manager:

```json
{"driver":"amazonec2","level":"error","msg":"E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_focal-backports_main_cnf_Commands-amd64 - open (2: No such file or directory)","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"Traceback (most recent call last):","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":" File \"/usr/lib/cnf-update-db\", line 27, in <module>","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":" col.create(db)","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":" File \"/usr/lib/python3/dist-packages/CommandNotFound/db/creator.py\", line 95, in create","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":" self._fill_commands(con)","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":" File \"/usr/lib/python3/dist-packages/CommandNotFound/db/creator.py\", line 141, in _fill_commands","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":" raise subprocess.CalledProcessError(returncode=sub.returncode, ","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"subprocess.CalledProcessError: Command '/usr/lib/apt/apt-helper cat-file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_focal-backports_main_cnf_Commands-amd64' returned non-zero exit status 100.","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"Fetched 25.7 MB in 7s (3579 kB/s)","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"Reading package lists...","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"E: Problem executing scripts APT::Update::Post-Invoke-Success 'if /usr/bin/test -w /var/lib/command-not-found/ -a -e /usr/lib/cnf-update-db; then /usr/lib/cnf-update-db > /dev/null; fi'","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"E: Sub-process returned an error code","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"error":"exit status 1","fields.time":36612317869,"level":"error","msg":"Machine creation failed","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","time":"2025-07-10T14:18:52Z"}
```
I encountered the same problem yesterday with almost the same logs; updating the module from 7.4 to 9.2 resolved the issue for now. Besides that, there were logs indicating that apt-get was not working properly. For what it's worth, I don't specify any AMI in my TF code; I just updated the module and it worked.
Confirming I have the same issue with module version 7.8.0. Following @decafdev's recommendation solved it for me. Thanks!
I believe I found the issue. Our runners (one for ARM and one for AMD) used names that started with `runner-`. The key pairs created by the module use this as a name prefix, and the terminate-agent-hook Lambda uses the same prefix to identify keys: https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/main/modules/terminate-agent-hook/lambda/lambda_function.py#L200. Once a change triggers an update to the ASG, the shutdown of the old VM fires the hook, which deletes all keys starting with `runner-`. Without the key, new worker VMs do not start up.
As a fix, we renamed our runners to `gitlab-runner-...`.
For the future, I suggest adding

```hcl
validation {
  condition     = !startswith(var.environment, "runner-")
  error_message = "Environment name cannot begin with 'runner-' because it breaks the naming convention for SSH key pairs within the terminate-agent-hook lambda function."
}
```

to the environment variable: https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/main/variables.tf#L47
PS: we use v9.2.2.
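The over-deletion described above can be shown with a minimal sketch (a hypothetical simplification, not the actual lambda code): any key pair whose name begins with the hook's prefix is swept up, so an environment literally named `runner-...` makes its keys indistinguishable from the ones the hook intends to clean up.

```python
# Sketch of prefix-based key-pair cleanup over-deleting keys.
# RUNNER_PREFIX stands in for the prefix the terminate-agent-hook matches on.
RUNNER_PREFIX = "runner-"

def keys_to_delete(all_key_names, prefix=RUNNER_PREFIX):
    """Return every key pair name starting with the given prefix."""
    return [name for name in all_key_names if name.startswith(prefix)]

# Runners named "runner-arm"/"runner-amd" produce keys that match the
# hook's own pattern; renaming to "gitlab-runner-..." avoids the match.
keys = ["runner-arm-worker-key", "runner-amd-worker-key", "gitlab-runner-key"]
print(keys_to_delete(keys))  # the first two are deleted; "gitlab-runner-key" survives
```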