
docker+machine not working, pipelines won't run

Open wouter-toppy opened this issue 5 months ago

Describe the bug

Upgraded from 6.5.2 to 9.2.2 (in fact, I deleted all the resources and recreated everything several times). The runner is connected to GitLab and spot instances start, but the pipeline won't run.

[Screenshots attached in the original issue: pipeline, spot instances, CloudWatch]

To Reproduce

Steps to reproduce the behavior:

  1. Install a fresh version of the GitLab runners with docker+machine

Expected behavior

Spot instances stay up and the pipeline runs.

Additional context

runner.tf:


module "gitlab-runner" {
  source = "cattle-ops/gitlab-runner/aws"
  version = "9.2.2"

  environment = lower(var.environment)

  vpc_id = module.vpc.vpc_id
  subnet_id = element(module.vpc.private_subnets, 0)

  runner_instance = {
    name                = var.runner_name
    collect_autoscaling_metrics = ["GroupDesiredCapacity", "GroupInServiceCapacity"]

    ssm_access          = true
    docker_machine_type = "t3.xlarge"
  }

  runner_worker = {
    type = "docker+machine"
  }

  runner_networking = {
    allow_incoming_ping_security_group_ids = [data.aws_security_group.default.id]
  }

  runner_gitlab = {
    url                                           = var.gitlab_url
    preregistered_runner_token_ssm_parameter_name = "name"
  }

  runner_cloudwatch = {
    enable = true
    retention_days = 7
  }

  runner_worker_docker_machine_autoscaling_options = [
    # working 9 to 5 :)
    {
      periods = ["* * 0-9,17-23 * * mon-fri *", "* * * * * sat,sun *"]
      idle_count = 0
      idle_time  = 3600
      timezone   = var.timezone
    }
  ]

  # runner_worker_docker_services = [
  #   {
  #     name  = "docker:dind"
  #     alias = "docker"
  #     command = ["--registry-mirror", "https://mirror.gcr.io"]
  #     entrypoint = ["dockerd-entrypoint.sh"]
  #   }
  # ]

  runner_worker_docker_machine_instance = {
    monitoring = true
  }

  runner_worker_docker_machine_instance_spot = {
    max_price = "on-demand-price"
  }
}

vpc.tf:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = ">= 5.16.0"

  name = "vpc-${var.runner_name}"
  cidr = "10.0.0.0/16"

  azs = [data.aws_availability_zones.available.names[0]]
  private_subnets = ["10.0.1.0/24"]
  public_subnets = ["10.0.101.0/24"]
  map_public_ip_on_launch = false

  enable_nat_gateway = true
  single_nat_gateway = true

  tags = {
    Environment = var.environment
  }
}

module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = ">= 5.16.0"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    s3 = {
      service = "s3"
      tags = { Name = "s3-vpc-endpoint" }
    }
  }

  tags = {
    Environment = var.environment
  }
}

wouter-toppy avatar Jun 18 '25 08:06 wouter-toppy

Use an AMI from this list: https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/blob/main/docs/drivers/aws.md

runner_worker_docker_machine_ami_id = "ami-00e7df8df28dfa791"
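
In the context of the module block from the issue, that would be (a minimal sketch; only the AMI pin is new):

module "gitlab-runner" {
  source  = "cattle-ops/gitlab-runner/aws"
  version = "9.2.2"

  # ... rest of the configuration from runner.tf above ...

  # pin the worker AMI instead of relying on the module's default AMI filter
  runner_worker_docker_machine_ami_id = "ami-00e7df8df28dfa791"
}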

atree-support avatar Jun 18 '25 18:06 atree-support

> Use an AMI from this list: https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/blob/main/docs/drivers/aws.md
>
> runner_worker_docker_machine_ami_id = "ami-00e7df8df28dfa791"

Did this fix the problem?

KuyaGit avatar Jun 30 '25 12:06 KuyaGit

> Use an AMI from this list: https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/blob/main/docs/drivers/aws.md
>
> runner_worker_docker_machine_ami_id = "ami-00e7df8df28dfa791"
>
> Did this fix the problem?

Not for me

aadamovich avatar Jun 30 '25 21:06 aadamovich

9.2.0 worked fine

aadamovich avatar Jun 30 '25 21:06 aadamovich

9.2.2 is working fine here. Let's check what has been changed.

kayman-mk avatar Jul 02 '25 10:07 kayman-mk

@wouter-toppy Did you resolve it?

RuslanMigory avatar Jul 10 '25 09:07 RuslanMigory

Got the same problem on June 23.

I switched to an "older" Ubuntu AMI (101643774237/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20250603), but the same problem came back this morning, after two weeks without an issue :(
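
(For reference, a pin like that can be expressed through the module's AMI filter. This is a sketch: the owner ID is taken from the image path above, and runner_worker_docker_machine_ami_owners is assumed to be the matching module input.)

  # pin the worker AMI to one exact image name instead of a wildcard
  runner_worker_docker_machine_ami_filter = {
    name = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20250603"]
  }
  runner_worker_docker_machine_ami_owners = ["101643774237"]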

thomas-alkaige avatar Jul 10 '25 13:07 thomas-alkaige

Ran into this issue today. We discovered that the AMI version we were using was EOL and no longer receiving updates from Ubuntu. We changed runner_worker_docker_machine_ami_filter from 20.04 to a supported 24.04 release, which resolved our issue.

module "runner" {
  # ... other configuration ...
  runner_worker_docker_machine_ami_filter = {
    name = ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server*"]
  }
}

Key log lines from the manager:

{"driver":"amazonec2","level":"error","msg":"E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_focal-backports_main_cnf_Commands-amd64 - open (2: No such file or directory)","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"Traceback (most recent call last):","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"  File \"/usr/lib/cnf-update-db\", line 27, in <module>","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"    col.create(db)","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"  File \"/usr/lib/python3/dist-packages/CommandNotFound/db/creator.py\", line 95, in create","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"    self._fill_commands(con)","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"  File \"/usr/lib/python3/dist-packages/CommandNotFound/db/creator.py\", line 141, in _fill_commands","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"    raise subprocess.CalledProcessError(returncode=sub.returncode, ","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"subprocess.CalledProcessError: Command '/usr/lib/apt/apt-helper cat-file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_focal-backports_main_cnf_Commands-amd64' returned non-zero exit status 100.","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"Fetched 25.7 MB in 7s (3579 kB/s)","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"Reading package lists...","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"E: Problem executing scripts APT::Update::Post-Invoke-Success 'if /usr/bin/test -w /var/lib/command-not-found/ -a -e /usr/lib/cnf-update-db; then /usr/lib/cnf-update-db > /dev/null; fi'","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"driver":"amazonec2","level":"error","msg":"E: Sub-process returned an error code","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","operation":"create","time":"2025-07-10T14:18:52Z"}
{"error":"exit status 1","fields.time":36612317869,"level":"error","msg":"Machine creation failed","name":"runner-ehu4tnbm-spot-fleet-runne-1752157095-0950a742","time":"2025-07-10T14:18:52Z"}

decafdev avatar Jul 10 '25 15:07 decafdev

I encountered the same problem yesterday, with almost the same logs; updating the module from 7.4 to 9.2 resolved the issue for now. There were also logs indicating that apt-get was not working properly. If it helps: I don't pin any specific AMI in my TF code; I just updated the module and it worked.

seekin4u avatar Jul 11 '25 07:07 seekin4u

Confirming I have the same issue with module version 7.8.0. Following @decafdev's recommendation solved it for me. Thanks!

Maxence-Perrin avatar Jul 11 '25 17:07 Maxence-Perrin

I believe I found the issue. Our runners (one for ARM and one for AMD) used names that started with runner-. The key pairs created by the module use this as a name prefix too, and the terminate-agent-hook uses the same prefix to identify keys: https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/main/modules/terminate-agent-hook/lambda/lambda_function.py#L200. Once a change triggers an update to the ASG, the shutdown of the old VM fires the hook, which deletes all keys starting with runner-. Without its key, new worker VMs do not start up.

As a fix, we renamed our runners to gitlab-runner-....
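
For illustration (the runner names here are hypothetical), the rename amounts to:

module "gitlab-runner" {
  # ...

  # was: environment = "runner-arm" -- worker key pairs are also named
  # "runner-...", so the terminate-agent-hook deleted our runner's key pair
  # together with the workers' keys on scale-in
  environment = "gitlab-runner-arm"
}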

For the future, I suggest adding

  validation {
    condition     = !startswith(var.environment, "runner-")
    error_message = "Environment name cannot begin with 'runner-' because it breaks the naming convention for ssh key pairs within the terminate-agent-hook lambda function."
  }

to the environment variable https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/main/variables.tf#L47
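
In full, that would look something like this (a sketch; the description text is a placeholder, not copied from the module's variables.tf):

variable "environment" {
  description = "A name that identifies the environment, used as a prefix for resource names." # placeholder description
  type        = string

  # reject names that collide with the "runner-" key-pair prefix
  validation {
    condition     = !startswith(var.environment, "runner-")
    error_message = "Environment name cannot begin with 'runner-' because it breaks the naming convention for ssh key pairs within the terminate-agent-hook lambda function."
  }
}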

PS: we use v9.2.2

damoon avatar Aug 25 '25 13:08 damoon