
Runner fleeting implementation

OliPou opened this issue 1 year ago

Hi team,

Describe the bug

I'm trying to implement runner fleeting from the example https://github.com/cattle-ops/terraform-aws-gitlab-runner/tree/main/examples/runner-fleeting-plugin. But after the implementation, the GitLab Runner shows up as "Never contacted".

To Reproduce

So I registered an SSM Parameter Store parameter where I stored my runner authentication token (called gitlab-runner-token).

Then I copied all the files from https://github.com/cattle-ops/terraform-aws-gitlab-runner/tree/main/examples/runner-fleeting-plugin and just added a default value for:

variable "preregistered_runner_token_ssm_parameter_name" {
  description = "The name of the SSM parameter to read the preregistered GitLab Runner token from."
  type        = string
  default     = "gitlab-runner-token"
}
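
For reference, the SSM parameter itself was created roughly like this (a sketch; the resource name and the variable holding the token are illustrative):

resource "aws_ssm_parameter" "gitlab_runner_token" {
  name  = "gitlab-runner-token"
  type  = "SecureString"
  # illustrative: the runner authentication token obtained from GitLab
  value = var.gitlab_runner_authentication_token
}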

I must have missed a step, but I don't understand which one. I don't see anything in the cloud-init log. It looks like nothing has been initialized.

After the initialization I also tried adding the runner manually, and that works. But I still have weird logs in my gitlab-runner service:

gitlab-runner.service - GitLab Runner
     Loaded: loaded (/etc/systemd/system/gitlab-runner.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/gitlab-runner.service.d
             └─kill.conf
     Active: active (running) since Mon 2024-09-16 18:34:50 UTC; 1h 18min ago
   Main PID: 25762 (gitlab-runner)
      Tasks: 17 (limit: 1059)
     Memory: 60.9M
        CPU: 7.855s
     CGroup: /system.slice/gitlab-runner.service
             ├─25762 /usr/bin/gitlab-runner run --working-directory /home/gitlab-runner --config /etc/gitlab-runner/config.toml --service gitlab-runner --user gitlab-runner
             └─25778 fleeting-plugin-aws

Sep 16 19:53:21 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:21.991Z [INFO] increasing instances: amount=3 group=aws/eu-west-3/runners-default-asg
Sep 16 19:53:22 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:22.195Z [ERROR] increase instances: group=aws/eu-west-3/runners-default-asg num_requested=3 num_successful=0 err="rpc error: code = Unknown desc = increase instances: operation error Aut>
Sep 16 19:53:27 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:27.062Z [INFO] increasing instances: amount=3 group=aws/eu-west-3/runners-default-asg
Sep 16 19:53:27 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:27.265Z [ERROR] increase instances: group=aws/eu-west-3/runners-default-asg num_requested=3 num_successful=0 err="rpc error: code = Unknown desc = increase instances: operation error Aut>
Sep 16 19:53:32 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:32.088Z [INFO] increasing instances: amount=3 group=aws/eu-west-3/runners-default-asg
Sep 16 19:53:32 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:32.209Z [ERROR] increase instances: group=aws/eu-west-3/runners-default-asg num_requested=3 num_successful=0 err="rpc error: code = Unknown desc = increase instances: operation error Aut>
Sep 16 19:53:37 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:37.038Z [INFO] increasing instances: amount=3 group=aws/eu-west-3/runners-default-asg
Sep 16 19:53:37 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:37.240Z [ERROR] increase instances: group=aws/eu-west-3/runners-default-asg num_requested=3 num_successful=0 err="rpc error: code = Unknown desc = increase instances: operation error Aut>
Sep 16 19:53:42 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:42.062Z [INFO] increasing instances: amount=3 group=aws/eu-west-3/runners-default-asg
Sep 16 19:53:42 ip-10-0-1-12.eu-west-3.compute.internal gitlab-runner[25762]: 2024-09-16T19:53:42.246Z [ERROR] increase instances: group=aws/eu-west-3/runners-default-asg num_requested=3 num_successful=0 err="rpc error: code = Unknown desc = increase instances: operation error Aut>

OliPou avatar Sep 16 '24 18:09 OliPou

Hi, last week I also tried to set up the fleeting runner, but I'm also stuck with the following error message:

Sep 19 15:35:36 ip-10-0-101-156.eu-central-1.compute.internal gitlab-runner[36430]: {"amount":1,"group":"aws/eu-central-1/d7-de-fleet-manager-asg","level":"info","msg":"increasing instances","runner":"nu_w_Cwzy","subsystem":"taskscaler","time":"2024-09-19T15:35:36Z"}
Sep 19 15:35:36 ip-10-0-101-156.eu-central-1.compute.internal gitlab-runner[36430]: {"group":"aws/eu-central-1/d7-de-fleet-manager-asg","level":"info","msg":"increasing instances response","num_requested":1,"num_successful":0,"runner":"nu_w_Cwzy","subsystem":"taskscaler","time":"2024-09-19T15:35:36Z"}
Sep 19 15:35:36 ip-10-0-101-156.eu-central-1.compute.internal gitlab-runner[36430]: {"err":"rpc error: code = Unknown desc = increase instances: operation error Auto Scaling: SetDesiredCapacity, https response error StatusCode: 400, RequestID: 4f65874f-2ca1-4d17-abe4-0bc0d2d22e30, api error ValidationError: New SetDesiredCapacity value 1 is above max value 0 for the AutoScalingGroup.","group":"aws/eu-central-1/d7-de-fleet-manager-asg","level":"error","msg":"increasing instances failure","num_requested":1,"num_successful":0,"runner":"nu_w_Cwzy","subsystem":"taskscaler","time":"2024-09-19T15:35:36Z"}

Here is my Terraform configuration:

data "aws_availability_zones" "available" {
  state = "available"
}

data "aws_security_group" "default" {
  name   = "default"
  vpc_id = module.vpc.vpc_id
}

# VPC Flow logs are not needed here
# kics-scan ignore-line
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.13.0"

  name = "vpc-${var.environment}"
  cidr = "10.0.0.0/16"

  azs                     = [data.aws_availability_zones.available.names[0]]
  private_subnets         = ["10.0.1.0/24"]
  public_subnets          = ["10.0.101.0/24"]
  map_public_ip_on_launch = true

  tags = {
    Environment = var.environment
  }
}

module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "5.13.0"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    s3 = {
      service = "s3"
      tags = { Name = "s3-vpc-endpoint" }
    }
  }

  tags = {
    Environment = var.environment
  }
}

module "runner" {
  source = "cattle-ops/gitlab-runner/aws"

  environment = var.environment

  vpc_id    = module.vpc.vpc_id
  subnet_id = element(module.vpc.public_subnets, 0)

  runner_cloudwatch = {
    enable = false
  }

  runner_instance = {
    collect_autoscaling_metrics = ["GroupDesiredCapacity", "GroupInServiceCapacity"]
    name                        = var.runner_name
    type                        = "t3.small"
    ssm_access                  = true
    monitoring                  = true
    private_address_only        = false
  }

  runner_networking = {
    allow_incoming_ping_security_group_ids = [data.aws_security_group.default.id]
  }

  runner_gitlab = {
    url = var.gitlab_url

    preregistered_runner_token_ssm_parameter_name = var.preregistered_runner_token_ssm_parameter_name
  }

  runner_worker = {
    type       = "docker-autoscaler"
    ssm_access = true
  }

  runner_worker_docker_autoscaler = {
    fleeting_plugin_version = "1.0.0"
  }

  runner_worker_docker_autoscaler_ami_owners = ["591542846629"]
  runner_worker_docker_autoscaler_ami_filter = {
    name = ["al2023-ami-ecs-hvm-2023.0.20240905-kernel-6.1-x86_64"]
  }

  runner_worker_docker_machine_instance = {
    monitoring           = true
    private_address_only = false
    subnet_ids           = module.vpc.public_subnets
  }

  runner_worker_docker_autoscaler_instance = {
    root_size            = 16
    monitoring           = true
    private_address_only = false
  }

  runner_worker_docker_autoscaler_asg = {
    subnet_ids                               = module.vpc.public_subnets
    types                                    = ["m5.large", "m5.xlarge"]
    enable_mixed_instances_policy            = true
    on_demand_base_capacity                  = 1
    on_demand_percentage_above_base_capacity = 0
    max_growth_rate                          = 6
  }

  runner_worker_docker_autoscaler_autoscaling_options = [
    {
      periods = ["* * * * *"]
      timezone     = var.timezone
      idle_count   = 0
      idle_time    = "0s"
      scale_factor = 0
    }, {
      periods = ["* 8-17 * * mon-fri"]
      timezone     = var.timezone
      idle_count   = 0
      idle_time    = "1m"
      scale_factor = 0
    }
  ]

  runner_worker_docker_options = {
    privileged = true,
    image      = "docker:24.0.6",
    volumes = ["/cache", "/certs/client", "/var/run/docker.sock:/var/run/docker.sock"]
  }

  tags = {
    "tf-aws-gitlab-runner:example"           = "runner-default"
    "tf-aws-gitlab-runner:instancelifecycle" = "spot:yes"
  }
}

Wohlie avatar Sep 23 '24 06:09 Wohlie

I had the same issue a few weeks ago. I discovered that AWS EC2 Instance Connect wasn't installed in the Amazon Linux 2023 ECS Amazon Machine Image.

The fleeting implementation uses EC2 Instance Connect to make a temporary SSH public key available through the EC2 metadata service, which the SSH daemon then checks against. Unfortunately, this doesn't work unless EC2 Instance Connect is installed and properly configured in the SSH daemon config.

I managed to fix it with a custom start script to install EC2 Instance Connect.

  runner_worker_docker_autoscaler_instance = {
    start_script = <<EOF
#cloud-config
repo_update: true
packages:
- ec2-instance-connect
EOF
  }
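
If you already set other keys in runner_worker_docker_autoscaler_instance (as in the configuration above), the start_script goes alongside them, roughly like this (a sketch combining the two):

  runner_worker_docker_autoscaler_instance = {
    root_size            = 16
    monitoring           = true
    private_address_only = false
    start_script         = <<EOF
#cloud-config
repo_update: true
packages:
- ec2-instance-connect
EOF
  }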

I hope this helps, Daniel

Dan1el42 avatar Sep 30 '24 13:09 Dan1el42

I usually recommend using the pre-defined AMIs from variables.tf, just to make sure that everything is working. Afterwards, switch to your specific AMI.

kayman-mk avatar Oct 21 '24 12:10 kayman-mk

Has anyone been able to solve this yet?

EDIT: I was able to solve it by manually updating the maximum capacity in the Auto Scaling group for the runners.

nestorFigliuolo avatar Oct 22 '24 15:10 nestorFigliuolo

I'm encountering the same issue. For me, manually setting the maximum capacity in the Auto Scaling group also resolved it. However, now I see 3-4 nodes just idling around for no apparent reason. Is there something I need to do to get rid of those?

Ideally it should be 0 if no jobs are running.
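
My understanding is that the idle behaviour is driven by runner_worker_docker_autoscaler_autoscaling_options; I'd expect a policy along these lines (a sketch mirroring the configuration posted earlier in this thread, with an illustrative timezone) to keep the fleet at zero when no jobs are running:

  runner_worker_docker_autoscaler_autoscaling_options = [
    {
      periods      = ["* * * * *"]
      timezone     = "Europe/Berlin" # illustrative
      idle_count   = 0
      idle_time    = "0s"
      scale_factor = 0
    }
  ]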

marvin-w avatar Nov 16 '24 13:11 marvin-w

I'm having the same issue with the ASG having a max size of 0 by default.

Manually adjusting this works to get things going.

william00179 avatar Dec 10 '24 20:12 william00179

Also seeing this issue

trudesea avatar Dec 18 '24 00:12 trudesea

Can we work on the max_size=0 issue? The code says

https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/6f684d3bb9f023e79f2e6fdcf562937111705ee9/docker_autoscaler.tf#L179-L181

Any idea why it is 0 in your setup?

kayman-mk avatar Jan 16 '25 08:01 kayman-mk

It looks like the default is 0 here: https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/6f684d3bb9f023e79f2e6fdcf562937111705ee9/variables.tf#L411

I guess it happens if you are not overriding it: https://github.com/cattle-ops/terraform-aws-gitlab-runner/issues/1185#issuecomment-2367336513 (here max_jobs is not set, so it's 0)
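
In other words, unless max_jobs is overridden in runner_worker, the worker ASG ends up with a maximum size of 0. Paraphrased (not verbatim from variables.tf), the relevant part of the variable looks roughly like this:

variable "runner_worker" {
  type = object({
    # other attributes omitted
    max_jobs = optional(number, 0)
  })
}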

marvin-w avatar Jan 16 '25 08:01 marvin-w

Oh yes, of course. Reading the docs, I feel that 1 is the better default; 0 makes no sense here.

kayman-mk avatar Jan 16 '25 08:01 kayman-mk

This is what I had to do to resolve the issue without manual configuration after the fact:

runner_worker = {
  type     = "docker-autoscaler"
  max_jobs = 100
}

This setting isn't in any of the examples; perhaps, in addition to changing the default to something other than 0, it could be documented there.

trudesea avatar Jan 16 '25 13:01 trudesea

Many of the issues in the comments will be resolved by #1221 @kayman-mk

jonmcewen avatar Feb 12 '25 08:02 jonmcewen

Oh yes, of course. Reading the docs, I feel that 1 is the better default; 0 makes no sense here.

I think this is catching people out because zero used to mean unlimited with docker-machine (from memory, I could be wrong), and that's probably why zero was the default.

jonmcewen avatar Feb 12 '25 08:02 jonmcewen

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Apr 14 '25 03:04 github-actions[bot]

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot] avatar Apr 30 '25 03:04 github-actions[bot]