
Duplicate runners are being created

imhardikj opened this issue 3 years ago · 21 comments

Current module version: 4.39.0

Whatever number I put in runners_idle_count, the number of runners created is double that. Take a look at this screenshot: runners_idle_count is 5 here, but 10 runners are being created:

(screenshot from 2022-02-25 showing the 10 runner instances)

And these extra machines don't have any names; as you can see, their names are blank. Their tags are also empty.

Here's my code:

main.tf
locals {
  name = "${var.prefix}-${terraform.workspace}"

  common_tags = {
    Terraform = "True"
    env       = "${var.environment}-${terraform.workspace}"
    Owner     = var.contact
  }
}

data "aws_security_group" "default" {
  name   = "default"
  vpc_id = module.vpc.vpc_id
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.12.0"

  name = "${local.name}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a"]
  private_subnets = ["10.0.1.0/24"]
  public_subnets  = ["10.0.101.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false

  tags = local.common_tags
}

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.39.0"

  aws_region  = var.region
  environment = var.environment

  vpc_id                   = module.vpc.vpc_id
  subnet_ids_gitlab_runner = module.vpc.private_subnets
  subnet_id_runners        = element(module.vpc.private_subnets, 0)

  overrides = {
    name_sg                     = "Gitlab-runner-autoscale-sg"
    name_runner_agent_instance  = "Gitlab-Runner-Agent"
    name_docker_machine_runners = "Gitlab-docker-machine-runner"
  }

  enable_runner_ssm_access         = true
  gitlab_runner_security_group_ids = [data.aws_security_group.default.id]

  # docker_machine_download_url   = "https://gitlab-docker-machine-downloads.s3.amazonaws.com/v0.16.2-gitlab.2/docker-machine"
  docker_machine_spot_price_bid = "0.04700"
  docker_machine_instance_type  = "m5.large"

  # runners_executor = "docker"
  instance_type = "t3.micro"

  runners_name       = "not-prefix"
  runners_gitlab_url = "https://gitlab.com"

  # runner_ami_filter = {
  #   name = ["amzn2-ami-hvm-2.*-x86_64-ebs"]
  # }
  # runner_ami_owners = ["amazon"]

  #runners_limit     = 0
  runners_idle_time  = 90
  runners_idle_count = 5

  gitlab_runner_registration_config = {
    registration_token = var.registration_token
    tag_list           = "docker, autoscale"
    description        = "gitlab runner autoscale fleet"
    locked_to_project  = "false"
    run_untagged       = "true"
    maximum_timeout    = "3600"
  }

  tags = merge(
    local.common_tags,
    tomap({
      "tf-aws-gitlab-runner:example"           = "runner-default"
      "tf-aws-gitlab-runner:instancelifecycle" = "spot:yes"
    })
  )

  runners_privileged         = "true"
  runners_additional_volumes = ["/certs/client"]

  runners_volumes_tmpfs = [
    {
      volume  = "/var/opt/cache",
      options = "rw,noexec"
    }
  ]

  runners_services_volumes_tmpfs = [
    {
      volume  = "/var/lib/mysql",
      options = "rw,noexec"
    }
  ]

  cache_bucket_prefix = local.name
}

resource "null_resource" "cancel_spot_requests" {
  # Cancel active and open spot requests, terminate instances
  triggers = {
    environment = var.environment
  }

  provisioner "local-exec" {
    when    = destroy
    command = "bin/cancel-spot-instances.sh ${self.triggers.environment}"
  }
}

Thank You.

imhardikj avatar Feb 25 '22 12:02 imhardikj

I have seen these machines without a name in earlier versions of the module too. But I had no time to dig into it.

kayman-mk avatar Feb 25 '22 22:02 kayman-mk

I did not have the issue. But most likely it is related to the GitLab Runner agent or docker-machine. In the latest release I have also updated the default docker-machine version to the latest patch level.

When you enable SSM you can access the agent EC2 instance and run docker-machine ls. I have only seen the problem years ago, when we were running on dedicated (on-prem) VMs and ran out of IP addresses.

npalm avatar Feb 27 '22 13:02 npalm

I have updated my code from 4.39.0 to 4.41.0. But now the problem is that no runners are being created, and no runner is visible under available runners in the project's CI/CD settings (the project whose token I am using).

(screenshot from 2022-02-28 showing no runner instances)

See the image. I tried three times (destroying everything and running terraform apply again) with runners_idle_count = 4, but no instances are created. I have also added enable_runner_ssm_access = true, and if I now SSM into the runner agent and run docker-machine ls, it shows:

sh-4.2$ docker-machine ls
sh: docker-machine: command not found

Also, after terraform apply a builds folder containing some Lambda-related zip file is now being generated in the folder where the Terraform files are. If I change back to 4.39.0 it works, but duplicate runners are being created. Here's my code:

main.tf
locals {
  name = "${var.prefix}-${terraform.workspace}"

  common_tags = {
    Terraform = "True"
    env       = "${var.environment}-${terraform.workspace}"
    Owner     = var.contact
  }
}

data "aws_security_group" "default" {
  name   = "default"
  vpc_id = module.vpc.vpc_id
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.12.0"

  name = "${local.name}-vpc"
  cidr = "10.97.224.0/20"

  azs             = ["eu-west-1a"]
  private_subnets = ["10.97.224.0/21"]
  public_subnets  = ["10.97.232.0/21"]

  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false

  tags = local.common_tags
}

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.41.0"

  aws_region  = var.region
  environment = var.environment

  vpc_id                   = module.vpc.vpc_id
  subnet_ids_gitlab_runner = module.vpc.private_subnets
  subnet_id_runners        = element(module.vpc.private_subnets, 0)

  overrides = {
    name_sg                     = "Gitlab-runner-autoscale-sg"
    name_runner_agent_instance  = "Gitlab-Runner-Agent"
    name_docker_machine_runners = "Gitlab-docker-machine-runner"
  }

  enable_runner_ssm_access         = true
  gitlab_runner_security_group_ids = [data.aws_security_group.default.id]

  # docker_machine_download_url   = "https://gitlab-docker-machine-downloads.s3.amazonaws.com/v0.16.2-gitlab.2/docker-machine"
  docker_machine_spot_price_bid    = "0.04700"
  docker_machine_instance_type     = "m5.large"
  enable_docker_machine_ssm_access = true

  # runners_executor = "docker"
  instance_type = "t3.micro"

  runners_name       = var.runner_name
  runners_gitlab_url = "https://gitlab.com"

  # runner_ami_filter = {
  #   name = ["amzn2-ami-hvm-2.*-x86_64-ebs"]
  # }
  # runner_ami_owners = ["amazon"]

  #runners_limit     = 0
  runners_idle_time  = 2700
  runners_idle_count = 4

  gitlab_runner_registration_config = {
    registration_token = var.registration_token
    tag_list           = "docker, autoscale"
    description        = "gitlab runner autoscale fleet"
    locked_to_project  = "false"
    run_untagged       = "true"
    maximum_timeout    = "3600"
  }

  tags = merge(
    local.common_tags,
    tomap({
      "tf-aws-gitlab-runner:example"           = "runner-default"
      "tf-aws-gitlab-runner:instancelifecycle" = "spot:yes"
    })
  )

  runners_privileged         = "true"
  runners_additional_volumes = ["/certs/client"]

  runners_volumes_tmpfs = [
    {
      volume  = "/var/opt/cache",
      options = "rw,noexec"
    }
  ]

  runners_services_volumes_tmpfs = [
    {
      volume  = "/var/lib/mysql",
      options = "rw,noexec"
    }
  ]

  cache_bucket_prefix = local.name
}

resource "null_resource" "cancel_spot_requests" {
  # Cancel active and open spot requests, terminate instances
  triggers = {
    environment = var.environment
  }

  provisioner "local-exec" {
    when    = destroy
    command = "bin/cancel-spot-instances.sh ${self.triggers.environment}"
  }
}

Thank you.

imhardikj avatar Feb 28 '22 10:02 imhardikj

@imhardikj I had the same issue. For me, it worked after upgrading to version 4.41.1 and deleting tribe-gitlab-runners-runner-token from AWS Systems Manager.

ermiaqasemi avatar Mar 23 '22 00:03 ermiaqasemi

Hey, everyone! I'm facing the same issue; was anyone able to solve it? Thank you very much!

erickfaustino avatar Apr 30 '22 22:04 erickfaustino

I have seen these no-name machines too and haven't found the reason. All tags are missing.

By the way: idle_count = 5 means that if 3 machines are processing jobs, there will be 5 idle instances in addition, summing up to 8 in total. So in your case it could be that 5 executors are processing jobs; as there are no idle instances at that moment, the runner creates 5 more, as requested by your configuration.
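
To make that arithmetic concrete, here is a minimal, illustrative sketch of the two module variables involved (the values are made up for the example, not taken from anyone's config):

# With 3 jobs running and runners_idle_count = 5, docker-machine keeps
# 3 busy + 5 idle = 8 executor instances alive at that moment.
runners_idle_count = 5   # idle machines kept on top of the busy ones
runners_idle_time  = 90  # seconds a machine may sit idle before it is removed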

kayman-mk avatar May 01 '22 08:05 kayman-mk

Running on version 4.41.1.

I've always seen these machines without tags, but now that we've switched to scheduled autoscaling, these machines (spot instances) stayed around even on the weekend. The idle count during the weekend should be 0, and no jobs were being processed during that time.

{
  idle_count = 0
  idle_time  = 60
  periods    = ["* * * * * sat,sun *"]
  timezone   = "Europe/Berlin"
}

Even removing the schedule with

  runners_idle_count = 1
  runners_idle_time = 600

I still have one machine without tags running.
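
For reference, here is a minimal sketch of how such a schedule block is typically wired into the module call. The variable name (runners_machine_autoscaling) and the surrounding values are assumptions for illustration; check the inputs of the module version you actually run:

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.41.1"

  # ... the usual required arguments (vpc_id, subnets, registration config, ...) ...

  # Baseline idle settings used outside the scheduled periods.
  runners_idle_count = 1
  runners_idle_time  = 600

  # Scheduled override: keep no idle machines on weekends.
  # NOTE: variable name assumed for the module versions discussed here.
  runners_machine_autoscaling = [
    {
      idle_count = 0
      idle_time  = 60
      periods    = ["* * * * * sat,sun *"]
      timezone   = "Europe/Berlin"
    }
  ]
}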

AlexEndris avatar May 09 '22 07:05 AlexEndris

If the tags are missing, the instances run forever, I suppose. Even our Lambda function does not kill them, as the tags are needed to identify them.

We should really tackle this problem.

kayman-mk avatar May 09 '22 15:05 kayman-mk

Yes, that's what we're seeing as well. I tried to kill those machines manually (both by terminating them and by cancelling the spot request), but that resulted in some strange behaviour: sometimes I even had to kill the agent, or otherwise none of the machines would get a new job.

AlexEndris avatar May 09 '22 16:05 AlexEndris

Killing the no-name instances always works for me.

Maybe we can track it down to a specific version of the module?

kayman-mk avatar May 09 '22 16:05 kayman-mk

Killing the no-name instances always works for me.

Maybe it's something different I'm experiencing when killing them.

Maybe we can track it down to a specific version of the module?

Unfortunately, I can't help there at the moment. I've been seeing those instances for a while, but I only know the version we're using now (4.41.1) and the 4.39.0 the OP mentioned.

But since I'm working on something regarding the runners tomorrow, I could try some older versions as well if I remember, if you can give me a rough idea which version to start with.

AlexEndris avatar May 09 '22 16:05 AlexEndris

Today everything was fine. I have just installed the versions 4.31.0, 4.35.0 and 4.41.1. I guess it is a version in between.

kayman-mk avatar May 09 '22 18:05 kayman-mk

I just tried all 3 versions and in every version I get this issue.

AlexEndris avatar May 10 '22 10:05 AlexEndris

All 3 versions have been running in parallel for a week now. No problems so far. The runners are configured to run in all 3 AZs (eu-central-1) with up to 13 machines per AZ.

I had the problems before, running 4.41.1 in 3 AZs with up to 20 instances per AZ.

kayman-mk avatar May 16 '22 07:05 kayman-mk

With autoscaling schedules that should remove all machines over the weekend or at night, they accumulated. Using just the global idle_count variable, it's one instance that doesn't go away. But I couldn't see any difference between the versions; sometimes I had the tagless machine immediately, sometimes a bit later. It seems kind of random.

AlexEndris avatar May 16 '22 07:05 AlexEndris

Still not fixed. Just killed 20 machines. My AWS console looked like the one of the thread creator.

I checked the logs on the runners and the executor but didn't find anything suspicious.

kayman-mk avatar Jun 23 '22 09:06 kayman-mk

I believe this isn't a problem with this Terraform module but rather with docker-machine. Docker Machine has officially been discontinued, and GitLab created a fork that they maintain themselves. I feel like opening an issue there would help more... We're currently not experiencing it, but we don't have many machines running.

AlexEndris avatar Jun 23 '22 14:06 AlexEndris

Yeah, I think so too. This issue should be opened with the GitLab Docker Machine project.

Looks like the chances are good that they will deal with it. According to the project page:

How does this change help reduce cost of usage?

https://gitlab.com/gitlab-org/ci-cd/docker-machine

kayman-mk avatar Jul 02 '22 20:07 kayman-mk

@kayman-mk do you know if this issue was ever opened with the GitLab Docker Machine project?

@npalm when I run docker-machine ls here's what I get:

NAME                                              ACTIVE   DRIVER      STATE   URL   SWARM   DOCKER    ERRORS
runner-replac-cicd-review-1665510525-681c176c     -        amazonec2   Error                 Unknown   MissingParameter: The request must contain the parameter InstanceId
                                                  status code: 400, request id: 62087932-85ec-4775-930e-b8db6af9504a
runner-xue8tdx1-cicd-review-1665510528-7e206979   -    amazonec2   Running   tcp://192.168.92.155:2376        v20.10.18

Like @AlexEndris, killing these machines manually results in very weird behavior: it pretty much borks everything and I have to destroy and re-apply.

mbuotidem avatar Oct 11 '22 18:10 mbuotidem

No clue, I have to check, but time is currently very limited. Are those machines also created after a fresh deployment?

npalm avatar Oct 11 '22 18:10 npalm

Yeah, I have also been seeing these *replac* machines for some weeks now. It looks like they appear when the agent is started. As this does not happen very often, I kill the machines manually.

I think it has something to do with the initialization of the docker machine on the agent. replac is part of the placeholder token which is stored in config.toml (__REPLACED_BY_USER_DATA__) before the initialization script inserts the correct token from SSM. Maybe it would be helpful to

  1. replace the token and then
  2. initialize the docker machine

At the moment it is just the other way around.

kayman-mk avatar Oct 13 '22 08:10 kayman-mk

If the tags are missing, the instances run forever, I suppose. Even our Lambda function does not kill them, as the tags are needed to identify them.

We should really tackle this problem.

@npalm Yes, they are still getting created, so I ended up creating a Lambda on a schedule to kill these using the KeyName, which fortunately is consistently runner-replac-cicd*.

@kayman-mk Is it possible that your idea to switch the initialization order could break my workaround? If you go down this route, maybe you could tag the instances with one identifier pre-initialization and another after a successful initialization, so that a Lambda like mine would still have something to use to determine which instances to kill.

mbuotidem avatar Nov 14 '22 15:11 mbuotidem

Fetching the GitLab token too late in the initialization process creates the dangling docker machines described here. It only happens if the idle_count is greater than 0.

@mbuotidem #574 fixes the problem, so no special Lambdas or the like are needed to kill the machines. They are simply no longer created.

kayman-mk avatar Nov 17 '22 08:11 kayman-mk

🎉 This issue has been resolved in version 5.5.0 🎉

The release is available on the GitHub release page.

Your semantic-release bot 📦🚀

semantic-releaser[bot] avatar Nov 27 '22 12:11 semantic-releaser[bot]