terraform-aws-github-runner

Runners scaling up and not used by GitHub

Open mike-potter opened this issue 2 years ago • 7 comments

We recently upgraded to the latest version of terraform-aws-github-runner and are seeing an odd behavior:

When a GitHub Action requests a runner, the scale-up function triggers and creates a new instance even when existing instances are already available. GitHub ends up assigning the job to the existing instance, but a new instance is still created and never gets used.

We have the new enable_job_queued_check set to true and are using spot instances. We have minimum_running_time_in_minutes set to 60 to keep runners around for reuse.

Let me know what other information you need and if we are missing some new option to control the scale-up behavior.

mike-potter avatar Aug 02 '22 21:08 mike-potter

Maybe a bug, but how long do you keep messages on the queue? When the queue invisible time is set to 0, messages will be processed by the lambda before the job has even been started by GitHub, which can cause new instances to be created.

npalm avatar Aug 03 '22 20:08 npalm

Where is the "queue invisible time" set? I'm not seeing anything in the Terraform config for that. Is it something set elsewhere in AWS? Definitely sounds like the potential issue.

mike-potter avatar Aug 03 '22 22:08 mike-potter

Search in the README for delay_webhook_event.
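
For reference, a minimal sketch of where that setting would sit in the module block. The module label, the omitted arguments, and the value 30 are illustrative assumptions rather than a recommended configuration, and the comment reflects the "queue invisible time" behavior described above:

```hcl
module "github_runner" {
  source = "philips-labs/github-runner/aws"

  # ... other required arguments (region, VPC, GitHub app, lambda zips) omitted ...

  # Assumed meaning: number of seconds a webhook event stays invisible on the
  # queue before the scale-up lambda receives it. A value of 0 lets the lambda
  # act before GitHub has had a chance to assign the job to an idle runner.
  delay_webhook_event = 30 # example value only

  # With the queued check enabled, the lambda should only scale up if the job
  # is still queued once the delay has passed.
  enable_job_queued_check = true
}
```

A longer delay trades slower scale-up for fewer unused instances, so the right value likely depends on how quickly idle runners pick up jobs in a given setup.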

npalm avatar Aug 04 '22 06:08 npalm

Observing the same behavior with the inputs below:

locals {
  runners = {
    linux-x64 = {
      github_app_id         = "<REDACTED>"
      instance_types        = ["t3.xlarge"]
      runner_architecture   = "x64"
      runner_os             = "linux"
      runner_extra_labels   = ["amzn2"]
      runners_maximum_count = 10
    }
    linux-arm64 = {
      github_app_id         = "<REDACTED>"
      instance_types        = ["t4g.xlarge"]
      runner_architecture   = "arm64"
      runner_os             = "linux"
      runner_extra_labels   = ["amzn2"]
      runners_maximum_count = 10
    }
  }
}

data "aws_availability_zones" "this" {}

module "vpc" {
  source = "../../../../../../../modules/aws/vpc/"

  name                            = "${local.namespace}-${local.stack}"
  cidr_block                      = "192.168.0.0/16"
  map_public_ip_on_launch         = true
  enable_ipv6                     = true
  assign_ipv6_address_on_creation = true

  public_subnets = {
    for i, availability_zone in data.aws_availability_zones.this.names :
    trimprefix(availability_zone, data.aws_availability_zones.this.id) => {
      availability_zone       = availability_zone
      cidr_block_newbits      = ceil(log(length(data.aws_availability_zones.this.names), 2))
      cidr_block_netnum       = i
      ipv6_cidr_block_newbits = 8
      ipv6_cidr_block_netnum  = i
    }
  }
}

resource "random_id" "random" {
  for_each    = local.runners
  byte_length = 20
}


module "github_runner" {
  source   = "philips-labs/github-runner/aws"
  version  = "1.5.0"
  for_each = local.runners

  prefix = "${local.stack}-${each.key}"

  aws_region = local.region
  vpc_id     = module.vpc.id
  subnet_ids = [for subnet in module.vpc.public_subnets : subnet.id]

  enable_organization_runners = true
  instance_types              = each.value.instance_types
  runner_architecture         = each.value.runner_architecture
  runner_os                   = each.value.runner_os
  runners_maximum_count       = each.value.runners_maximum_count

  delay_webhook_event            = 0
  enable_ephemeral_runners       = true
  enable_job_queued_check        = true
  scale_down_schedule_expression = "cron(* * * * ? *)"

  github_app = {
    id             = each.value.github_app_id
    key_base64     = base64encode(file("github-app-${each.key}.pem"))
    webhook_secret = random_id.random[each.key].hex
  }

  runner_binaries_syncer_lambda_zip = "lambdas/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas/runners.zip"
  webhook_lambda_zip                = "lambdas/webhook.zip"
}

ohmer avatar Aug 04 '22 12:08 ohmer

We haven't changed delay_webhook_event, so it should still be at the default of 30 seconds, which seems like plenty of time. I'm not sure it's actually waiting that long, but I'll dig into the logs to see if I can figure out the timing.

mike-potter avatar Aug 08 '22 15:08 mike-potter

Hmm, well I thought we were at the default, but looks like it's set to 5 seconds. I'll try increasing that.

mike-potter avatar Aug 08 '22 17:08 mike-potter

@mike-potter FYI, I tried this combination of settings and it just works:


locals {
  runners = {
    linux-amzn2-x64 = {
      github_app_id         = "<REDACTED>"
      instance_types        = ["t3.xlarge"]
      runner_architecture   = "x64"
      runner_os             = "linux"
      runner_extra_labels   = ["amzn2"]
      runners_maximum_count = 10
    }
    linux-amzn2-arm64 = {
      github_app_id         = "<REDACTED>"
      instance_types        = ["t4g.xlarge"]
      runner_architecture   = "arm64"
      runner_os             = "linux"
      runner_extra_labels   = ["amzn2"]
      runners_maximum_count = 10
    }
  }
}

module "github_runner" {
  source   = "philips-labs/github-runner/aws"
  version  = "1.5.0"
  for_each = local.runners

  prefix = "${local.stack}-${each.key}"

  aws_region = local.region
  vpc_id     = module.vpc.id
  subnet_ids = [for subnet in module.vpc.public_subnets : subnet.id]

  enable_organization_runners = true
  instance_types              = each.value.instance_types
  runner_architecture         = each.value.runner_architecture
  runner_os                   = each.value.runner_os
  runners_maximum_count       = each.value.runners_maximum_count

  delay_webhook_event                         = 0
  enable_ephemeral_runners                    = true
  enable_job_queued_check                     = false
  runner_enable_workflow_job_labels_check     = true
  runner_enable_workflow_job_labels_check_all = true
  runner_extra_labels                         = join(",", each.value.runner_extra_labels)
  scale_down_schedule_expression              = "cron(* * * * ? *)"

  github_app = {
    id             = each.value.github_app_id
    key_base64     = base64encode(file("github-app-${each.key}.pem"))
    webhook_secret = random_id.random[each.key].hex
  }

  runner_binaries_syncer_lambda_zip = "lambdas/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas/runners.zip"
  webhook_lambda_zip                = "lambdas/webhook.zip"
}

ohmer avatar Aug 09 '22 10:08 ohmer

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 09 '22 02:09 github-actions[bot]