terraform-aws-github-runner
Runners scaling up but not used by GitHub
We recently upgraded to the latest version of terraform-aws-github-runner and are seeing odd behavior:
When a GitHub Actions job requests a runner, the scale-up function triggers and creates a new instance even when existing instances are already available. GitHub ends up assigning the job to an existing instance, but a new instance is still created and never used.
We have the new enable_job_queued_check set to true and are using spot instances. We have minimum_running_time_in_minutes set to 60 minutes to keep runners around for reuse.
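For context, a minimal sketch of just those two inputs (other required inputs are omitted; values are as described above, not our full configuration):

module "github_runner" {
  # ... source, version, VPC, github_app, and other inputs omitted ...

  # Only scale up when the job in the webhook event is still queued on GitHub.
  enable_job_queued_check = true

  # Keep idle runners alive for 60 minutes so later jobs can reuse them.
  minimum_running_time_in_minutes = 60
}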
Let me know what other information you need and if we are missing some new option to control the scale-up behavior.
Maybe a bug, but how long do you keep messages on the queue? When the queue invisible time is set to 0, messages will be processed by the lambda before the job is even started by GitHub, which can cause creation of new instances.
Where is the "queue invisible time" set? I'm not seeing anything in the Terraform config for that. Is it something set elsewhere in AWS? Definitely sounds like the potential issue.
Search in the README for delay_webhook_event.
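Per the README, delay_webhook_event is the number of seconds the webhook event stays invisible on the queue before the scale-up lambda receives it; with it at 0, the lambda can fire before GitHub has handed the job to an already idle runner. Roughly (other inputs omitted):

module "github_runner" {
  # ...

  # Hold the webhook event for 30 seconds (the default) so an existing idle
  # runner has a chance to pick up the job before the scale-up lambda reacts.
  delay_webhook_event = 30
}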
Observing the same behavior with the inputs below:
locals {
  runners = {
    linux-x64 = {
      github_app_id         = "<REDACTED>"
      instance_types        = ["t3.xlarge"]
      runner_architecture   = "x64"
      runner_os             = "linux"
      runner_extra_labels   = ["amzn2"]
      runners_maximum_count = 10
    }
    linux-arm64 = {
      github_app_id         = "<REDACTED>"
      instance_types        = ["t4g.xlarge"]
      runner_architecture   = "arm64"
      runner_os             = "linux"
      runner_extra_labels   = ["amzn2"]
      runners_maximum_count = 10
    }
  }
}

data "aws_availability_zones" "this" {}

module "vpc" {
  source                          = "../../../../../../../modules/aws/vpc/"
  name                            = "${local.namespace}-${local.stack}"
  cidr_block                      = "192.168.0.0/16"
  map_public_ip_on_launch         = true
  enable_ipv6                     = true
  assign_ipv6_address_on_creation = true

  public_subnets = {
    for i, availability_zone in data.aws_availability_zones.this.names :
    trimprefix(availability_zone, data.aws_availability_zones.this.id) => {
      availability_zone       = availability_zone
      cidr_block_newbits      = ceil(log(length(data.aws_availability_zones.this.names), 2))
      cidr_block_netnum       = i
      ipv6_cidr_block_newbits = 8
      ipv6_cidr_block_netnum  = i
    }
  }
}

resource "random_id" "random" {
  for_each    = local.runners
  byte_length = 20
}

module "github_runner" {
  source  = "philips-labs/github-runner/aws"
  version = "1.5.0"

  for_each = local.runners

  prefix     = "${local.stack}-${each.key}"
  aws_region = local.region
  vpc_id     = module.vpc.id
  subnet_ids = [for subnet in module.vpc.public_subnets : subnet.id]

  enable_organization_runners = true
  instance_types              = each.value.instance_types
  runner_architecture         = each.value.runner_architecture
  runner_os                   = each.value.runner_os
  runners_maximum_count       = each.value.runners_maximum_count

  delay_webhook_event            = 0
  enable_ephemeral_runners       = true
  enable_job_queued_check        = true
  scale_down_schedule_expression = "cron(* * * * ? *)"

  github_app = {
    id             = each.value.github_app_id
    key_base64     = base64encode(file("github-app-${each.key}.pem"))
    webhook_secret = random_id.random[each.key].hex
  }

  runner_binaries_syncer_lambda_zip = "lambdas/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas/runners.zip"
  webhook_lambda_zip                = "lambdas/webhook.zip"
}
We haven't changed delay_webhook_event, so it should still be at the default of 30 seconds, which seems like plenty of time. I'm not sure I'm seeing it wait that long, though, so I'll dig into the logs to see if I can figure out the timing.
Hmm, well I thought we were at the default, but looks like it's set to 5 seconds. I'll try increasing that.
@mike-potter FYI, I landed on this combination of settings and it just works:
locals {
  runners = {
    linux-amzn2-x64 = {
      github_app_id         = "<REDACTED>"
      instance_types        = ["t3.xlarge"]
      runner_architecture   = "x64"
      runner_os             = "linux"
      runner_extra_labels   = ["amzn2"]
      runners_maximum_count = 10
    }
    linux-amzn2-arm64 = {
      github_app_id         = "<REDACTED>"
      instance_types        = ["t4g.xlarge"]
      runner_architecture   = "arm64"
      runner_os             = "linux"
      runner_extra_labels   = ["amzn2"]
      runners_maximum_count = 10
    }
  }
}

module "github_runner" {
  source  = "philips-labs/github-runner/aws"
  version = "1.5.0"

  for_each = local.runners

  prefix     = "${local.stack}-${each.key}"
  aws_region = local.region
  vpc_id     = module.vpc.id
  subnet_ids = [for subnet in module.vpc.public_subnets : subnet.id]

  enable_organization_runners = true
  instance_types              = each.value.instance_types
  runner_architecture         = each.value.runner_architecture
  runner_os                   = each.value.runner_os
  runners_maximum_count       = each.value.runners_maximum_count

  delay_webhook_event      = 0
  enable_ephemeral_runners = true
  enable_job_queued_check  = false

  runner_enable_workflow_job_labels_check     = true
  runner_enable_workflow_job_labels_check_all = true
  runner_extra_labels                         = join(",", each.value.runner_extra_labels)

  scale_down_schedule_expression = "cron(* * * * ? *)"

  github_app = {
    id             = each.value.github_app_id
    key_base64     = base64encode(file("github-app-${each.key}.pem"))
    webhook_secret = random_id.random[each.key].hex
  }

  runner_binaries_syncer_lambda_zip = "lambdas/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas/runners.zip"
  webhook_lambda_zip                = "lambdas/webhook.zip"
}
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.