terraform-aws-github-runner
Ephemeral autoscaling does not work, idle runners always at maximum
We are trying to set up autoscaling with the ephemeral runners flag, but the lambda function never seems to scale runners down and always keeps the runner count at the runners_maximum_count value we have set.
For example, if we set runners_maximum_count to 20, then exactly 20 runners are idle all the time, regardless of whether there are any pending jobs.
Version is v1.8.1, using the prebuilt AWS AMI ubuntu-jammy-22.04-amd64-server.
The related configuration is below:
# Uncomment to enable ephemeral runners
delay_webhook_event             = 0
enable_ephemeral_runners        = true
enabled_userdata                = false
minimum_running_time_in_minutes = 10
runners_maximum_count           = 20
scale_down_schedule_expression  = "cron(* * * * ? *)"
enable_job_queued_check         = true
idle_config = [{
  cron      = "* * 9-17 * * 1-5"
  timeZone  = "Europe/Amsterdam"
  idleCount = 3
}]
Could you please advise whether this is a bug or a misconfiguration on our side? Let me know if more information is required.
idle_config is not meant for ephemeral runners, only for non-ephemeral runners. The idle config is used by the scale-down lambda to decide whether or not to kill a runner. You can use the pool to have some warm runners in combination with ephemeral runners.
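For illustration, a minimal sketch of that combination (ephemeral runners with a small warm pool instead of idle_config), reusing only variable names that already appear in this thread; the values and the org name are placeholders, not recommendations:

enable_ephemeral_runners = true

# Warm capacity comes from the pool instead of idle_config
pool_runner_owner = "my-org" # placeholder: org to which the pool runners are added
pool_config = [{
  size                = 2                      # number of warm runners to keep registered
  schedule_expression = "cron(*/10 * * * ? *)" # how often the pool is topped up
}]

# The scale-down lambda removes idle runners after the minimum running time
minimum_running_time_in_minutes = 5
scale_down_schedule_expression  = "cron(*/5 * * * ? *)"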
@npalm I've tried the pool block as well, but the runner count always stays at runners_maximum_count. The added pool_config:
pool_runner_owner = "company" # Org to which the runners are added
pool_config = [{
size = 5 # size of the pool
schedule_expression = "cron(* * * * ? *)" # cron expression to trigger the adjustment of the pool
}]
How do I scale the runners down when there are no pending jobs in the queue?
The pool lambda creates runners every time it is triggered by the schedule. It first looks up the number of idle runners, and then tops the pool up, in your case to 5. The scale-down lambda can shut down your pool runners, based on the minimal running time.
Could you please rephrase your comment?
Please take a look at the config that we are using:
# Uncomment to enable ephemeral runners
delay_webhook_event             = 0
enable_ephemeral_runners        = true
enabled_userdata                = false
minimum_running_time_in_minutes = 5
runners_maximum_count           = 3
enable_job_queued_check         = true

# Uncomment idle config to have idle runners from 9 to 5 in time zone Amsterdam
#idle_config = [{
#  cron      = "* * 9-17 * * 1-5"
#  timeZone  = "Europe/Amsterdam"
#  idleCount = 3
#}]

pool_runner_owner = "<redacted>" # Org to which the runners are added
pool_config = [{
  size                = 3                   # size of the pool
  schedule_expression = "cron(* * * * ? *)" # cron expression to trigger the adjustment of the pool
}]

scale_down_schedule_expression = "cron(*/5 * * * ? *)"
Result:
Can you clarify what causes 6 runners to be spawned when we set runners_maximum_count = 3? There are no jobs in the queue, nothing. We don't see any reason why we should have 6 runners when we expect 0 (or at most 3).
Can you take a look at the config provided above and confirm that it is valid?
We also understand that the enable_ephemeral_runners feature is in beta, but it does not seem to work correctly in our case. We are not sure what might cause this behavior.
I've found that the pool lambda does not take the minimum start time into account. When it runs every minute and a runner has not registered with GitHub in time, it "tops up" the pool again. We've seen this happen 2-3 times, which causes the initial pool deployment to create 2x-3x the pool size while ignoring runners_maximum_count.
We were able to fix this by updating the pool lambda to use the minimum boot time when determining whether the pool should be topped up. It has had limited testing, but I can submit a PR for review.
I've attached a screenshot from the other day (note that one of the arrows points to the wrong line, but the count and number of invocations are still annotated properly).
@M1kep It would be great if you have time to submit a PR.
I thought the issue is that the pool lambda does not see instances that are created but not ready yet. So assuming you run the pool lambda every minute, and creating an instance takes 2 minutes, you can easily end up with 2 or 3 times the size of the pool.
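Until a code change lands, one possible configuration-level mitigation, assuming instances typically register with GitHub within a few minutes (the intervals below are illustrative, not recommendations):

# Trigger the pool top-up less often than the typical boot/registration time,
# so runners that are still booting are not treated as missing and re-created.
pool_config = [{
  size                = 3
  schedule_expression = "cron(*/10 * * * ? *)" # every 10 minutes instead of every minute
}]

# Keep the minimum running time at or above the boot time so the scale-down
# lambda does not remove runners that have not had a chance to register yet.
minimum_running_time_in_minutes = 10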
@M1kep @npalm I faced the same issue. I had a monitor that tracks the number of EC2 instances spawned, and runners_maximum_count was never honored. I did check the code and was about to open an issue, but then I found this one.
It would be nice if the lambda could count the running instances plus the pending ones to determine the maximum you want to run and keep costs under control.
Was this fixed by https://github.com/philips-labs/terraform-aws-github-runner/pull/2801?