terraform-aws-github-runner
Ephemeral autoscaling does not work, idle runners always at maximum
We are trying to set up autoscaling with the ephemeral runners flag, but the lambda function never seems to scale runners down and always keeps the runner count at the runners_maximum_count value we have set.
For example, if we set runners_maximum_count to 20, then exactly 20 runners are idle all the time, regardless of whether there are any pending jobs.
Version is v1.8.1, using the prebuilt AWS AMI ubuntu-jammy-22.04-amd64-server.
The related configuration is below:
# Uncomment to enable ephemeral runners
delay_webhook_event             = 0
enable_ephemeral_runners        = true
enabled_userdata                = false
minimum_running_time_in_minutes = 10
runners_maximum_count           = 20
scale_down_schedule_expression  = "cron(* * * * ? *)"
enable_job_queued_check         = true
idle_config = [{
  cron      = "* * 9-17 * * 1-5"
  timeZone  = "Europe/Amsterdam"
  idleCount = 3
}]
Could you please advise whether this is a bug or a misconfiguration on our side? Let me know if more information is required.
idle_config is not meant for ephemeral runners, only for non-ephemeral runners. The idle config is used by the scale-down lambda to decide whether or not to kill a runner. You can use the pool to have some warm runners in combination with ephemeral runners.
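For illustration, a minimal sketch of that combination (ephemeral runners with a small warm pool instead of idle_config), reusing only variable names that already appear in this thread; the values and the org name are placeholders, not recommendations:

enable_ephemeral_runners = true

# Warm capacity comes from the pool instead of idle_config
pool_runner_owner = "my-org" # placeholder: org to which the pool runners are added
pool_config = [{
  size                = 2                      # number of warm runners to keep registered
  schedule_expression = "cron(*/10 * * * ? *)" # how often the pool is topped up
}]

# The scale-down lambda removes idle runners after the minimum running time
minimum_running_time_in_minutes = 5
scale_down_schedule_expression  = "cron(*/5 * * * ? *)"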
@npalm I've tried the pool block as well, but the runner count always stays at runners_maximum_count. The added pool_config:
pool_runner_owner = "company" # Org to which the runners are added
pool_config = [{
size = 5 # size of the pool
schedule_expression = "cron(* * * * ? *)" # cron expression to trigger the adjustment of the pool
}]
How do I scale the runners down when there are no pending jobs in the queue?
The pool lambda creates runners every time it is triggered by the schedule. It first looks up the number of idle runners, and then tops the pool up, in your case to 5. The scale-down lambda can shut down your pool runners, based on the minimal running time.
Could you please rephrase your comment?
Please take a look at the config that we are using:
# Uncomment to enable ephemeral runners
delay_webhook_event             = 0
enable_ephemeral_runners        = true
enabled_userdata                = false
minimum_running_time_in_minutes = 5
runners_maximum_count           = 3
enable_job_queued_check         = true

# Uncomment idle config to have idle runners from 9 to 5 in time zone Amsterdam
#idle_config = [{
#  cron      = "* * 9-17 * * 1-5"
#  timeZone  = "Europe/Amsterdam"
#  idleCount = 3
#}]

pool_runner_owner = "<redacted>" # Org to which the runners are added
pool_config = [{
  size                = 3                   # size of the pool
  schedule_expression = "cron(* * * * ? *)" # cron expression to trigger the adjustment of the pool
}]

scale_down_schedule_expression = "cron(*/5 * * * ? *)"
Result:
Can you clarify what causes 6 runners to be spawned when we set runners_maximum_count = 3? There are no jobs in the queue, nothing. We don't see any reason why we should have 6 runners when we expect 0 (or at most 3).
Can you take a look at the config provided above and confirm that it is valid?
We also understand that the enable_ephemeral_runners feature is in beta, but it does not seem to work correctly in our case. We are not sure what might cause this behavior.
I've found that the pool lambda does not take the minimum start time into account. When it runs every minute and a runner has not registered with GitHub in time, it "tops up" the pool again. We've seen this happen 2-3 times, which causes the initial pool deployment to create 2x-3x the pool size while ignoring runners_maximum_count.
We were able to fix this by updating the pool lambda to use the minimum boot time when determining whether the pool should be topped up. It has had limited testing, but I can submit a PR for review.
I've attached a screenshot from the other day (note that one of the arrows points to the wrong line, but the count and number of invocations are still annotated properly).
@M1kep It would be great if you have time to submit a PR.
I thought the issue is that the pool lambda does not see instances that are created but not ready yet. So assuming you run the pool lambda every minute, and creating an instance takes 2 minutes, you can easily end up with 2 or 3 times the size of the pool.
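Until a code change lands, one possible configuration-level mitigation, assuming instances typically register with GitHub within a few minutes (the intervals below are illustrative, not recommendations):

# Trigger the pool top-up less often than the typical boot/registration time,
# so runners that are still booting are not treated as missing and re-created.
pool_config = [{
  size                = 3
  schedule_expression = "cron(*/10 * * * ? *)" # every 10 minutes instead of every minute
}]

# Keep the minimum running time at or above the boot time so the scale-down
# lambda does not remove runners that have not had a chance to register yet.
minimum_running_time_in_minutes = 10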
@M1kep @npalm I faced the same issue. I had a monitor that tracks the number of EC2 instances spawned, and runners_maximum_count was never honored. I did check the code and was about to open an issue, but then I found this one.
It would be nice if the lambda could count the running instances plus the pending ones to determine the maximum you want to run and keep costs under control.
Was this fixed by https://github.com/philips-labs/terraform-aws-github-runner/pull/2801?