Make self-hosted GitHub runners ephemeral (and gracefully deregister)
Our self-hosted GitHub runners autoscale in a Managed Instance Group, but they are not currently ephemeral. That means state from one build can influence a subsequent build. We take some steps to mitigate this (deleting the workdir), but it's incomplete, and it doesn't help with security, which is why we still have GitHub Actions configured to require approval for runs from non-collaborators.
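For context, the runner supports this mode natively: registering with `--ephemeral` makes it accept exactly one job, then exit and deregister. A minimal sketch of that registration step (the URL, token source, name, and labels below are illustrative placeholders, not our real config):

```bash
# Register this runner as ephemeral: it takes a single job, then run.sh exits
# and the registration is removed. All values here are placeholders.
./config.sh \
  --url "https://github.com/ORG/REPO" \
  --token "${REGISTRATION_TOKEN}" \
  --ephemeral \
  --unattended \
  --name "$(hostname)" \
  --labels "self-hosted,linux,x64"
./run.sh  # exits after completing one job
```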
To make the runners ephemeral, we need the VM to shut down when the runner exits. An ephemeral runner deregisters itself after performing a job, but if it's shut down another way (e.g. via MIG scale-in) it won't (https://github.com/actions/runner/issues/1364). We can attach deregistration to a shutdown hook (calling the token proxy to fetch a deregistration token). Right now, I'm just periodically deregistering offline runners via the API to keep the UI clean (GitHub only removes them after 30 days).
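Roughly the shape of the shutdown hook I have in mind (a sketch only; the token-proxy endpoint and runner install path here are placeholders):

```bash
#!/bin/bash
# deregister-runner.sh -- run on the way down (e.g. from a systemd oneshot
# unit's ExecStop=) so that a runner killed by MIG scale-in still gets
# removed from GitHub instead of lingering as "offline".
set -euo pipefail

RUNNER_DIR="/runner"                                          # placeholder
TOKEN_PROXY_URL="http://token-proxy.internal/remove-token"    # placeholder

# Fetch a deregistration (remove) token from the token proxy.
REMOVE_TOKEN="$(curl -sSf "${TOKEN_PROXY_URL}")"

# Remove this runner's registration from GitHub.
cd "${RUNNER_DIR}"
./config.sh remove --token "${REMOVE_TOKEN}"
```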
Here's a gist I stole from a coworker that contains approximately the setup we want. It needs to be integrated into our runners and the current setup.sh script: https://gist.github.com/GMNGeoffrey/26d1546ccb5563d715fcaa47e826f36d
(Out of curiosity, what's the latency of a shutdown/startup? Would it be worth having a pool that doesn't suicide after every build for collaborators vs. one for the public?)
Probably, but we'd need to convince security. GitHub doesn't have very good tools for isolation between runners, so we're rolling our own. I'd rather start with all of them ephemeral and add back a pool of persistent runners later. I also think shared caches would be better than persistent runners: my thought was that postsubmit runners could have read/write access to the caches while presubmit runners are read-only.
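For the cache split, assuming the shared caches live in a GCS bucket (the bucket and service-account names below are hypothetical), the read/write vs. read-only asymmetry could just be IAM on the bucket:

```bash
# Postsubmit runners' service account can write cache artifacts;
# presubmit runners' service account can only read them.
# Bucket and service-account names are hypothetical.
gsutil iam ch \
  serviceAccount:github-runner-postsubmit@PROJECT.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://our-build-cache
gsutil iam ch \
  serviceAccount:github-runner-presubmit@PROJECT.iam.gserviceaccount.com:roles/storage.objectViewer \
  gs://our-build-cache
```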
Another issue is that MIG autoscaling doesn't play very nicely with the way Actions jobs are triggered: it has occasionally been killing runners while they're executing a job. So far I've only seen this during the less CPU-intensive startup phase of a job. Given that, I'm thinking we might want to allow only scale-out, which means the runners have to be responsible for shutting themselves down.
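If we go scale-out-only, one way for a runner to shut itself down is to delete its own instance through the MIG, which also decrements the group's target size so the instance isn't just recreated. A sketch (the MIG name below is a placeholder):

```bash
#!/bin/bash
# self-delete.sh -- run after the ephemeral runner exits. Deleting the
# instance *through the MIG* shrinks the target size, so the group
# doesn't immediately recreate a replacement. MIG name is a placeholder.
set -euo pipefail

METADATA="http://metadata.google.internal/computeMetadata/v1/instance"
NAME="$(curl -sSf -H 'Metadata-Flavor: Google' "${METADATA}/name")"
ZONE="$(curl -sSf -H 'Metadata-Flavor: Google' "${METADATA}/zone" | awk -F/ '{print $NF}')"

gcloud compute instance-groups managed delete-instances github-runner-presubmit \
  --instances="${NAME}" \
  --zone="${ZONE}"
```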
Thanks for filing this with the links! I'll start looking into implementing similar systemd services to handle this.
So can we close this issue since it's marked as Done?
There's actually an issue with the GPU runners: they keep shutting down as soon as they start. I need to investigate what's going on.
The issue is that the runner on these machines is out of date. They're running 2.293.0, and it's been more than 30 days since 2.294.0 was released, which means GitHub won't let them connect:
Aug 15 22:09:24 github-runner-presubmit-gpu-us-central1-j2ss start.sh[1511]: An error occurred: Runner version v2.293.0 is deprecated and cannot receive messages.
Aug 15 22:09:24 github-runner-presubmit-gpu-us-central1-j2ss systemd[1]: github-actions-runner-start.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
So that's annoying... The runner service fails, which triggers shutdown, which causes the MIG to replace the instance, and we boot-loop. Even if we remove an instance from the MIG, the runner service will still run on startup, fail, and cause a shutdown. I was only able to get in to see the logs by removing an instance from the instance group and editing the startup script to disable the GitHub Actions runner service. We're going to need a better way to monitor runner versions. I'm subscribed to release notifications on the GitHub Actions runner repo, but I missed the email. Having the configurations for our images checked in, rather than manually configured, would also help with this...
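In the meantime, a cheap guard would be a scheduled check that compares the runner version baked into our image against the latest release. A sketch (where we record the installed version is a placeholder):

```bash
#!/bin/bash
# check-runner-version.sh -- warn when the runner version in our image falls
# behind the latest release, since GitHub stops accepting runners that are
# more than 30 days out of date. The installed-version path is a placeholder.
set -euo pipefail

INSTALLED="$(cat /runner/.runner_version)"   # placeholder: wherever we record it
LATEST="$(curl -sSf https://api.github.com/repos/actions/runner/releases/latest \
  | jq -r '.tag_name' | sed 's/^v//')"

if [[ "${INSTALLED}" != "${LATEST}" ]]; then
  echo "Runner image has ${INSTALLED}, latest release is ${LATEST}; rebuild the image." >&2
  exit 1
fi
```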
For now, I'll just update the runner and create a new image.