fix: clean up ssh agent after ourselves
Context
The ssh connection to a discovery host leaves a dangling ssh agent. If the workflow runner itself runs on a persistent runner, that process might just accumulate.
Solution
Clean up after ourselves. Unfortunately, only js actions allow to trigger a post action which is why a slight refactor needed to be done to wrap the bash action inside a js action.
Is this something you have observed in practice?
Is this something you have observed in practice?
Yes, this had reportedly happend in some scenarios. Although, now, on second thought, I doubt it since the discovery is the only persistent runner and it regularly would not set up a ssh connection to itself.
Workers are regularly not persistent so they shouldn't suffer from dangling ssh agents.
Only in the scenario that a worker is serviced by a hosted runner can those be left dangling.
@michalrus do you happen to have more context on what we had observed, in this context?
If the workflow runner itself runs on a persistent runner, that process might just accumulate.
I don’t think this is the case, since if an ssh-agent stayed running from a previous job, it will be reused
@michalrus do you happen to have more context on what we had observed, in this context?
Yes, so we’re using NixOS on the discovery host, and the standard GitHub Runner service (services.github-runners.runnerN = { … } – defined here).
We’re not setting the .ephemeral option, which causes systemd service restart (and clean-up of not only processes, but also of the work dir), because several other runners are faster without cleaning their work dir.
So the ssh-agent process that this action (used to) launch in the background stayed there between runs, and for whatever reason didn’t react to SIGTERM from systemd when I requested the restart of the systemd service, and I had to kill it manually (or wait 90 s for SIGKILL – this value could also be tweaked).
But now that I think of it, I could just set .ephemeral = true in discovery runners?