std-action icon indicating copy to clipboard operation
std-action copied to clipboard

fix: clean up ssh agent after ourselves

Open blaggacao opened this issue 2 years ago • 3 comments

Context

The ssh connection to a discovery host leaves a dangling ssh agent. If the workflow runner itself runs on a persistent runner, that process might just accumulate.

Solution

Clean up after ourselves. Unfortunately, only js actions allow to trigger a post action which is why a slight refactor needed to be done to wrap the bash action inside a js action.

blaggacao avatar Aug 31 '23 10:08 blaggacao

Is this something you have observed in practice?

nrdxp avatar Aug 31 '23 16:08 nrdxp

Is this something you have observed in practice?

Yes, this had reportedly happend in some scenarios. Although, now, on second thought, I doubt it since the discovery is the only persistent runner and it regularly would not set up a ssh connection to itself.

Workers are regularly not persistent so they shouldn't suffer from dangling ssh agents.


Only in the scenario that a worker is serviced by a hosted runner can those be left dangling.

@michalrus do you happen to have more context on what we had observed, in this context?

blaggacao avatar Aug 31 '23 18:08 blaggacao

If the workflow runner itself runs on a persistent runner, that process might just accumulate.

I don’t think this is the case, since if an ssh-agent stayed running from a previous job, it will be reused

@michalrus do you happen to have more context on what we had observed, in this context?

Yes, so we’re using NixOS on the discovery host, and the standard GitHub Runner service (services.github-runners.runnerN = { … }defined here).

We’re not setting the .ephemeral option, which causes systemd service restart (and clean-up of not only processes, but also of the work dir), because several other runners are faster without cleaning their work dir.

So the ssh-agent process that this action (used to) launch in the background stayed there between runs, and for whatever reason didn’t react to SIGTERM from systemd when I requested the restart of the systemd service, and I had to kill it manually (or wait 90 s for SIGKILL – this value could also be tweaked).

But now that I think of it, I could just set .ephemeral = true in discovery runners?

michalrus avatar Sep 01 '23 11:09 michalrus