actions-runner-controller
Dealing with jobs failing with "lost communication with the server" errors
I have not yet encountered this myself, but I believe any job on self-hosted GitHub runners can hit this error due to a race condition between the runner agent and GitHub.
This isn't specific to actions-runner-controller, and I believe it's an upstream issue, but I'd still like to gather voices and knowledge around it and hopefully find a workaround.
Please see the related issues for more information.
- https://github.com/actions/runner/issues/510
- https://github.com/microsoft/azure-pipelines-agent/issues/2261
- https://github.com/apache/airflow/issues/14337
- https://github.com/actions/runner/issues/921#issuecomment-821118769
This issue is mainly to gather experiences from whoever has been affected by the error. I would appreciate it if you could share your stories, workarounds, fixes, etc. around this issue so that it can ideally be fixed upstream or in actions-runner-controller.
Verifying if you're affected by this problem
Note that the error can also happen when:
- The runner container got OOM-killed because your runner pod has insufficient resources. Set higher resource requests/limits (see the sketch below).
- The runner container got OOM-killed because your node has insufficient resources and your runner pod had low priority. Use a larger machine type for your nodes.
If you encounter the error even after tweaking your pod and node resources, it is likely due to the race between the runner agent and GitHub.
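For reference, a rough sketch of bumping the requests/limits on a RunnerDeployment. Field names assume the legacy `actions.summerwind.dev/v1alpha1` CRDs, and the repository and numbers are placeholders to adapt:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  replicas: 2
  template:
    spec:
      repository: your-org/your-repo   # placeholder
      # Resources for the runner container itself
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
      # Resources for the dind sidecar, if you run one
      dockerdContainerResources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```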
Information
- Even GitHub support seems to say that stopping the runner and using `--once` are the go-to solutions. But I believe both are subject to this race condition.
Possible workarounds
- Disabling ephemeral runners (#457) (i.e. removing the `--once` flag from `run.sh`) may "alleviate" this issue, but not completely. (See the sketch after this list.)
- Don't use ephemeral runners, and stop runners only within a maintenance window you've defined, while telling your colleagues not to run jobs during that window. (The downside of this approach is that you can't rolling-update runners outside of the maintenance window.)
- Restart the whole workflow run whenever any job in it fails. (Note that we can't retry an individual job on GitHub Actions today.)
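For the first workaround, a minimal sketch of a non-ephemeral setup, assuming the legacy `actions.summerwind.dev/v1alpha1` CRDs (the repository name is a placeholder):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: persistent-runnerdeploy
spec:
  template:
    spec:
      repository: your-org/your-repo   # placeholder
      # Keep the runner registered across jobs instead of exiting
      # after a single job (the --once behavior).
      ephemeral: false
```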
I'm still dealing with the credentials not being set when the runner "runs". But I could easily hack around the situation if I could set a liveness probe (exec) in my RunnerDeployment. There is both characteristic text in the log file and missing files in the `/runner` directory. Is that possible?
@jwalters-gpsw Hey! Still trying to understand your issue, but isn't that different from what we're talking about here? Are you seeing the `lost communication with the server` message?
Yes, the underlying issue is different. But if there is a way to use a liveness probe there might be a hacky solution to both problems by restarting the containers/pods.
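For illustration, the kind of check I have in mind would look like a standard Kubernetes exec probe along these lines, if the runner spec exposed one (it may not in your ARC version; the file paths and the grep pattern below are placeholders for the characteristic log text and missing files mentioned above):

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # Placeholder checks: a credentials file that should exist, and a
      # log line that should NOT appear in the runner diagnostics.
      - test -f /runner/.credentials && ! grep -q "lost communication with the server" /runner/_diag/*.log
  initialDelaySeconds: 60
  periodSeconds: 30
```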
Is this issue resolved now? We are still getting this for self-hosted runners on a Linux box.
@shreyasGit Hey. First of all, this can't be fixed 100% by ARC alone. For example, if you use EC2 spot instances for hosting self-hosted runners, it's unavoidable (as we can't block spot termination). ARC has addressed all the issues related to this on its side, so you'd better check your deployment first and consider whether it's supposed to work without the communication-lost error.
Also, fundamentally this can be considered an issue in GitHub Actions itself, as it doesn't have any facility to auto-restart jobs that disappeared prematurely. Would you mind considering submitting a feature request to GitHub too?
> Note that the error can also happen when: The runner container got OOM-killed because your runner pod has insufficient resources. Set higher resource requests/limits.
Are there ways that `actions-runner-controller` or `actions-runner` could more gracefully handle the OOM-killed case? Could we somehow report OOM kills to the GitHub UI? Could we run a separate OOM killer inside the workers to kill the workflow before it exceeds memory limits?
@DerekTBrown Hey. I think that's fundamentally impossible. Even if we were able to hook into the host's system log to know which process got OOM-killed, we have no easy way to correlate it with a process within a container in a pod. Even worse, GitHub Actions doesn't provide an API to "externally" set a workflow job status.
However, I thought you could still see the job time out after 10 minutes or so, and the job that was running when the runner disappeared (for whatever reason, like OOM) eventually gets marked as failed (although without any explicit error message). Would that be enough?
> This issue is mainly to gather experiences from whoever has been affected by the error. I would appreciate it if you could share your stories, workarounds, fixes, etc. around this issue so that it can ideally be fixed upstream or in actions-runner-controller.
Really appreciate this! :heart:
We currently see this occasionally - from our stats it's ~2.5% of builds of our main CI/CD pipeline. Our setup is that we're using `actions-runner-controller` within GKE, and the nodes that the runners get spun up on are Spot VMs. We do this because it's a massive cost saving - however, it inevitably means we sometimes bump into this error when a node that's hosting a runner gets taken away from us by Google.
Even outside of Spot VMs, there are all sorts of other imaginable reasons that are hard or impossible to mitigate - e.g. node upgrades, OOM kills and suchlike. The dream for us would be for jobs affected by this to automatically restart from the beginning, provisioning a fresh runner and going again.
@alyssa-glean For re-running, I'd recommend a workflow that runs on completion of the specific workflows you want (you can use a cloud-hosted or self-hosted runner for this) which re-triggers the job. We've been doing that at my place of work and it works great.
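The exact wiring isn't shown above, but a minimal sketch of that approach could use a `workflow_run` trigger plus the `gh` CLI; the watched workflow name, the retry guard on `run_attempt`, and the token permissions are assumptions to adapt to your setup:

```yaml
name: Retry failed runs
on:
  workflow_run:
    workflows: ["CI"]            # placeholder: the workflow(s) to watch
    types: [completed]
permissions:
  actions: write                 # needed to re-run workflows with GITHUB_TOKEN
jobs:
  rerun:
    # Only retry failures, and cap retries to avoid an infinite loop.
    if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.run_attempt < 3
    runs-on: ubuntu-latest       # a self-hosted runner works here too
    steps:
      - run: gh run rerun ${{ github.event.workflow_run.id }} --repo ${{ github.repository }}
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```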
As for `lost communication` errors, for us this was mainly caused by custom timeout logic within our GitHub Actions workflows, which attempted to disable job control and then use the process group ID to determine which process to kill when the timeout expired. We've since changed this to using the Linux-native `timeout` command with job control, and the problem has mostly resolved itself, except for some cases when GitHub itself is having issues.
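A sketch of what that looks like as a workflow step, assuming a long-running test script (the duration, grace period, and script name are placeholders):

```yaml
name: Tests with native timeout
on: [push]
jobs:
  test:
    runs-on: [self-hosted]
    steps:
      - name: Run tests with a hard timeout
        # coreutils timeout: send SIGTERM after 30 minutes so the test
        # harness can clean up, then SIGKILL 60 seconds later if needed.
        run: timeout --signal=TERM --kill-after=60s 30m ./run-tests.sh
```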
As for `runner shutdown` errors, this has been fully mitigated for us by the graceful termination suggestions here. This has been confirmed by both the logs on the runner (which we export to ensure we still have them after the pod dies, by building a custom image with fluentd in it that pipes the runner logs, runner worker logs, and the actions runner daemon to stdout) and Prometheus metrics, which we send to Datadog.
What didn't work was setting the pod label `cluster-autoscaler.kubernetes.io/safe-to-evict=false`; it seemed to have no effect (at least on EKS).
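For anyone trying to reproduce the graceful-termination setup, the knobs involved are roughly the following; the `RUNNER_GRACEFUL_STOP_TIMEOUT` variable and its placement are as I recall from the ARC graceful-termination docs, so double-check them against your version (and note that `cluster-autoscaler.kubernetes.io/safe-to-evict` is documented by cluster-autoscaler as a pod annotation):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: graceful-runnerdeploy
spec:
  template:
    spec:
      repository: your-org/your-repo          # placeholder
      # Give the pod enough time to let the runner finish or deregister
      # before Kubernetes sends SIGKILL.
      terminationGracePeriodSeconds: 110
      env:
        # Assumed name from the ARC graceful-termination docs; verify it.
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "90"
```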
As for other general termination behavior, we've noticed that our longer-running jobs that utilize docker-in-docker (dind) as a sidecar do not terminate gracefully. The system logs and metrics do note that a termination signal is sent, and the runner pod successfully waits. However, the default behavior of the container runtime/Kubernetes is to send the termination signal to the main process of all containers in a pod, including docker-in-docker. This kills the docker daemon and causes the tests to fail when the autoscaler decides that the pod should be terminated (which we haven't really figured out why it's happening in the first place). For us this equates to failures on ~13% of our job runs for these tests. It'd be nice for the sidecar to only get terminated if the runner itself has also exited, similar to the idea here. For now, my best idea (without adding more custom stuff to the image) would be to include dind in the runner container, rather than as a sidecar.
EDIT: The dind container DOES have that termination mechanism, but the docs don't suggest how to set it properly. I peeked in the CRD and found the right way to set it (`dockerEnv`).
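For context, `dockerEnv` is the field on the runner spec that injects environment variables into the dind sidecar; the variable name below is only a placeholder for whatever wait/termination setting your dind image's entrypoint actually reads:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: dind-runnerdeploy
spec:
  template:
    spec:
      repository: your-org/your-repo             # placeholder
      dockerEnv:
        # Hypothetical placeholder name: substitute the real variable your
        # dind image's entrypoint uses for its termination/wait behavior.
        - name: EXAMPLE_DIND_TERMINATION_SETTING
          value: "true"
```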