Runner stops reporting in the middle of jobs

Open grant-mccarriagher opened this issue 2 years ago • 14 comments

Describe the bug We're seeing an issue on our self-hosted runners where jobs stop having their logs reported in the middle of a step. When this happens, the job keeps running (hanging on the step it was in the middle of) and cannot be manually cancelled. It will eventually be cancelled automatically by the job's timeout-minutes (if it has one). After it's cancelled, the logs for the step the job timed out on (and any subsequent steps in the job) show "The log was not found. It may have been deleted based on retention settings." The runner service itself doesn't appear to be crashing or restarting when this happens.

I have noticed some "Caught exception during append web console line to websocket, let's fallback to sending via non-websocket call (total calls: 49, failed calls: 1, websocket state: Open)." errors in the _diag files for the runs, but I'm unsure whether this is normal. You can see them in the log files I uploaded below.
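
For anyone checking their own runners for the same symptom, the errors can be found in the diagnostic logs with a quick grep (the _diag path here assumes the default runner directory layout):

grep -rn "Caught exception during append web console line" _diag/*.log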

To Reproduce Unfortunately I do not have good reproduction steps here. It seems to randomly happen on any step in any of the jobs that are sent to the runners on this machine.

Expected behavior Jobs either pass or fail without hanging, and with all of their logs intact.

Runner Version and Platform

Version of your runner: 2.296.2

OS of the machine running the runner?: Mac Studio (M1 Ultra), macOS 12.3.1

What's not working?

Please include error messages and screenshots.

Job Log Output

This is what a hung job looks like; the output of the passed steps is normal, so I'm not including it here. You'll notice this one hung on the clean-up step, but the step seems totally random and I've seen it hang on any of the other steps as well:

Screen Shot 2022-09-19 at 2 56 25 PM

Runner and Worker's Diagnostic Logs

The following is from the job in the screenshot above:

STDOUT/ERR from the relevant job/time period:

Current runner version: '2.296.2'
2022-09-19 05:01:48Z: Listening for Jobs
2022-09-19 17:29:56Z: Running job: Build Beta app
2022-09-19 18:34:43Z: Job Build Beta app completed with result: Canceled

_diag file: diag_file.txt

Please let me know if there's any other info we can provide.

grant-mccarriagher avatar Sep 19 '22 19:09 grant-mccarriagher

I'm facing a similar issue, but with self-hosted Linux (App Engine) runners. Since it's App Engine, I'm even willing to believe that App Engine could be part of the problem (it could just be taking down the instance). However, I think this clearly indicates there is also an issue on the central GHA server side. It's not clear why the central service does not correctly detect the loss of heartbeat from dead self-hosted runners and then retry.

Is there a milestone or rollout plan to improve job leasing? It seems to me that all job-distribution systems have to cope gracefully with disappearing workers, but GHA does not.

joshgc avatar Nov 18 '22 19:11 joshgc

We saw the same errors in this unresolved issue too; it's unclear, but it could be related:

https://github.com/github/c2c-actions-support/issues/1774#issuecomment-1353600502

pje avatar Dec 16 '22 19:12 pje

Hi @grant-mccarriagher ! Thank you for flagging this! Is it possible to share the entire zipped _diag folder to dig in further?

AvaStancu avatar Jan 11 '23 16:01 AvaStancu

@AvaStancu _diag.zip

We had a hanging build that timed out on this runner at 1/16/23 2:22 EST, for reference.

grant-mccarriagher avatar Jan 16 '23 20:01 grant-mccarriagher

@grant-mccarriagher can you share a URL to your workflow run (it's fine if the repo is private, as we only use the URL to confirm some basic timing info, e.g. start/finish time) and the job name of the hanging build?

TingluoHuang avatar Jan 17 '23 03:01 TingluoHuang

@TingluoHuang

Workflow: https://github.com/acv-auctions/iOS/actions/runs/3932991235/jobs/6726199665

The job that hung was “Build and upload master app for UAT”. It stopped on the “Complete job” step.

grant-mccarriagher avatar Jan 17 '23 04:01 grant-mccarriagher

Our team is experiencing a recurring problem with the Runner hanging during builds. We have 5 self-hosted Macs in operation, and the issue has occurred on each of them. Roughly 5 out of every 100 builds are affected.

The macOS version we are using is 12.3.1, and the Runner version is 2.303.0. I have attached the logs for one of the hanging Runners, but I don't see anything that could help pinpoint the cause of the hang.

If needed, I can provide more logs for the other affected builds.

Artemiy42 avatar Mar 27 '23 16:03 Artemiy42

While investigating the code, I found the area where the runner hangs: the ExecuteAsync method in ProcessInvoker.cs. Sometimes the runner gets stuck at the _proc.Start() call and nothing after it executes; _proc.Start() never returns control to ExecuteAsync. Does anyone have any idea why calling Process.Start() might hang?
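
One way to confirm where the Worker is actually blocked would be to sample the stuck process while a job is hanging (a rough sketch, assuming macOS and its built-in sample tool; the PID and output path below are placeholders):

# Find the PID of the hung Runner.Worker process
ps aux | grep [R]unner.Worker

# Sample the stuck process for ~10 seconds and write its call stacks to a file
# (replace <worker-pid> with the PID from the ps output above)
sample <worker-pid> 10 -file /tmp/runner-worker-sample.txt

If the hang really is inside _proc.Start(), the captured call stacks should show the thread parked in the process-spawn path rather than waiting on child output.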

Artemiy42 avatar Apr 04 '23 17:04 Artemiy42

This may not be related, but I noticed that I can get self-hosted runners to hang simply by doing this:

ssh randomhost "sleep 25m"
echo 'done' 

(and surprisingly, adding keepalives to the ssh connection helps this job end at the right time)
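
For reference, the keepalive change is roughly the following; the interval and count values are only illustrative:

# Send an ssh keepalive every 60s and give up after 5 missed replies
# (values are illustrative)
ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=5 randomhost "sleep 25m"
echo 'done'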

aranair avatar Apr 06 '23 18:04 aranair

Interesting to see the trend of this only affecting Macs here. I'm curious if this is macOS Gatekeeper coming into play, pausing the application as we attempt to launch new processes.

@grant-mccarriagher, @Artemiy42, do you have macOS Gatekeeper or other antivirus software enabled on these Macs? I'm wondering if they are flagging the processes and blocking the builds.

Would you be open to experimenting with disabling them on one machine to see if that resolves your issue?

thboop avatar Aug 31 '23 20:08 thboop

We do not have any third-party antivirus installed on our Mac hardware.

Gatekeeper is on by default in macOS, so I can disable it for now and see if that prevents any hanging builds.
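
For reference, a sketch of what disabling and re-checking Gatekeeper assessment looks like on macOS 12 (the standard spctl commands; re-enable later with --master-enable):

# Disable Gatekeeper's assessment (requires admin rights)
sudo spctl --master-disable

# Verify the current state (prints "assessments disabled" when off)
spctl --status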

grant-mccarriagher avatar Sep 01 '23 17:09 grant-mccarriagher

@thboop

Unfortunately this didn't seem to work, and we've had more hanging builds since then.

Here are some extra logs/info from a job that was hanging at ~10:35 local time on a specific runner:

From _diag/blocks/*:

2023-09-07T14:37:25.3034000Z ##[debug]..Evaluating success:
2023-09-07T14:37:25.3037350Z ##[debug]..=> true
2023-09-07T14:37:25.3038320Z ##[debug]..Evaluating NotEqual:
2023-09-07T14:37:25.3042780Z ##[debug]....Evaluating Index:
2023-09-07T14:37:25.3043250Z ##[debug]......Evaluating Index:
2023-09-07T14:37:25.3043680Z ##[debug]........Evaluating steps:
2023-09-07T14:37:25.3044370Z ##[debug]........=> Object
2023-09-07T14:37:25.3045630Z ##[debug]........Evaluating String:
2023-09-07T14:37:25.3046160Z ##[debug]........=> 'app_exists_step'
2023-09-07T14:37:25.3047320Z ##[debug]......=> Object
2023-09-07T14:37:25.3047750Z ##[debug]......Evaluating String:
2023-09-07T14:37:25.3048290Z ##[debug]......=> 'outcome'
2023-09-07T14:37:25.3050740Z ##[debug]....=> 'failu

Notice that the output here seems to be cut off at the end.

From stdout/err, the last output is:

2023-09-07 14:35:59Z: Running job: Release enterprise app

There's no crash or startup text after that, suggesting the runner process hasn't crashed at all.

The most recently updated _diag/Worker***.log file ends with:

[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper] Starting process:
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   File name: '/Users/jenkins/builds-runner-1/externals/node16/bin/node'
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   Arguments: '"/Users/jenkins/builds-runner-1/_work/_actions/actions/download-artifact/v3.0.2/dist/index.js"'
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   Working directory: '/Users/jenkins/builds-runner-1/_work/inspection-app/inspection-app'
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   Require exit code zero: 'False'
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   Encoding web name:  ; code page: ''
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   Force kill process on cancellation: 'False'
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   Redirected STDIN: 'False'
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   Persist current code page: 'False'
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   Keep redirected STDIN open: 'False'
[2023-09-07 14:37:25Z INFO ProcessInvokerWrapper]   High priority process: 'False'

Running ps aux and filtering for the specific runner that is hanging gives us:

jenkins          42061   0.0  0.2 39037516 224788   ??  S     1:01AM   0:23.90 /Users/jenkins/builds-runner-1/bin/Runner.Listener run --startuptype service
jenkins          39962   0.0  0.0 408656640   2368   ??  S     1:01AM   0:00.00 /bin/bash /Users/jenkins/builds-runner-1/runsvc.sh
jenkins           9817   0.0  0.1 37961452 184220   ??  S    10:35AM   0:06.57 /Users/jenkins/builds-runner-1/bin.2.308.0/Runner.Worker spawnclient 108 113

The times on those suggest that nothing has crashed or restarted. The machine itself restarts every day at 1am, and I assume that last process there is launched per job (or maybe per step).

In the meantime, I'm going to try completely deleting our runner folders and reinstalling each runner, just in case something got screwed up during an upgrade, but hopefully this info helps diagnose things further.

grant-mccarriagher avatar Sep 07 '23 15:09 grant-mccarriagher

Any update on this issue? @grant-mccarriagher did you manage to fix it?

ahmedihabb2 avatar Jun 24 '24 12:06 ahmedihabb2