
High rate of "lost runner" errors for web-platform-tests on macOS 13

Open jgraham opened this issue 1 year ago • 17 comments

Description

Since approximately May 16th, we've been experiencing a high failure rate for web-platform-tests jobs running on macOS 13. This appears to be an infrastructure issue, as we get a message indicating that the agent stopped responding. This affects some, but not all, jobs, and the failures appear to be random within the set of jobs running similar workloads (chunks of the testsuite) on macOS. They don't appear to be tied to a specific part of the workload (e.g. a specific testcase).

One of the first affected builds is: https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=100660. A recent one is https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=102828&view=logs&jobId=9e909769-fc48-58b9-7383-225ac465e77e

Manually rerunning the failed jobs does work (though some jobs require multiple reruns, since the problem can also happen during a rerun).

We've tried to resolve the problem in the following ways:

  • Enabling automatic retries in the pipeline configuration. Either we got the configuration wrong, or these jobs are not retried.
  • Making each job smaller (i.e. running fewer tests per job). This didn't have any impact.
  • Testing on macOS 12 rather than 13. The problems started shortly after an update, but failures are apparently still reproducible on the older OS release (and using the latest version is important for our use case).
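For reference, Azure Pipelines' built-in retry support is per-task (`retryCountOnTaskFailure`), and a task retry only fires when the task itself fails — which may be why agent-loss errors are never retried by it. A minimal sketch of that configuration (job and step names here are illustrative, not copied from the actual pipeline):

```yaml
jobs:
  - job: wpt_safari_chunk
    pool:
      vmImage: macOS-13
    steps:
      - script: ./wpt run safari   # illustrative test command
        displayName: Run wpt chunk
        # Retries this task up to 3 times when the task fails; it does
        # not cover the whole job being lost when the agent disconnects.
        retryCountOnTaskFailure: 3
```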

(cc @gsnedders who did most of the diagnosis work to date)

https://github.com/web-platform-tests/wpt/issues/40085 is the corresponding wpt repository issue

Platforms affected

  • [X] Azure DevOps
  • [ ] GitHub Actions - Standard Runners
  • [ ] GitHub Actions - Larger Runners

Runner images affected

  • [ ] Ubuntu 20.04
  • [ ] Ubuntu 22.04
  • [ ] macOS 11
  • [ ] macOS 12
  • [X] macOS 13
  • [ ] Windows Server 2019
  • [ ] Windows Server 2022

Image version and build link

https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=102828&view=logs&jobId=9e909769-fc48-58b9-7383-225ac465e77e

Is it a regression?

Intermittent, but seems to have regressed mid-May.

Expected behavior

Run completes successfully

Actual behavior

Intermittent failure of runs, and no retries. Historically the error was always "We stopped hearing from agent". Now there seems to be a mixture of error messages, including that one and "The hosted runner encountered an error while running your job. (Error Type: Disconnect)."

Repro steps

Failure is intermittent, but happens when we try to run a large number of tests in Safari on macOS using a WebDriver-backed testharness.
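For concreteness, the kind of invocation involved looks roughly like the following (the setup step is standard for WebDriver on macOS; the chunking flags mirror wpt's CLI, but the exact arguments are illustrative rather than copied from the failing pipeline):

```shell
# One-time setup: enable Safari's WebDriver endpoint (requires admin)
sudo safaridriver --enable

# Run one chunk of the testsuite against Safari via its WebDriver backend
./wpt run --channel preview --this-chunk 1 --total-chunks 20 safari
```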

jgraham avatar Jun 20 '23 14:06 jgraham

@jgraham, may I ask you to open an issue at http://support.github.com/?

Unfortunately, this issue tracker is for images, not for other aspects of GitHub Actions.

I can try to reproduce your issue since you've provided clear repro steps, but I cannot provide a fix.

ilia-shipitsin avatar Jun 20 '23 16:06 ilia-shipitsin

Sorry, "support.github.com" is the wrong link for ADO agents; I'll provide a link later.

ilia-shipitsin avatar Jun 20 '23 16:06 ilia-shipitsin

@ilia-shipitsin any update here? Is there a different repository where this issue should be filed?

jgraham avatar Jun 29 '23 19:06 jgraham

I've escalated the issue to the proper team; they are investigating.

ilia-shipitsin avatar Jun 29 '23 19:06 ilia-shipitsin

@jgraham, I see that the internal issue was marked as "resolved". Can you please try to enable macos-13 builds?

ilia-shipitsin avatar Jul 14 '23 10:07 ilia-shipitsin

> @jgraham , I see that internal issue was marked as "resolved". Can you please try to enable macos-13 builds ?

Things seem much better over the last few days. Thanks!

We're still seeing occasional failures, for example:

  • https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=104432&view=logs&j=42513d51-f539-5b97-080a-60c327907e2e&t=7eb19775-f19c-5a55-e3a5-2aa9e436c041
  • https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=104394&view=logs&j=42513d51-f539-5b97-080a-60c327907e2e&t=7eb19775-f19c-5a55-e3a5-2aa9e436c041

That said, I think we were always seeing some level of drops even on macOS-12 images, so this no longer appears to be a significant regression.

gsnedders avatar Jul 17 '23 14:07 gsnedders

The issue was identified at the hosting level; a fix is to be delivered around mid-August (the reason things are better right now is not very clear). I'm closing the issue for now. If the number of failures is still high around mid-August, feel free to reopen.

Thanks for bringing the issue to our attention.

ilia-shipitsin avatar Jul 18 '23 19:07 ilia-shipitsin

Let's keep it open for possible duplicates.

mikhailkoliada avatar Jul 22 '23 08:07 mikhailkoliada

Hi @ilia-shipitsin, we're facing the same issue as described, details here. As per your comment, this issue should have been fixed by mid-August, but I checked today and I still get the timeout error when I switch my build to use macOS 13. I am looking forward to the fix; when do we expect it to be out?

antonioalwan avatar Aug 28 '23 18:08 antonioalwan

@antonioalwan, it is really impossible to tell whether your issue is the same or not. Please provide details, and I would suggest opening a separate issue just to keep things clean.

ilia-shipitsin avatar Aug 29 '23 05:08 ilia-shipitsin

Sorry, I see @mikhailkoliada has already closed the separate issue. Let him provide feedback.

ilia-shipitsin avatar Aug 29 '23 05:08 ilia-shipitsin

> issue was identified on hosting level. fix is to be delivered around mid-august (reason for being better right now is not very clear).

We're still seeing this, even on the 20230821.3 agent image.

See, e.g., https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106811&view=results, https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106790&view=results, and https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106767&view=results

gsnedders avatar Aug 30 '23 14:08 gsnedders

👋 We anticipate finishing the work to resolve this issue by October 2023, and will comment on this thread once finished.

Steve-Glass avatar Aug 30 '23 14:08 Steve-Glass

Is there any update here? I believe we are seeing the same errors with macos-13-xlarge runners.

The hosted runner encountered an error while running your job. (Error Type: Disconnect).

Here is an example: https://github.com/Expensify/App/actions/runs/7032789006

AndrewGable avatar Nov 29 '23 18:11 AndrewGable

Hi @Steve-Glass,

We also run into https://github.com/actions/runner-images/issues/7754#issuecomment-1699344713, which degrades our testing infrastructure. Would it be possible to:

a) confirm that the errors we are running into are caused by the linked issue above? (logs: https://github.com/microsoft/playwright/issues/28187)
b) let us know if there is some kind of workaround available, other than self-hosted macOS runners?
c) rule out that it's caused by a memory leak on our side?
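On (c): one way to gather evidence against a memory leak on the job side is to log the test process's peak resident memory at the end of each run and compare it across passing and disconnecting jobs. A minimal stdlib-only sketch (the helper name is ours, not part of any runner or Playwright API):

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MiB.

    ru_maxrss is reported in bytes on macOS but in kilobytes on Linux,
    so normalise based on the platform before converting to MiB.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Print at job teardown so the figure appears in the CI log
    # even when the run is later reported as disconnected.
    print(f"peak RSS: {peak_rss_mb():.1f} MiB")
```

If peak memory stays flat while disconnects still occur, that points away from a leak in the test workload and back toward the hosting layer.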

Thank you so much!

mxschmitt avatar Dec 06 '23 17:12 mxschmitt

We are getting this error now:

Received request to deprovision: The request was cancelled by the remote provider.

pafdad avatar Dec 21 '23 13:12 pafdad

Any updates?

linliu-code avatar Feb 12 '24 06:02 linliu-code

We have finished deploying a patch to the Mac hosted runners that addresses the cause of the disconnects.

EricHorton avatar May 20 '24 19:05 EricHorton

@EricHorton amazing news!

mxschmitt avatar May 20 '24 20:05 mxschmitt

@EricHorton perfect! Closing this issue now, as it is too generic; let's move to something more case-by-case in new tickets (if any).

mikhailkoliada avatar May 20 '24 21:05 mikhailkoliada