runner-images
High rate of "lost runner" errors for web-platform-tests on macOS 13
Description
Since approximately May 16th, we've been experiencing a high failure rate for web-platform-tests jobs running on macOS 13. This appears to be an infrastructure issue, as we get a message indicating that the agent stopped responding. It affects some, but not all, jobs, and appears to be random within the set of jobs running similar workloads (chunks of the test suite) on macOS. It doesn't appear to be tied to a specific part of the workload (e.g. a specific test case).
One of the first affected builds is: https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=100660. A recent one is https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=102828&view=logs&jobId=9e909769-fc48-58b9-7383-225ac465e77e
Manually rerunning the failed jobs does work (but some jobs require multiple reruns, since the problem can also happen during the rerun)
We've tried to resolve the problem in the following ways:
- Enabling automatic retries in the pipeline configuration. Either we got the configuration wrong, or these jobs are not retried (see the sketch after this list).
- Making each job smaller (i.e. running fewer tests per job). This didn't have any impact.
- Testing on macOS 12 rather than 13. The problems started shortly after an update, but are apparently still reproducible on the older OS release (and using the latest version is important for our use case).
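For reference, a minimal sketch of per-step retries in an Azure Pipelines YAML job, assuming the `retryCountOnTaskFailure` setting is the relevant mechanism (the script and step name below are placeholders, not our actual pipeline):

```yaml
# Minimal sketch of per-step retries in an Azure Pipelines YAML job.
# The script and display name are placeholders, not the actual wpt pipeline.
steps:
  - script: ./run-wpt-chunk.sh   # hypothetical wrapper for one test chunk
    displayName: Run web-platform-tests chunk
    retryCountOnTaskFailure: 3   # re-runs this step up to 3 extra times on failure
```

If that is indeed the setting in play, it only re-runs the failing step, and it may not apply when the agent itself stops responding, which would be consistent with these jobs never being retried.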
(cc @gsnedders who did most of the diagnosis work to date)
https://github.com/web-platform-tests/wpt/issues/40085 is the corresponding wpt repository issue
Platforms affected
- [X] Azure DevOps
- [ ] GitHub Actions - Standard Runners
- [ ] GitHub Actions - Larger Runners
Runner images affected
- [ ] Ubuntu 20.04
- [ ] Ubuntu 22.04
- [ ] macOS 11
- [ ] macOS 12
- [X] macOS 13
- [ ] Windows Server 2019
- [ ] Windows Server 2022
Image version and build link
https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=102828&view=logs&jobId=9e909769-fc48-58b9-7383-225ac465e77e
Is it a regression?
Intermittent, but seems to have regressed mid-May.
Expected behavior
Run completes successfully
Actual behavior
Intermittent failure of runs, and no retries. Historically the error was always "We stopped hearing from agent". Now there seems to be a mixture of error messages, including that one and "The hosted runner encountered an error while running your job. (Error Type: Disconnect)".
Repro steps
Failure is intermittent, but happens when we try to run a large number of tests in Safari on macOS using a WebDriver-backed testharness.
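For illustration, one such job boils down to something like the following step (a sketch only; the chunk numbers and flags are examples, not the exact invocation our pipeline uses):

```yaml
# Illustrative sketch of one chunked wpt job on a macOS 13 hosted agent.
# Chunk numbers and flags are examples, not the exact pipeline invocation.
steps:
  - script: ./wpt run --yes --no-manifest-update --this-chunk 5 --total-chunks 30 safari
    displayName: Run one chunk of web-platform-tests in Safari
```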
@jgraham, may I ask you to open an issue on http://support.github.com/?
Unfortunately, this issue tracker is for images, not for other aspects of GitHub Actions.
I can try to reproduce your issue since you've provided clear repro steps, but I cannot provide a fix.
Sorry, "support.github.com" is the wrong link for ADO agents; I'll provide a link later.
@ilia-shipitsin any update here? Is there a different repository where this issue should be filed?
I've escalated the issue to the proper team; they are investigating.
@jgraham, I see that the internal issue was marked as "resolved". Can you please try to enable macos-13 builds?
Things seem much better over the last few days. Thanks!
We're still seeing occasional failures, for example:
- https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=104432&view=logs&j=42513d51-f539-5b97-080a-60c327907e2e&t=7eb19775-f19c-5a55-e3a5-2aa9e436c041
- https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=104394&view=logs&j=42513d51-f539-5b97-080a-60c327907e2e&t=7eb19775-f19c-5a55-e3a5-2aa9e436c041
That said, I think we were always seeing some level of drops even on macos-12 images, so it no longer appears to have significantly regressed.
The issue was identified at the hosting level. A fix is to be delivered around mid-August (the reason things are better right now is not very clear). I'm closing the issue for now. If the number of failures is still high around mid-August, feel free to reopen.
Thanks for bringing the issue to our attention.
Let's keep it open for possible duplicates.
Hi @ilia-shipitsin, we're facing the same issue as described, details here. As per your comment this should be fixed by mid-August, but I checked today and I still get the timeout error when I switch my build to use macOS 13. I am looking forward to the fix; when do we expect it to be out?
@antonioalwan, it is really impossible to tell whether your issue is the same or not. Please provide details; I would suggest opening a separate issue just to keep things clean.
Sorry, I see @mikhailkoliada has already closed the separate issue. Let him provide feedback.
> The issue was identified at the hosting level. A fix is to be delivered around mid-August (the reason things are better right now is not very clear).
We're still seeing this, even on the 20230821.3 agent image.
See, e.g., https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106811&view=results, https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106790&view=results, and https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106767&view=results
👋 We anticipate finishing the work to resolve this issue by October 2023, and will comment on this thread once finished.
Is there any update here? I believe we are seeing the same errors with macos-13-xlarge runners.
> The hosted runner encountered an error while running your job. (Error Type: Disconnect).
Here is an example: https://github.com/Expensify/App/actions/runs/7032789006
Hi @Steve-Glass,
We also run into https://github.com/actions/runner-images/issues/7754#issuecomment-1699344713, which degrades our testing infrastructure. Would it be possible to:
a) confirm that the errors we are running into are caused by the linked issue above (logs: https://github.com/microsoft/playwright/issues/28187)?
b) let us know whether there is some kind of workaround available other than self-hosted macOS runners?
c) rule out that it is caused by a memory leak on our side?
Thank you so much!
We are getting this error now:
> Received request to deprovision: The request was cancelled by the remote provider.
Any updates?
We have finished deploying a patch to the Mac hosted runners that addresses the cause of the disconnects.
@EricHorton amazing news!
@EricHorton perfect! Closing this issue now as it is too generic; let's move to something more case-by-case in new tickets (if any).