arcade
arcade copied to clipboard
iOS - APP_LAUNCH_TIMEOUT / WORKLOAD TIMED OUT
Starting on 9/9 Rolling run 11121 there are 2x-4x failure in each run:
- Platform: net7.0-iOS-Release-arm64-Mono_Release-OSX.1200.Amd64.Iphone.Open
| Day | Run | Details |
|---|---|---|
| 9/12 | 12571 | 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT |
| 9/11 | 12413 (7.0) | 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT |
| 9/11 | 12028 | 1x APP_LAUNCH_TIMEOUT & 2x WORKLOAD TIMED OUT |
| 9/11 | 11973 (7.0) | 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT |
| 9/11 | 11877 | 1x APP_LAUNCH_TIMEOUT & 2x WORKLOAD TIMED OUT |
| 9/10 | 11727 | 3x APP_LAUNCH_TIMEOUT |
| 9/10 | 11508 | 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT |
| 9/9 | 11121 | 4x APP_LAUNCH_TIMEOUT |
| 9/9 | 11454 (7.0) | 4x APP_LAUNCH_TIMEOUT |
| 9/9 | 10781 (7.0) | 4x APP_LAUNCH_TIMEOUT |
Example of logs
APP_LAUNCH_TIMEOUT
[03:16:43] dbug: Xamarin.Hosting: Connected to DNCENGOSX117 (00008030-000804482E23402E) in 00:00:00.0018453
[03:16:47] dbug: 95976 Execution timed out after 60 seconds and the process was killed.
[03:16:47] dbug: Process mlaunch exited with 0
[03:16:48] fail: Application launch timed out before the test execution has started
[03:16:48] info: Uninstalling the application 'net.dot.System.Runtime.Numerics.Tests' from 'DNCENGOSX117'
[03:16:48] dbug:
[03:16:48] dbug: Running /private/tmp/helix/working/AC9B09AB/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22456.1/tools/net7.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode_13.4.app --uninstalldevbundleid net.dot.System.Runtime.Numerics.Tests --devname 00008030-000804482E23402E -v -v -v -v -v
[03:16:55] dbug: TCP tunnel still has not connected
[03:17:27] dbug: TCP connection hasn't started in time (00:02:00). Stopped listening.
[03:17:27] dbug: Failed to read TCP data: System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (61): Connection refused [::1]:54809
at Microsoft.DotNet.XHarness.iOS.Shared.Listeners.SimpleTcpListener.StartTcpTunnel() in /_/src/Microsoft.DotNet.XHarness.iOS.Shared/Listeners/SimpleTcpListener.cs:line 164
[03:17:27] dbug: Tests have finished executing
[03:19:49] dbug: Process 95987 didn't exit within 00:03:00 and will be killed
[03:19:49] dbug: Killing process tree of 95987...
[03:19:49] dbug: Pids to kill: 95987
[03:19:49] dbug: Running lldb diagnostics for pid 95987
[03:19:49] dbug: Printing backtrace for pid=95987
[03:19:49] dbug: (lldb) command source -s 0 '/tmp/helix/working/AC9B09AB/t/tmpvvopx2.tmp'
[03:19:49] dbug: Executing commands in '/tmp/helix/working/AC9B09AB/t/tmpvvopx2.tmp'.
[03:19:49] dbug: (lldb) process attach --pid 95987
[03:20:09] dbug: 95987 Execution timed out after 180 seconds and the process was killed.
[03:20:10] dbug: Failed to get ExitCode: System.InvalidOperationException: Process must exit before requested information can be determined.
at System.Diagnostics.Process.EnsureState(State state)
at Microsoft.DotNet.XHarness.Common.Execution.ProcessManager.RunAsyncInternal(Process process, ILog log, ILog stdout, ILog stderr, Action`2 kill, Func`3 getChildProcessIds, Nullable`1 timeout, Dictionary`2 environmentVariables, Nullable`1 cancellationToken, Nullable`1 diagnostics) in /_/src/Microsoft.DotNet.XHarness.Common/Execution/ProcessManager.cs:line 357
[03:20:10] fail: Failed to uninstall the app bundle! Check logs for more details!
[03:20:10] dbug: Saving diagnostics data to '/tmp/helix/working/AC9B09AB/w/B92F09FA/e/diagnostics.json'
XHarness exit code: 90 (APP_LAUNCH_TIMEOUT)
WORKLOAD TIMED OUT
--------------------------------------------------------------------------------------
MSBuild version 17.4.0-preview-22368-02+c8492483a for .NET
/private/tmp/helix/working/AC9B09AB/p/dotnet-cli/sdk/7.0.100-preview.7.22377.5/MSBuild.dll -maxcpucount -verbosity:m /bl:/private/tmp/helix/working/AC9B09AB/w/A6F108BA/e/xharness-output/AOTBuild.binlog /p:XHARNESS_EXECUTION_DIR=/private/tmp/helix/working/AC9B09AB/w/A6F108BA/e /p:RunAOTCompilation=true /p:TargetOS=iOS /p:TargetArchitecture=arm64 /p:MonoForceInterpreter= /p:MonoEnableLLVM=true /p:DevTeamProvisioning=- /p:UsePortableRuntimePack=true /p:Configuration=Release publish/ProxyProjectForAOTOnHelix.proj
** Building a proxy for the original test project, to AOT on helix. In order to do that, this recreates the original inputs for the *iOS/tvOS* part of the build. See /private/tmp/helix/working/AC9B09AB/w/A6F108BA/e/publish/ProxyProjectForAOTOnHelix.proj, and /private/tmp/helix/working/AC9B09AB/w/A6F108BA/e/publish/ProxyProjectForAOTOnHelix.props. **
ERROR: WORKLOAD TIMED OUT - Killing user command..
./xharness-runner.apple.sh: line 173: 5367 Terminated: 15 source command.sh
Removing empty log files:
ls: /tmp/helix/working/AC9B09AB/w/A6F108BA/uploads/xunit-*.xml: No such file or directory
+ exit_code=143
+ '[' -f /tmp/helix/working/AC9B09AB/w/A6F108BA/e/.retry ']'
+ '[' -f /tmp/helix/working/AC9B09AB/w/A6F108BA/e/.reboot ']'
+ exit 143
+ export _commandExitCode=143
+ _commandExitCode=143
+ /usr/local/bin/python3 -u /tmp/helix/working/AC9B09AB/w/A6F108BA/u/xharness-event-processor.py
WARNING: Diagnostics file not found at `/tmp/helix/working/AC9B09AB/w/A6F108BA/e/diagnostics.json`
+ /usr/local/bin/python3 /tmp/helix/working/AC9B09AB/p/reporter/run.py https://dev.azure.com/dnceng-public/ public 247120 eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Im9PdmN6NU1fN3AtSGpJS2xGWHo5M3VfVjBabyJ9.eyJuYW1laWQiOiJjNzczZjJjMi01MTIwLTQyMDctYWZlMi1hZmFmMzVhOGJjMGEiLCJzY3AiOiJhcHBfdG9rZW4iLCJhdWkiOiIyMGQ4N2I4MC1lMGVjLTRkYzgtOTFmMi00M2I0MmI4ODU0NTkiLCJzaWQiOiIxZjY5ZTIwZS1iMmRlLTQ2NGQtOGUzNi0yOWQ3ZjY4NzI2MzAiLCJCdWlsZElkIjoiY2JiMTgyNjEtYzQ4Zi00YWJiLTg2NTEtOGNkY2I1NDc0NjQ5OzEyNTcxIiwicHBpZCI6InZzdGZzOi8vL0J1aWxkL0J1aWxkLzEyNTcxIiwib3JjaGlkIjoiMzMwMWIyZGUtOTc1NS00MmZmLTg1NzEtM2M3NGJlNzAwNzFkLmJ1aWxkX2lvc19hcm02NF9yZWxlYXNlX2FsbHN1YnNldHNfbW9uby5fX2RlZmF1bHQiLCJyZXBvSWRzIjoiIiwiaXNzIjoiYXBwLnZzdG9rZW4udmlzdWFsc3R1ZGlvLmNvbSIsImF1ZCI6ImFwcC52c3Rva2VuLnZpc3VhbHN0dWRpby5jb218dnNvOjZmY2M5MmU1LTczYTctNGY4OC04ZDEzLWQ5MDQ1YjQ1ZmIyNyIsIm5iZiI6MTY2Mjk3MjcwNSwiZXhwIjoxNjYyOTg0NzA1fQ.Qp3s3PvwKjcJXb6Rad_3Zfh3ofZX7yHtqiefyYK_10hDxLzOELvqNHnNY906FwNIuT1XhyYj06nzfVKlrnvcxhw5wMh5wB87rXIQwj2bRY75oLwTiUqDD440lJI-J52tGYiDQuoVQfA-GwCshaRfCTTv6Q6SV0nv_X5IBDSpesxICqywV2JASrEIqB96XfH1s542NXYezR4RPwwhIDR2oHgb4KBcqHnlJ1KBbplkZjOGlIgiG-OQi37qnn5RLiC1uKgTV3LFaKy9TzDLdQS9b_HohYwA5XFEzJL4Ey7Lm_LBjO1g_r_meI2exhIKvBFJ89hHvf_5SxZi5jac4G3PCg
2022-09-12T10:24:50.253Z INFO run.py run(48) main Beginning reading of test results.
2022-09-12T10:24:50.255Z INFO run.py __init__(42) read_results Searching '/private/tmp/helix/working/AC9B09AB/w/A6F108BA/e' for test results files
2022-09-12T10:24:50.262Z INFO run.py __init__(42) read_results Searching '/tmp/helix/working/AC9B09AB/w/A6F108BA/uploads' for test results files
2022-09-12T10:24:50.262Z WARNING run.py __init__(55) read_results No results file found in any of the following formats: xunit, junit, trx
2022-09-12T10:24:50.263Z INFO run.py packing_test_reporter(30) report_results Packing 0 test reports to '/tmp/helix/working/AC9B09AB/w/A6F108BA/e/__test_report.json'
2022-09-12T10:24:50.263Z INFO run.py packing_test_reporter(33) report_results Packed 1414 bytes
+ exit 143
['System.Runtime.Tests' END OF WORK ITEM LOG: Command exited with 143]
Last week, we started seeing an uptick in problems with TCP connections between the device and XHarness. There were no infrastructure changes so we're weighing a possibility of some networking regression in runtime.
cc @steveisok
@premun Before we fish in the runtime, can we fully reboot the devices (turn them off and back on) to see if that fixes the problem?
@dotnet/dnceng we see issues with the 12.00 iPhone queue. Can we:
- Reboot the machines (ideally via Helix work items that call this)
- Ask DDFUN to reboot the phones themselves?
- Re-run the failing stage in this build to verify it got better: https://dev.azure.com/dnceng-public/public/_build/results?buildId=12516&view=results
This is probably related to this: https://github.com/dotnet/arcade/issues/10827
@dotnet/dnceng we see issues with the 12.00 iPhone queue. Can we:
- Reboot the machines (ideally via Helix work items that call this)
- Ask DDFUN to reboot the phones themselves?
- Re-run the failing stage in this build to verify it got better: https://dev.azure.com/dnceng-public/public/_build/results?buildId=12516&view=results
I've seen it on all the iOS and tvOS queues (OSX.1015 and OSX.1200). Can we address them all?
Let's try on the 50 iPhones as I have a 100% repro in the build now it seems and then do the other devices. It's a substantial amount of devices. Also it's much faster to reboot iPhones for DDFUN than AppleTVs, I think?
As per @premun 's instructions, I restarted all the machines in the osx.1200.amd64.iphone.open queue through a Helix job and re-ran the failing stage in the linked build (https://dev.azure.com/dnceng-public/public/_build/results?buildId=12516&view=results). The stage is now green, so the restart seemed to have helped
@oleksandr-didyk can you do the same for the appletv queues too?
@steveisok btw we could start running the full BCL on the iPhone 12.00 queue as we have ~50 devices there
Of course, will do and update the status here. Is there some pipeline I could check if the issue is fixed?
@oleksandr-didyk you can use any dotnet/xharness build and re-run the AppleTV E2E test stage. E.g. this one: https://dev.azure.com/dnceng-public/public/_build/results?buildId=12703&view=results
However, we don't have a 100% repro there
@premun restarted the machines in 1100 and 1015 queues, re-ran the pipeline stage you recommended on main -> tests passed
@premun restarted the machines in
1100and1015queues, re-ran the pipeline stage you recommended onmain-> tests passed
Were the devices rebooted as well?
@steveisok No, did not create any IcM for that, just the machines
Hit in a release/7.0 runtime PR in the runtime-extra-platforms pipeline: https://github.com/dotnet/runtime/pull/76052
You can expand the Tests -> Artifacts tabs for any of the two tvOS runs:
https://dev.azure.com/dnceng-public/public/_build/results?buildId=29620&view=ms.vss-test-web.build-test-results-tab
Then look for the log files that contain the name of the test class.
You won't find any [FAIL] results, yet the process seems to have troubles at the end, and that's probably why the whole run is reported as failed.
@dotnet/dnceng can you please open a ticket to reboot all Apple TV devices in the osx.1015.amd64.appletv.open and osx.1100.amd64.appletv.open queue?
created a ticket to reboot all devices in osx.1100.amd64.appletv.open queue.
DDFun ticket: https://portal.microsofticm.com/imp/v3/incidents/details/338271514/home
cc @premun, @steveisok
from the ticket:
The MLS team has rebooted the Apple TVs attached to these machines as you requested. Could you please check the TCP connection issue on your end and confirm the status please.
@steveisok , could you please check if the issue is solved for osx.1100.amd64.appletv.open queue
from the ticket:
The MLS team has rebooted the Apple TVs attached to these machines as you requested. Could you please check the TCP connection issue on your end and confirm the status please.
@steveisok , could you please check if the issue is solved for
osx.1100.amd64.appletv.openqueue
No, it still exists, unfortunately.
Per conversation with @akoeplinger, I am closing 11210 and 11224 in favor of this issue. Issue #11224 - machine DNCENGMAC144.local was taken offline to address this issue. Issue #11210 is also related to this issue.
@steveisok any progress on this? Should we keep this issue open or should we just open a new one whenever you are ready to engage with us again?
Closing in favor of https://github.com/dotnet/arcade/issues/11683. We are enabling retries as a potential way to mitigate this. We'll see.