arcade icon indicating copy to clipboard operation
arcade copied to clipboard

iOS - APP_LAUNCH_TIMEOUT / WORKLOAD TIMED OUT

Open karelz opened this issue 3 years ago • 14 comments

Starting on 9/9 Rolling run 11121 there are 2x-4x failure in each run:

  • Platform: net7.0-iOS-Release-arm64-Mono_Release-OSX.1200.Amd64.Iphone.Open
Day Run Details
9/12 12571 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT
9/11 12413 (7.0) 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT
9/11 12028 1x APP_LAUNCH_TIMEOUT & 2x WORKLOAD TIMED OUT
9/11 11973 (7.0) 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT
9/11 11877 1x APP_LAUNCH_TIMEOUT & 2x WORKLOAD TIMED OUT
9/10 11727 3x APP_LAUNCH_TIMEOUT
9/10 11508 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT
9/9 11121 4x APP_LAUNCH_TIMEOUT
9/9 11454 (7.0) 4x APP_LAUNCH_TIMEOUT
9/9 10781 (7.0) 4x APP_LAUNCH_TIMEOUT

Example of logs

APP_LAUNCH_TIMEOUT

Console

[03:16:43] dbug: Xamarin.Hosting: Connected to DNCENGOSX117 (00008030-000804482E23402E) in 00:00:00.0018453
[03:16:47] dbug: 95976 Execution timed out after 60 seconds and the process was killed.
[03:16:47] dbug: Process mlaunch exited with 0
[03:16:48] fail: Application launch timed out before the test execution has started
[03:16:48] info: Uninstalling the application 'net.dot.System.Runtime.Numerics.Tests' from 'DNCENGOSX117'
[03:16:48] dbug: 
[03:16:48] dbug: Running /private/tmp/helix/working/AC9B09AB/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22456.1/tools/net7.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode_13.4.app --uninstalldevbundleid net.dot.System.Runtime.Numerics.Tests --devname 00008030-000804482E23402E -v -v -v -v -v
[03:16:55] dbug: TCP tunnel still has not connected
[03:17:27] dbug: TCP connection hasn't started in time (00:02:00). Stopped listening.
[03:17:27] dbug: Failed to read TCP data: System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (61): Connection refused [::1]:54809
                    at Microsoft.DotNet.XHarness.iOS.Shared.Listeners.SimpleTcpListener.StartTcpTunnel() in /_/src/Microsoft.DotNet.XHarness.iOS.Shared/Listeners/SimpleTcpListener.cs:line 164
[03:17:27] dbug: Tests have finished executing
[03:19:49] dbug: Process 95987 didn't exit within 00:03:00 and will be killed
[03:19:49] dbug: Killing process tree of 95987...
[03:19:49] dbug: Pids to kill: 95987
[03:19:49] dbug: Running lldb diagnostics for pid 95987
[03:19:49] dbug: Printing backtrace for pid=95987
[03:19:49] dbug: (lldb) command source -s 0 '/tmp/helix/working/AC9B09AB/t/tmpvvopx2.tmp'
[03:19:49] dbug: Executing commands in '/tmp/helix/working/AC9B09AB/t/tmpvvopx2.tmp'.
[03:19:49] dbug: (lldb) process attach --pid 95987
[03:20:09] dbug: 95987 Execution timed out after 180 seconds and the process was killed.
[03:20:10] dbug: Failed to get ExitCode: System.InvalidOperationException: Process must exit before requested information can be determined.
                    at System.Diagnostics.Process.EnsureState(State state)
                    at Microsoft.DotNet.XHarness.Common.Execution.ProcessManager.RunAsyncInternal(Process process, ILog log, ILog stdout, ILog stderr, Action`2 kill, Func`3 getChildProcessIds, Nullable`1 timeout, Dictionary`2 environmentVariables, Nullable`1 cancellationToken, Nullable`1 diagnostics) in /_/src/Microsoft.DotNet.XHarness.Common/Execution/ProcessManager.cs:line 357
[03:20:10] fail: Failed to uninstall the app bundle! Check logs for more details!
[03:20:10] dbug: Saving diagnostics data to '/tmp/helix/working/AC9B09AB/w/B92F09FA/e/diagnostics.json'
XHarness exit code: 90 (APP_LAUNCH_TIMEOUT)

WORKLOAD TIMED OUT

Console

--------------------------------------------------------------------------------------
MSBuild version 17.4.0-preview-22368-02+c8492483a for .NET
/private/tmp/helix/working/AC9B09AB/p/dotnet-cli/sdk/7.0.100-preview.7.22377.5/MSBuild.dll -maxcpucount -verbosity:m /bl:/private/tmp/helix/working/AC9B09AB/w/A6F108BA/e/xharness-output/AOTBuild.binlog /p:XHARNESS_EXECUTION_DIR=/private/tmp/helix/working/AC9B09AB/w/A6F108BA/e /p:RunAOTCompilation=true /p:TargetOS=iOS /p:TargetArchitecture=arm64 /p:MonoForceInterpreter= /p:MonoEnableLLVM=true /p:DevTeamProvisioning=- /p:UsePortableRuntimePack=true /p:Configuration=Release publish/ProxyProjectForAOTOnHelix.proj
  ** Building a proxy for the original test project, to AOT on helix. In order to do that, this recreates the original inputs for the *iOS/tvOS* part of the build. See /private/tmp/helix/working/AC9B09AB/w/A6F108BA/e/publish/ProxyProjectForAOTOnHelix.proj, and /private/tmp/helix/working/AC9B09AB/w/A6F108BA/e/publish/ProxyProjectForAOTOnHelix.props. **
ERROR: WORKLOAD TIMED OUT - Killing user command..
./xharness-runner.apple.sh: line 173:  5367 Terminated: 15          source command.sh
Removing empty log files:
ls: /tmp/helix/working/AC9B09AB/w/A6F108BA/uploads/xunit-*.xml: No such file or directory
+ exit_code=143
+ '[' -f /tmp/helix/working/AC9B09AB/w/A6F108BA/e/.retry ']'
+ '[' -f /tmp/helix/working/AC9B09AB/w/A6F108BA/e/.reboot ']'
+ exit 143
+ export _commandExitCode=143
+ _commandExitCode=143
+ /usr/local/bin/python3 -u /tmp/helix/working/AC9B09AB/w/A6F108BA/u/xharness-event-processor.py
WARNING: Diagnostics file not found at `/tmp/helix/working/AC9B09AB/w/A6F108BA/e/diagnostics.json`
+ /usr/local/bin/python3 /tmp/helix/working/AC9B09AB/p/reporter/run.py https://dev.azure.com/dnceng-public/ public 247120 eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Im9PdmN6NU1fN3AtSGpJS2xGWHo5M3VfVjBabyJ9.eyJuYW1laWQiOiJjNzczZjJjMi01MTIwLTQyMDctYWZlMi1hZmFmMzVhOGJjMGEiLCJzY3AiOiJhcHBfdG9rZW4iLCJhdWkiOiIyMGQ4N2I4MC1lMGVjLTRkYzgtOTFmMi00M2I0MmI4ODU0NTkiLCJzaWQiOiIxZjY5ZTIwZS1iMmRlLTQ2NGQtOGUzNi0yOWQ3ZjY4NzI2MzAiLCJCdWlsZElkIjoiY2JiMTgyNjEtYzQ4Zi00YWJiLTg2NTEtOGNkY2I1NDc0NjQ5OzEyNTcxIiwicHBpZCI6InZzdGZzOi8vL0J1aWxkL0J1aWxkLzEyNTcxIiwib3JjaGlkIjoiMzMwMWIyZGUtOTc1NS00MmZmLTg1NzEtM2M3NGJlNzAwNzFkLmJ1aWxkX2lvc19hcm02NF9yZWxlYXNlX2FsbHN1YnNldHNfbW9uby5fX2RlZmF1bHQiLCJyZXBvSWRzIjoiIiwiaXNzIjoiYXBwLnZzdG9rZW4udmlzdWFsc3R1ZGlvLmNvbSIsImF1ZCI6ImFwcC52c3Rva2VuLnZpc3VhbHN0dWRpby5jb218dnNvOjZmY2M5MmU1LTczYTctNGY4OC04ZDEzLWQ5MDQ1YjQ1ZmIyNyIsIm5iZiI6MTY2Mjk3MjcwNSwiZXhwIjoxNjYyOTg0NzA1fQ.Qp3s3PvwKjcJXb6Rad_3Zfh3ofZX7yHtqiefyYK_10hDxLzOELvqNHnNY906FwNIuT1XhyYj06nzfVKlrnvcxhw5wMh5wB87rXIQwj2bRY75oLwTiUqDD440lJI-J52tGYiDQuoVQfA-GwCshaRfCTTv6Q6SV0nv_X5IBDSpesxICqywV2JASrEIqB96XfH1s542NXYezR4RPwwhIDR2oHgb4KBcqHnlJ1KBbplkZjOGlIgiG-OQi37qnn5RLiC1uKgTV3LFaKy9TzDLdQS9b_HohYwA5XFEzJL4Ey7Lm_LBjO1g_r_meI2exhIKvBFJ89hHvf_5SxZi5jac4G3PCg
2022-09-12T10:24:50.253Z	INFO   	run.py	run(48)	main	Beginning reading of test results.
2022-09-12T10:24:50.255Z	INFO   	run.py	__init__(42)	read_results	Searching '/private/tmp/helix/working/AC9B09AB/w/A6F108BA/e' for test results files
2022-09-12T10:24:50.262Z	INFO   	run.py	__init__(42)	read_results	Searching '/tmp/helix/working/AC9B09AB/w/A6F108BA/uploads' for test results files
2022-09-12T10:24:50.262Z	WARNING	run.py	__init__(55)	read_results	No results file found in any of the following formats: xunit, junit, trx
2022-09-12T10:24:50.263Z	INFO   	run.py	packing_test_reporter(30)	report_results	Packing 0 test reports to '/tmp/helix/working/AC9B09AB/w/A6F108BA/e/__test_report.json'
2022-09-12T10:24:50.263Z	INFO   	run.py	packing_test_reporter(33)	report_results	Packed 1414 bytes
+ exit 143
['System.Runtime.Tests' END OF WORK ITEM LOG: Command exited with 143]

karelz avatar Sep 13 '22 00:09 karelz

Last week, we started seeing an uptick in problems with TCP connections between the device and XHarness. There were no infrastructure changes so we're weighing a possibility of some networking regression in runtime.

cc @steveisok

premun avatar Sep 13 '22 13:09 premun

@premun Before we fish in the runtime, can we fully reboot the devices (turn them off and back on) to see if that fixes the problem?

steveisok avatar Sep 13 '22 13:09 steveisok

@dotnet/dnceng we see issues with the 12.00 iPhone queue. Can we:

  1. Reboot the machines (ideally via Helix work items that call this)
  2. Ask DDFUN to reboot the phones themselves?
  3. Re-run the failing stage in this build to verify it got better: https://dev.azure.com/dnceng-public/public/_build/results?buildId=12516&view=results

premun avatar Sep 13 '22 13:09 premun

This is probably related to this: https://github.com/dotnet/arcade/issues/10827

premun avatar Sep 13 '22 13:09 premun

@dotnet/dnceng we see issues with the 12.00 iPhone queue. Can we:

  1. Reboot the machines (ideally via Helix work items that call this)
  2. Ask DDFUN to reboot the phones themselves?
  3. Re-run the failing stage in this build to verify it got better: https://dev.azure.com/dnceng-public/public/_build/results?buildId=12516&view=results

I've seen it on all the iOS and tvOS queues (OSX.1015 and OSX.1200). Can we address them all?

steveisok avatar Sep 13 '22 13:09 steveisok

Let's try on the 50 iPhones as I have a 100% repro in the build now it seems and then do the other devices. It's a substantial amount of devices. Also it's much faster to reboot iPhones for DDFUN than AppleTVs, I think?

premun avatar Sep 13 '22 13:09 premun

As per @premun 's instructions, I restarted all the machines in the osx.1200.amd64.iphone.open queue through a Helix job and re-ran the failing stage in the linked build (https://dev.azure.com/dnceng-public/public/_build/results?buildId=12516&view=results). The stage is now green, so the restart seemed to have helped

oleksandr-didyk avatar Sep 14 '22 11:09 oleksandr-didyk

@oleksandr-didyk can you do the same for the appletv queues too?

premun avatar Sep 14 '22 11:09 premun

@steveisok btw we could start running the full BCL on the iPhone 12.00 queue as we have ~50 devices there

premun avatar Sep 14 '22 11:09 premun

Of course, will do and update the status here. Is there some pipeline I could check if the issue is fixed?

oleksandr-didyk avatar Sep 14 '22 11:09 oleksandr-didyk

@oleksandr-didyk you can use any dotnet/xharness build and re-run the AppleTV E2E test stage. E.g. this one: https://dev.azure.com/dnceng-public/public/_build/results?buildId=12703&view=results

However, we don't have a 100% repro there

premun avatar Sep 14 '22 11:09 premun

@premun restarted the machines in 1100 and 1015 queues, re-ran the pipeline stage you recommended on main -> tests passed

oleksandr-didyk avatar Sep 14 '22 13:09 oleksandr-didyk

@premun restarted the machines in 1100 and 1015 queues, re-ran the pipeline stage you recommended on main -> tests passed

Were the devices rebooted as well?

steveisok avatar Sep 19 '22 03:09 steveisok

@steveisok No, did not create any IcM for that, just the machines

oleksandr-didyk avatar Sep 19 '22 08:09 oleksandr-didyk

Hit in a release/7.0 runtime PR in the runtime-extra-platforms pipeline: https://github.com/dotnet/runtime/pull/76052

You can expand the Tests -> Artifacts tabs for any of the two tvOS runs: https://dev.azure.com/dnceng-public/public/_build/results?buildId=29620&view=ms.vss-test-web.build-test-results-tab Then look for the log files that contain the name of the test class. You won't find any [FAIL] results, yet the process seems to have troubles at the end, and that's probably why the whole run is reported as failed.

carlossanlop avatar Sep 28 '22 21:09 carlossanlop

@dotnet/dnceng can you please open a ticket to reboot all Apple TV devices in the osx.1015.amd64.appletv.open and osx.1100.amd64.appletv.open queue?

premun avatar Sep 29 '22 07:09 premun

created a ticket to reboot all devices in osx.1100.amd64.appletv.open queue. DDFun ticket: https://portal.microsofticm.com/imp/v3/incidents/details/338271514/home

cc @premun, @steveisok

andriipatsula avatar Sep 29 '22 08:09 andriipatsula

from the ticket:

The MLS team has rebooted the Apple TVs attached to these machines as you requested. Could you please check the TCP connection issue on your end and confirm the status please.

@steveisok , could you please check if the issue is solved for osx.1100.amd64.appletv.open queue

andriipatsula avatar Sep 30 '22 10:09 andriipatsula

from the ticket:

The MLS team has rebooted the Apple TVs attached to these machines as you requested. Could you please check the TCP connection issue on your end and confirm the status please.

@steveisok , could you please check if the issue is solved for osx.1100.amd64.appletv.open queue

No, it still exists, unfortunately.

steveisok avatar Oct 10 '22 15:10 steveisok

Per conversation with @akoeplinger, I am closing 11210 and 11224 in favor of this issue. Issue #11224 - machine DNCENGMAC144.local was taken offline to address this issue. Issue #11210 is also related to this issue.

ilyas1974 avatar Oct 27 '22 13:10 ilyas1974

@steveisok any progress on this? Should we keep this issue open or should we just open a new one whenever you are ready to engage with us again?

ilyas1974 avatar Nov 17 '22 15:11 ilyas1974

Closing in favor of https://github.com/dotnet/arcade/issues/11683. We are enabling retries as a potential way to mitigate this. We'll see.

steveisok avatar Nov 21 '22 19:11 steveisok