
[Infra] SampleAnalyzerIntegrationTest randomly timeouting in CI

Open JanKrivanek opened this issue 10 months ago • 12 comments

Context

System.TimeoutException : Test failed due to timeout: process 12042 is active for more than 30 sec.

   at Microsoft.Build.UnitTests.Shared.RunnerUtilities.RunProcessAndGetOutput(String process, String parameters, Boolean& successfulExit, Boolean shellExecute, ITestOutputHelper outputHelper) in /_/src/UnitTests.Shared/RunnerUtilities.cs:line 139
   at Microsoft.Build.UnitTests.Shared.RunnerUtilities.ExecBootstrapedMSBuild(String msbuildParameters, Boolean& successfulExit, Boolean shellExecute, ITestOutputHelper outputHelper) in /_/src/UnitTests.Shared/RunnerUtilities.cs:line 67
   at Microsoft.Build.BuildCheck.UnitTests.EndToEndTests.SampleAnalyzerIntegrationTest(Boolean buildInOutOfProcessNode, Boolean analysisRequested) in /home/vsts/work/1/s/src/BuildCheck.UnitTests/EndToEndTests.cs:line 122
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)

Sample run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=648080&view=ms.vss-test-web.build-test-results-tab&runId=16013174&resultId=100001&paneView=debug
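For context, the harness kills the child MSBuild process and fails the test when it outlives the limit. A minimal Python sketch of that pattern (illustrative only — the real implementation is the C# `RunnerUtilities.RunProcessAndGetOutput`; the function name and 30-second default here merely mirror the trace above):

```python
import subprocess
import sys

def run_process_and_get_output(args, timeout_sec=30):
    """Illustrative sketch (not the actual RunnerUtilities code): run a
    child process, capture its combined output, and fail on timeout."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    try:
        output, _ = proc.communicate(timeout=timeout_sec)
        return output, proc.returncode == 0
    except subprocess.TimeoutExpired:
        proc.kill()  # make sure the hung process does not outlive the test
        proc.wait()
        raise TimeoutError(
            f"Test failed due to timeout: process {proc.pid} "
            f"is active for more than {timeout_sec} sec.")

# Quick demonstration: a fast child process succeeds within the limit.
out, ok = run_process_and_get_output([sys.executable, "-c", "print('hi')"], 10)
```

A slow child (e.g. one sleeping past the limit) takes the `TimeoutExpired` path and surfaces as the `System.TimeoutException`-style failure seen in CI.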

JanKrivanek avatar Apr 18 '24 15:04 JanKrivanek

discussed offline: the execution time can be extended to handle it. Please track it for some time to see whether this approach solves the problem

YuliiaKovalova avatar Apr 19 '24 09:04 YuliiaKovalova

After observing 9 runs that failed with this test, the failure happened only in the following test cases. From the output logs attached inline, it looks like the build check analyzer hadn't finished within 30 seconds.

~~Checking the output log Microsoft.Build.BuildCheck.UnitTests_net8.0_x64.log of a fully passed test run, I found there were over 1000 warnings when the analysis was requested. But the warnings, except BC0101, couldn't be reproduced locally. This may explain why it looks like the build check analyzer hadn't finished within 30 seconds. No idea why the CI run has so many extra warnings.~~

GangWang01 avatar Apr 30 '24 09:04 GangWang01

@JanKrivanek , can this fix help with the test performance? https://github.com/dotnet/msbuild/pull/10084

YuliiaKovalova avatar Apr 30 '24 09:04 YuliiaKovalova

Only slightly - it slashes only about a third of the currently detected slowdown of the analyzers. And the detected slowdown was in the range of about 14% - so it's puzzling that the process runs for over 30 seconds. It would be interesting to see whether it fails at about the same or a significantly decreased rate if we increase the timeout to, say, 120 seconds. That might bisect between the slowdown and deadlock theories
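The bisection logic can be sanity-checked with quick arithmetic (numbers taken from this comment; treating 30 s as the baseline the test used to fit in is an assumption):

```python
# Sanity check of the "slowdown vs. deadlock" bisection using the numbers
# from this thread: ~14% measured analyzer slowdown, original 30 s timeout.
baseline_sec = 30          # original timeout the test (mostly) fit in
slowdown = 0.14            # measured analyzer overhead
expected_worst_case = baseline_sec * (1 + slowdown)

# A pure slowdown predicts roughly 34 s, far under the raised 120 s limit,
# so failures that persist at 120 s point to a hang, not slow analyzers.
print(f"{expected_worst_case:.1f}s")
```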

JanKrivanek avatar Apr 30 '24 09:04 JanKrivanek

Another occurrence (after the timeout increase): https://dev.azure.com/dnceng-public/public/_build/results?buildId=681320&view=ms.vss-test-web.build-test-results-tab&runId=16909114&resultId=100003&paneView=debug

JanKrivanek avatar May 20 '24 15:05 JanKrivanek

Might be a total red herring:

The timing-out cases seem to have TerminalLogger in use (notice the control sequences):

[33;1m   Microsoft.Build.BuildCheck.UnitTests: [Long Running Test] 'Microsoft.Build.BuildCheck.UnitTests.EndToEndTests.SampleAnalyzerIntegrationTest', Elapsed: 00:01:02
[mExecuting [/home/vsts/work/1/s/.dotnet/dotnet /home/vsts/work/1/s/artifacts/bin/bootstrap/net8.0/MSBuild/MSBuild.dll FooBar.csproj /m:1 -nr:False -restore -analyze]
==== OUTPUT ====
[31;1m    Microsoft.Build.BuildCheck.UnitTests.EndToEndTests.SampleAnalyzerIntegrationTest(buildInOutOfProcessNode: True, analysisRequested: True) [FAIL]
[m[37m      System.TimeoutException : Test failed due to timeout: process 11430 is active for more than 120 sec.

Sample full log: TIMEOUT-Microsoft.Build.BuildCheck.UnitTests_net8.0_x64.log

Whereas I do not see those in the individual test cases for the success cases - sample log: OK-Microsoft.Build.BuildCheck.UnitTests_net8.0_x64.log

Regardless of whether it is a red herring or not - it is strange that TerminalLogger is used in CI. I believe @AR-May was looking into something similar
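Those control sequences are ANSI SGR color codes; the leading ESC byte was lost when pasting, so `[33;1m` is really `ESC[33;1m`. When diffing TerminalLogger output against plain-logger output, a small helper like this (hypothetical, not part of the repo) can normalize the logs first:

```python
import re

# Matches SGR color sequences such as ESC[33;1m and the reset ESC[m.
ANSI_SGR = re.compile(r'\x1b\[[0-9;]*m')

def strip_ansi(line):
    """Remove terminal color sequences so TerminalLogger output can be
    compared line-by-line with plain-logger output."""
    return ANSI_SGR.sub('', line)

cleaned = strip_ansi('\x1b[31;1m  SampleAnalyzerIntegrationTest [FAIL]\x1b[m')
print(cleaned)
```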

JanKrivanek avatar May 20 '24 18:05 JanKrivanek

@JanKrivanek and @maridematte I can reproduce this issue intermittently with the command "dotnet test ./artifacts/bin/Microsoft.Build.BuildCheck.UnitTests/Debug/net8.0/Microsoft.Build.BuildCheck.UnitTests.dll --filter "FullyQualifiedName~SampleAnalyzerIntegrationTest"" on a Linux machine.

Added -v:d to the process command to output more info and found the build is stuck in the target GetCopyToPublishDirectoryItems

Image

This is the build log: log.txt.
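One way to spot where such a log stops is to look for targets that logged a start line but never a completion line. A hypothetical Python helper for that (the `Target "..."` / `Done building target "..."` line patterns are assumptions about the diagnostic-log format and may vary by MSBuild version and verbosity):

```python
import re

def unfinished_targets(log_text):
    """Hypothetical helper: list targets that started but never logged
    completion in an MSBuild diagnostic log. Line formats are assumed."""
    started = re.findall(r'Target "([^"]+)"', log_text)
    done = set(re.findall(r'Done building target "([^"]+)"', log_text))
    # Preserve first-seen order and drop duplicates before filtering.
    return [t for t in dict.fromkeys(started) if t not in done]

sample = '''Target "Restore" in file "a.targets":
Done building target "Restore" in project "FooBar.csproj".
Target "GetCopyToPublishDirectoryItems" in file "b.targets":
'''
print(unfinished_targets(sample))  # the target with no completion line
```

On a truncated log from a hung build, the last entries in this list are the targets the build was still inside when the test was killed.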

JaynieBai avatar May 24 '24 12:05 JaynieBai

Perfect!

Can you collect 2 more cases to see if there is any pattern? Also, apart from the diag log, it might be helpful to have a binlog as well.

Other than that, I unfortunately do not have any idea at the moment about what could be causing this

JanKrivanek avatar May 24 '24 13:05 JanKrivanek


Here is the log with -v:diag: diaglog.txt

The binlog is incomplete when it failed: True.binlog.txt

Image

JaynieBai avatar May 29 '24 05:05 JaynieBai

Yeah - the binlog being cut abruptly is fine (and expected, as the test is killed after the timeout) - the timeline from the binlog will be interesting - it should show where the time was spent. And it'd be nice to see 2 or 3 cases - to be able to compare whether the excessive time is spent in the same part of the build or whether it is random.

Since it is Linux, we unfortunately cannot use ETW :-/ Maybe @AR-May or @ladipro have some top-of-the-head ideas of what to focus on beyond the timing in the binlog.

JanKrivanek avatar May 30 '24 08:05 JanKrivanek


I compared the logs of several failed cases; it's always the same part of the build. And unfortunately, there is no timeline shown in the binlog: True.binlog.txt

JaynieBai avatar May 31 '24 04:05 JaynieBai

From offline discussion

Observations:

  • The repros seem to indicate that the test gets stuck in the same place - during _GenerateRestoreProjectPathWalk execution (which is invoked from restore) - on FooBar-Copy.csproj (the second project) - while the previous call for the same project succeeded (from a recursive call of _GenerateRestoreProjectPathWalk on FooBar.csproj)
  • The target calls the MSBuild task (to execute the same target recursively) - but that MSBuild call never ends

Recommendation on further steps:

  • Let's set up WSL unit testing (https://github.com/dotnet/msbuild/blob/9bea8026aad964cb36f3ec9d93bd95a941487690/documentation/wiki/Building-Testing-and-Debugging-on-Full-Framework-MSBuild.md?plain=1#L36)
  • Let's see if the hang repros under WSL testing (if it doesn't, we can try the docker approach described in the same doc)
  • Let's set up WSL debugging: https://learn.microsoft.com/en-us/visualstudio/debugger/debug-dotnet-core-in-wsl-2?view=vs-2022
  • Run the test and once it hangs, break into the child MSBuild process from Debug -> Attach to Process
  • Inspect the stacks to see if there is any blocking
  • If the stack inspection and debugging won't help - we might need to try to collect traces: https://github.com/ltrzesniewski/dotnet-runtime/blob/master/docs/project/linux-performance-tracing.md

As a side note - the issue happens during restore. So it might get resolved by skipping the restore - https://github.com/dotnet/msbuild/issues/9747. But it would still be very valuable to understand why it is happening

JanKrivanek avatar Jun 05 '24 11:06 JanKrivanek

Though this issue might be resolved by the fix above, I have some information about trying to debug the test running on Linux remotely from VS on Windows. I tried the following approaches, but they failed with different errors, and I was not able to debug this test running on Linux.

  • Debug the test in WSL. Remote debugging in WSL got the error MSB4062: The "CheckForImplicitPackageReferenceOverrides" task could not be loaded from the assembly /mnt/d/WS/fork/msbuild/artifacts/bin/bootstrap/net8.0/MSBuild/Sdks/Microsoft.NET.Sdk/targets/../tools/net8.0/Microsoft.NET.Build.Tasks.dll. An attempt was made to load a program with an incorrect format. See TestLog.txt for more details. After my investigation, I found the assembly Microsoft.NET.Build.Tasks.dll existed, but this assembly from a clean build on Windows is not compatible with running in WSL due to its architecture being x64, while the architecture of it from a clean build on Linux is 64797.

  • Attach the debugger to the child process started by the test running on Linux. I could attach the debugger to the process (dotnet <MSBuild.dll in bootstrap>) started by the test on Linux for a while. There was no call stack displayed, and after a while the error below popped up. No Windows application event log was found either... image

GangWang01 avatar Jul 15 '24 09:07 GangWang01


@GangWang01, thank you for the analysis. I was able to catch it on Windows locally, after adding a new built-in rule. Using Parallel.ForEach in this case was causing a deadlock, according to the call stack. So it should be OK now. Could you please re-enable the tests in a separate PR now? I included it in my changes, but the review can take time :)
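For illustration of the failure mode: Parallel.ForEach deadlocks of this kind typically come from pool starvation — work running on the pool blocks waiting for other work that can only run on that same pool. The C# specifics aren't shown in this thread, so here is an analogous Python sketch with a single-worker thread pool (the bounded wait stands in for the real hang, which would block forever):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=1)

def inner():
    return 42

def outer():
    # Runs on the pool's only worker, then blocks waiting for inner(),
    # which is queued behind outer() itself -> classic pool starvation.
    f = pool.submit(inner)
    return f.result(timeout=2)  # bounded wait so the demo terminates

outcome = None
try:
    outcome = pool.submit(outer).result()
except FutureTimeout:
    outcome = "starved"

print(outcome)  # the nested wait times out instead of completing
pool.shutdown(wait=True)
```

With an unbounded wait (no `timeout=2`) this hangs exactly like the CI test did: the process stays alive with no progress until the harness kills it.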

YuliiaKovalova avatar Jul 15 '24 09:07 YuliiaKovalova

Fixed by https://github.com/dotnet/msbuild/pull/10353

JanKrivanek avatar Jul 23 '24 13:07 JanKrivanek