Failure to collect hang dump from dotnet test running on arm64 macOS in Helix

Open marcpopMSFT opened this issue 11 months ago • 2 comments

Description

The SDK is hitting timeouts in our arm64 test leg. To track this down, we've switched to running dotnet test with --blame-hang to collect hang dumps from Helix. However, we have not been able to get dumps collected and saved to Helix.

console.d86c9d4a.log

Configuration

I do not know if the error mentioned is specific to arm64 macOS, but that's where it reproduced.

Regression?

Unknown

Other information

The test team said to report the issue here: https://github.com/dotnet/sdk/pull/45520#issuecomment-2592305499

dotnet test command run:

dotnet test Microsoft.NET.Build.Tests.dll -e HELIX_WORK_ITEM_TIMEOUT=02:00:00 -e DOTNET_SDK_TEST_EXECUTION_DIRECTORY=/private/tmp/helix/working/96B108B4/w/A86508FF/e/testExecutionDirectory --results-directory ./ --logger trx --logger 'console;verbosity=detailed' --blame-hang --blame-hang-timeout 15m --filter <list of tests> --

Output from the log file:

[xUnit.net 00:21:37.57] Microsoft.NET.Build.Tests: [Long Running Test] 'Microsoft.NET.Build.Tests.GivenThatWeWantToVerifyProjectReferenceCompat.Project_reference_compat(referencerTarget: "netstandard1.4", testIDPostFix: "Full", rawDependencyTargets: "netstandard1.0 netstandard1.1 netstandard1.2 netst"···, restoreSucceeds: True, buildSucceeds: True)', Elapsed: 00:21:08
[createdump] Gathering state for process 22368 
[createdump] Target process is alive
[createdump] thread_get_state(627f7) FAILED (os/kern) invalid argument (4)
[createdump] Failure took 13ms
The active test run was aborted. Reason: Test host process crashed : [createdump] thread_get_state(627f7) FAILED (os/kern) invalid argument (4)
[createdump] Failure took 13ms

Data collector 'Blame' message: The specified inactivity time of 15 minutes has elapsed. Collecting hang dumps from testhost and its child processes.
Data collector 'Blame' message: Data collector caught an exception of type 'System.IO.FileNotFoundException': 'Collect dump was enabled but no dump file was generated.'. More details: Blame: Collecting hang dump failed with error...
Results File: /private/tmp/helix/working/96B108B4/w/A86508FF/e/_dci-macm2-build-013_2025-01-13_18_59_49.trx

Attachments:
  /private/tmp/helix/working/96B108B4/w/A86508FF/e/57527c31-fa14-4d1f-ab93-ddde833bffa9/Sequence_86717a59614b4b9cbd87c41593080a68.xml
Test Run Aborted.
Total tests: Unknown
     Passed: 43
 Total time: 21.9121 Minutes

The active Test Run was aborted because the host process exited unexpectedly. Please inspect the call stack above, if available, to get more information about where the exception originated from.
The test running when the crash occurred: 
Microsoft.NET.Build.Tests.GivenThatWeWantToVerifyProjectReferenceCompat.Project_reference_compat

This test may, or may not be the source of the crash.
+ export _commandExitCode=1
+ _commandExitCode=1

marcpopMSFT, Jan 21 '25 17:01

That invalid argument error is a weird one in this case. It might mean the thread died at this point; there's no suspicious parameter otherwise.
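
To check whether the same failure shows up in other Helix runs, the console logs could be scanned for createdump failure lines. The following is only a minimal sketch that assumes the log format shown in the output above; the function name `find_createdump_failures` and the regular expression are illustrative and not part of createdump or any existing tool.

```python
import re

# Assumed line shape, taken from the log in this issue:
# [createdump] thread_get_state(627f7) FAILED (os/kern) invalid argument (4)
FAILURE_RE = re.compile(
    r"\[createdump\] (?P<call>\w+)\((?P<arg>[0-9a-fA-F]+)\) FAILED "
    r"\((?P<domain>[^)]+)\) (?P<message>.+?) \((?P<code>\d+)\)"
)


def find_createdump_failures(log_text):
    """Return one dict per createdump failure line found in log_text.

    Hypothetical helper: each dict has the keys call, arg, domain,
    message, and code, all as strings.
    """
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            failures.append(match.groupdict())
    return failures
```

Grouping the results by `call` and `code` across several logs would show whether the failure is always thread_get_state with code 4 or whether other calls fail too.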

hoyosjs, Jan 21 '25 22:01

I'm not sure if this is the same root cause as https://github.com/dotnet/diagnostics/issues/5061. @mikem8361 any updates please?

Youssef1313, Jun 02 '25 16:06