[MTP]: Timeout doesn't abort hanging tests

Open cbersch opened this issue 2 months ago • 20 comments

Describe the bug

With the VSTest-based test execution it was quite easy to add a real™ timeout which could kill hanging tests. The current --timeout doesn't work in that case. Using

  • .NET 10 RC2
  • MSTest 4.0.1
  • Windows 11

Steps To Reproduce

Consider the test

[TestClass]
public sealed class TimeoutTests
{
    [TestMethod]
    public void Lock()
    {
        ManualResetEventSlim mre = new(false);
        mre.Wait();
    }
}

with project

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>net10.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>

  </PropertyGroup>

  <PropertyGroup Condition="'$(UseMtp)' == 'true'">
    <EnableMSTestRunner>true</EnableMSTestRunner>
    <OutputType>Exe</OutputType>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="MSTest.TestFramework" Version="4.0.1" />
    <PackageReference Include="MSTest.TestAdapter" Version="4.0.1" />
  </ItemGroup>

  <ItemGroup Condition="'$(UseMtp)' != 'true'">
    <PackageReference Include="Microsoft.NET.Test.Sdk" Version="18.0.0" />
  </ItemGroup>
</Project>

and global.json

{
  "test": {
    "runner": "Microsoft.Testing.Platform"
  }
}

MTP

dotnet test --project .\Timeout.csproj --timeout 10s -p:UseMtp=true

runs forever (also Ctrl+C has no effect).

VSTest

Delete the global.json and execute

dotnet test .\Timeout.csproj --blame-hang-timeout=10s

This

  • "nicely" aborts the test run
  • Reports a useful message.
The active test run was aborted. Reason: Test host process crashed

The active Test Run was aborted because the host process exited unexpectedly. Please inspect the call stack above, if available, to get more information about where the exception originated from.
The test running when the crash occurred:
TimeoutTests.Lock

This test may, or may not be the source of the crash.
  Timeout test net10.0 failed with 1 error(s) and 2 warning(s) (12,0s)
    C:\Program Files\dotnet\sdk\10.0.100-rc.2.25502.107\Microsoft.TestPlatform.targets(48,5): warning Data collector 'Blame' message: The specified inactivity time of 10 seconds has elapsed. Collecting hang dumps from testhost and its child processes.
    C:\Program Files\dotnet\sdk\10.0.100-rc.2.25502.107\Microsoft.TestPlatform.targets(48,5): warning Data collector 'Blame' message: Dumping 62032 - testhost.
    C:\temp\mtp_timeout\bin\Debug\net10.0\Timeout.dll : error TESTRUNABORT: Test Run Aborted.
  • Creates a sequence file which contains the executed tests, including the one which was killed.

cbersch, Oct 21 '25 11:10

This is by-design. --timeout only requests cancellation. Can you try HangDump extension for MTP instead of --timeout?

See https://learn.microsoft.com/dotnet/core/testing/microsoft-testing-platform-extensions-diagnostics#hang-dump for information.

But in short, add Microsoft.Testing.Extensions.HangDump to your test projects, matching the version of Microsoft.Testing.Platform you use, and run with dotnet test --project .\Timeout.csproj --hangdump --hangdump-timeout 10s -p:UseMtp=true

Youssef1313, Oct 21 '25 11:10
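
To illustrate what "only requests cancellation" means in practice, here is a minimal sketch. It assumes nothing about MTP internals: a plain CancellationTokenSource stands in for whatever the runner cancels on timeout, and CooperativeCancellationDemo and WaitObservingToken are hypothetical names. A wait that never observes a token, like mre.Wait() in the repro, cannot be unblocked by a cancellation request; a wait that is handed the token can.

```csharp
using System;
using System.Threading;

public static class CooperativeCancellationDemo
{
    // Blocks on an event that is never set, but observes the token.
    // Returns true if the wait ended because the token was cancelled.
    public static bool WaitObservingToken(CancellationToken token)
    {
        using ManualResetEventSlim mre = new(false);
        try
        {
            // Unlike the parameterless mre.Wait(), this overload throws
            // OperationCanceledException once the token is cancelled.
            mre.Wait(token);
            return false;
        }
        catch (OperationCanceledException)
        {
            return true;
        }
    }

    public static void Main()
    {
        // Stand-in for a runner timeout (assumption): cancel after 200 ms.
        using CancellationTokenSource cts = new(TimeSpan.FromMilliseconds(200));
        Console.WriteLine(WaitObservingToken(cts.Token)); // prints "True"
    }
}
```

In an MSTest test the token would presumably come from TestContext; whether --timeout cancels that particular token is not confirmed in this thread.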

> This is by-design. --timeout only requests cancellation. Can you try HangDump extension for MTP instead of --timeout?
>
> See https://learn.microsoft.com/dotnet/core/testing/microsoft-testing-platform-extensions-diagnostics#hang-dump for information.

Had missed that, thanks.

> But in short, add Microsoft.Testing.Extensions.HangDump to your test projects, matching the version of Microsoft.Testing.Platform you use, and run with dotnet test --project .\Timeout.csproj --hangdump --hangdump-timeout 10s -p:UseMtp=true

That doesn't work, tested with v 2.0.1:

Image

cbersch, Oct 21 '25 11:10

As a side note: the old --blame-hang-dump-type option supported the types full, mini and none (https://learn.microsoft.com/en-us/dotnet/core/tools/dotnet-test?tabs=dotnet-test-with-vstest). The HangDump extension has no equivalent of none, which is the one I'm using.

cbersch, Oct 21 '25 12:10

I don't recall all the details, but I think none was only creating a text file and not a proper dump. That file is not very helpful for diagnosing problems, so we decided not to port it to MTP.

Evangelink, Oct 21 '25 20:10

> I don't recall all the details, but I think none was only creating a text file and not a proper dump.

That's exactly what I need. The blame-hang-timeout timeout is to guard against too long execution of a single test, but I don't want the actual dump which can be tens of GB.

cbersch, Oct 21 '25 20:10

@cbersch do you care about having trx etc? There is no good way to force kill an app and still leave room for extensions to properly complete tasks.

Evangelink, Oct 26 '25 09:10

> @cbersch do you care about having trx etc?

No, I don't care about TRX or other extensions in that case.

But having a sequence file which reports the already executed tests and the test which hung helps a lot. Would that be possible as part of the HangDump extension itself?

cbersch, Oct 26 '25 12:10

Am I missing something, or is this extension currently completely useless with MTP v2? What I'm observing is that the test run never completes, can't be aborted with Ctrl+C, and no dump file is written to disk.

bart-vmware, Dec 03 '25 16:12

--timeout is not intended to be a "hard kill" or produce a dump.

If you really need a dump, you need to use --hangdump provided via Microsoft.Testing.Extensions.HangDump. Keep in mind that HangDump had some issues on at least macOS, which will be fixed in the next version.

Youssef1313, Dec 03 '25 16:12

I'm not using that, instead I use this:

dotnet test --coverage --report-trx --coverage-settings coverage.config --crashdump --hangdump --hangdump-timeout 10s

bart-vmware, Dec 03 '25 16:12

@bart-vmware Oh, this was fixed in https://github.com/microsoft/testfx/pull/6968.

Youssef1313, Dec 03 '25 16:12

Oh, that's nice. Any idea when this will be released? Until then, is there a dev-feed I can use to try the fix?

bart-vmware, Dec 03 '25 16:12

@bart-vmware So far there is no timeline for when that will be released. But you can test out the latest previews from https://pkgs.dev.azure.com/dnceng/public/_packaging/test-tools/nuget/v3/index.json.

Youssef1313, Dec 03 '25 19:12

> I don't recall all the details, but I think none was only creating a text file and not a proper dump.
>
> That's exactly what I need. The blame-hang-timeout timeout is to guard against too long execution of a single test, but I don't want the actual dump which can be tens of GB.

@Evangelink I find this point quite important for us. Could you consider adding dump type none to MTP? If yes, I could open a separate issue for this.

cbersch, Dec 10 '25 12:12

> That's exactly what I need. The blame-hang-timeout timeout is to guard against too long execution of a single test, but I don't want the actual dump which can be tens of GB.

@cbersch in case of a timeout with blame-hang none, what is your "next" step? Do you fix it? It's not clear to me why simply failing without any information would be useful for fixing the problem.

Also, I think that if you have a test like the one above, you should take the cancellation token from the test framework you're using and pass it down to the API you're testing. If the API has no method that accepts one, subscribe to the token and call Environment.FailFast("Timeout"), and you'll have behavior similar to hang with none.

MarcoRossignoli, Dec 10 '25 15:12
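
The "subscribe to the token" suggestion above can be sketched roughly as follows. This is an illustration under assumptions, not an MSTest or MTP API: HangWatchdog, Arm and ArmFailFast are hypothetical helper names, and the token is a plain CancellationToken that the test would have to obtain from its framework. Only CancellationToken.Register and Environment.FailFast are real .NET calls here.

```csharp
using System;
using System.Threading;

public static class HangWatchdog
{
    // Runs the given action when the token is cancelled.
    public static CancellationTokenRegistration Arm(CancellationToken token, Action onTimeout) =>
        token.Register(onTimeout);

    // Terminates the whole process when the token is cancelled.
    // Environment.FailFast skips finally blocks and finalizers, so no
    // extension in the same process gets a chance to finish its work.
    public static CancellationTokenRegistration ArmFailFast(CancellationToken token, string message) =>
        Arm(token, () => Environment.FailFast(message));

    public static void Main()
    {
        using CancellationTokenSource cts = new();
        bool fired = false;

        // A harmless action is used here so the demo does not kill itself.
        using CancellationTokenRegistration registration = Arm(cts.Token, () => fired = true);

        cts.Cancel(); // registered callbacks run synchronously during Cancel()
        Console.WriteLine(fired); // prints "True"
    }
}
```

A test that blocks in code it cannot pass a token to would call ArmFailFast at its start and dispose the registration when it finishes normally.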

@cbersch just to be sure I properly understand the behavior you are trying to achieve: do you want to stop the full execution (not having coverage, trx...) when the global run is longer than some time? Do you want to "kill" any test after some time and continue the execution? Something else?

Evangelink, Dec 10 '25 15:12

> That's exactly what I need. The blame-hang-timeout timeout is to guard against too long execution of a single test, but I don't want the actual dump which can be tens of GB.
>
> @cbersch in case of a timeout with blame-hang none, what is your "next" step? Do you fix it? It's not clear to me why simply failing without any information would be useful for fixing the problem.

We do have some tests which do massive computations on large data. If these reach a given timeout, e.g. because infrastructure issues lead to longer execution times (like wrong selection of Intel P vs E cores, or networking problems), then the respective dump would be huge and of no help.

> Also, I think that if you have a test like the one above, you should take the cancellation token from the test framework you're using and pass it down to the API you're testing. If the API has no method that accepts one, subscribe to the token and call Environment.FailFast("Timeout"), and you'll have behavior similar to hang with none.

Ok. Yes, that would be an alternative possibility. I'll check this. Thanks

cbersch, Dec 10 '25 15:12

> @cbersch just to be sure I properly understand the behavior you are trying to achieve: do you want to stop the full execution (not having coverage, trx...) when the global run is longer than some time? Do you want to "kill" any test after some time and continue the execution? Something else?

I would expect the blame-hang to stop the full execution if any test case takes longer than the given blame-hang-timeout. (So basically work like vstest does).

In that case I wouldn't expect any coverage, which is of no use anyway.

Regarding trx: it would be very helpful to log the offending test case in some file. A trx file would be great, because it would automatically list the respective test case as failed. The current behavior with the sequence file would also be OK, but it requires us to maintain an additional PS script which parses the VSTest sequence files and creates trx files to visualize which test hung.

cbersch, Dec 10 '25 15:12

> call Environment.FailFast("Timeout"), and you'll have behavior similar to hang with none.

Environment.FailFast will also trigger a dump if the system is set up to collect one, so that might not be ideal. It also makes it hard to distinguish a crash of the process from the expected teardown of a hanging test system. So to me, the hang dump with dump type none is a more elegant solution.

> (So basically work like vstest does).

Might be a technicality, but VSTest does not work like this. There is a single timer in hangdump that is reset every time a test ends. This means that a test that runs for 2 minutes will complete under a 10s hangdump timeout, as long as at least one other test finishes every 10 seconds.

The hang dump kicks in only once the other tests have completed and only hanging tests remain in the process for the duration of the hangdump timeout.

(MTP recently adopted this timer approach as well, after the problems with Monitor on non-Windows.)

nohwnd, Dec 11 '25 09:12
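
The single-timer behavior described above can be modeled with a small sketch. It is a simplified model, not the VSTest or MTP implementation: InactivityTimerModel and FirstHangAt are hypothetical names, times are plain seconds, and the timer is assumed to start when the run starts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InactivityTimerModel
{
    // Returns the second at which a hang would be declared, or null if the
    // timer never expires. completionSeconds lists when each test finishes;
    // every completion resets the single inactivity timer.
    public static double? FirstHangAt(
        IEnumerable<double> completionSeconds,
        double timeoutSeconds,
        bool testStillRunningAtEnd)
    {
        double lastActivity = 0; // assumption: timer starts with the run
        foreach (double completion in completionSeconds.OrderBy(t => t))
        {
            if (completion - lastActivity > timeoutSeconds)
            {
                return lastActivity + timeoutSeconds;
            }
            lastActivity = completion;
        }

        // After the last completion, only a still-running test can trip the timer.
        return testStillRunningAtEnd ? lastActivity + timeoutSeconds : (double?)null;
    }

    public static void Main()
    {
        // A 120 s test alongside short tests finishing every 8 s: no hang.
        double[] steady = Enumerable.Range(1, 15).Select(i => i * 8.0).ToArray();
        Console.WriteLine(FirstHangAt(steady, 10, false).HasValue); // prints "False"

        // Short tests stop at 40 s while the 120 s test keeps running.
        double[] stalled = { 8, 16, 24, 32, 40, 120 };
        Console.WriteLine(FirstHangAt(stalled, 10, false)); // prints "50"
    }
}
```

In this model the repro from the issue (one test that never finishes, no completions) corresponds to FirstHangAt(new double[0], 10, true), which yields 10.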

> Environment.FailFast will also trigger a dump if the system is set up to collect one, so that might not be ideal.

Yep, if crash dump collection is set up. I prefer using the timeout token from the framework where needed.

MarcoRossignoli, Dec 11 '25 10:12