sdk icon indicating copy to clipboard operation
sdk copied to clipboard

Regression: `dotnet test` hangs for 15 minutes after test run completed on mac and linux

Open AArnott opened this issue 6 years ago • 34 comments

Our builds on VSTS include dotnet test, which restores, builds, and tests our product. With dotnet CLI 2.1.100 all went well. But when we upgraded to 2.1.300, the dotnet tool doesn't exit for 15 minutes after the test run is completed with no output to explain the delay.

Expected behavior

The dotnet tool exits immediately, as can be seen on this 2.1.100 run:

image

Actual behavior

The dotnet tool exits after waiting 15 minutes, as can be seen on this 2.1.300 run:

image

Steps to reproduce

Please contact me via my microsoft.com email address for links to the builds if you'd like to investigate further.

AArnott avatar Jun 01 '18 14:06 AArnott

I am not sure what you are doing in VSTS, but I suspect this might be long running servers that are still live when the test finishes.

In 2.1.300, we now have a long running VBCSCompiler and a MSBuild node re-use server and a Razor compilation server.

They all stay around for some time to avoid JITing and improve perf if you attempt a build again.

You can turn them off by running dotnet build-server shutdown after the test execution. Or, you can set msbuild properties/env variables to prevent them from running to begin with.

livarcocc avatar Jun 01 '18 17:06 livarcocc

I suspect this might be long running servers that are still live when the test finishes.

Perhaps. But why would that prevent the tool from exiting?

You can turn them off by running dotnet build-server shutdown after the test execution

How would I run that on VSTS after test execution, since test execution itself doesn't return control to the next step in my build till the 15 minutes are over?

Or, you can set msbuild properties/env variables to prevent them from running to begin with.

I'm willing to try that. Where do I find the variables to set? And if it works, is that a bug in dotnet CLI then that it's waiting for those servers when the work is finished anyway?

AArnott avatar Jun 01 '18 18:06 AArnott

@peterhuene can you help @AArnott in getting the properties to turn off the long running servers?

@AArnott I don't understand enough about how VSTS does test execution to know why it would consider as if test was not done in that case.

livarcocc avatar Jun 01 '18 21:06 livarcocc

Use -p:UseRazorBuildServer=false to disable the Razor (rzc) server.

Use -p:UseSharedCompilation=false to disable the Roslyn (vbcscompiler) server.

For MSBuild, pass /nodeReuse:false on the command line to disable node re-use.

I don't know of anything that might be waiting on the servers. There was an issue where MSBuild might hang with node re-use enabled on non-Windows platforms (https://github.com/Microsoft/msbuild/issues/3161), but that was fixed for RTM.

Would it be possible to capture the stacks of the hang?

peterhuene avatar Jun 01 '18 21:06 peterhuene

FWIW, I'm seeing hangs on Linux for dotnet build unless I disable node reuse too. This happens on a machine where I'm running BenchmarkDotNet benchmarks, which does its own builds as well - so it's possible that something in that is jamming a build node. (My next step will be to find a way of disabling build reuse globally...)

jskeet avatar Jun 03 '18 06:06 jskeet

I can confirm that setting MSBUILDDISABLENODEREUSE=1 on mac and linux removes the hang at the end of the build+test invocation as well.

AArnott avatar Jun 03 '18 16:06 AArnott

FWIW, I vaguely recall a bug in a Windows process my team owned where a parent process spawned my child process, and so long as my child process was running the parent process never entirely exited. We fixed it (I don't remember how but could potentially find out). I wonder if something similar is happening here, where dotnet test (which builds first) spawned the reusable node processes and children (instead of as top-level processes perhaps?) and now the parent process cannot exit till the reusable nodes timeout and exit. In that case, perhaps the fix is the change how the child processes are spawned so that they do not retain a child relationship with the parent.

AArnott avatar Jun 03 '18 16:06 AArnott

@rainersigwald Some reports of hitting hangs with node reuse above.

peterhuene avatar Jun 04 '18 06:06 peterhuene

This sounds exactly like what https://github.com/Microsoft/msbuild/commit/e604f5f42cb277d02a3b4bf4ce7136c1e10a5eee addressed. But evidently that was incomplete or doesn't work everywhere.

@jskeet @AArnott what's the best way for me to repro your cases? Can I just clone nodatime (before nodatime/nodatime#1157) and run build/runbenchmarks.sh?

rainersigwald avatar Jun 04 '18 16:06 rainersigwald

@rainersigwald: I believe that should repro it, yes. You'll need to specify the targets to run - netcoreapp2.0 netcoreapp2.1 is what I do normally. It's slightly harder to tell as I've normally run it as part of a cron job. I'd be happy to try it beforehand if that would help. Note that it's expected that it takes multiple hours, but there should be a lot of output a lot of the time.

When it's broken, you should be able to run dotnet build anywhere (e.g. in src/NodaTime) and it'll hang.

jskeet avatar Jun 04 '18 16:06 jskeet

@jskeet Oh, that's interesting. It repros with dotnet build and not just dotnet test?

I think I understand the dotnet test problem: I think we're accidentally running the before-test build within the context of the

https://github.com/dotnet/cli/blob/7ce47778b4a5238db133cd0d4ffc3aa2695e406c/src/dotnet/commands/dotnet-test/Program.cs#L99

environment, leading the build nodes to be created in a hangy way.

But that wouldn't apply to build so I'll have to try debugging through the nodatime repro.

rainersigwald avatar Jun 04 '18 17:06 rainersigwald

@rainersigwald: I don't run dotnet test at all (in this workflow, or on that machine) - it repros just using dotnet run with BenchmarkDotNet, but that does some more builds in the background. I don't know all the details of what it does, but I doubt that it calls dotnet test.

But once it's in the "broken" state, then yes, dotnet build hangs - as does dotnet restore. (Specifying -nodeReuse:false on the command line makes both work.)

I'm happy to try to get it into the broken state again and then poke around - are there diagnostics that would be useful to try at that point?

jskeet avatar Jun 04 '18 17:06 jskeet

I just want to chime in and say that I'm experiencing this issue as well. I haven't been able to reproduce it reliably yet, but I'll keep trying and once I do, I'll post a link to a repo here.

The scenario I have is a unit test that is testing some MSBuild targets files. This involves calling dotnet build during the unit test. This is basically what it does:

  1. Write a project file that uses the targets to disk.
  2. Compile the project using dotnet build.
  3. Look at the output from the build and assert some stuff.

When running this via dotnet test, it sometimes hangs for 15 minutes after the tests complete (the test duration reported is only a few seconds), but sometimes it doesn't hang at all.

reduckted avatar Aug 30 '18 01:08 reduckted

@reduckted You should be able to resolve that by calling dotnet build -nodereuse:false inside your test. What's happening for you is that the outer dotnet test sets the environment variable that MSBuild uses to decide whether to attach the console or not. That is propagated down through child processes to the test process and then to the test's child dotnet build process, which then starts with node reuse but without the safe disconnect-console node startup. Explicitly specifying "no node reuse" should cause the worker nodes to exit immediately after the build, allowing the wait-on-child-process code to complete quickly.

rainersigwald avatar Aug 31 '18 14:08 rainersigwald

Thanks @rainersigwald. 👍 I had tried that after reading some of the earlier posts and the hangs appeared to stop, but since I couldn't reliably reproduce it to begin with, I wasn't 100% sure it was the solution.

reduckted avatar Sep 01 '18 00:09 reduckted

FYI, this is happening for me on Ubuntu 18.04 and with dotnet 2.2.300. Passing -nodereuse:false to dotnet test solves this problem for me.

/usr/bin/dotnet test --test-adapter-path . --framework netcoreapp2.2 --logger xunit;LogFilePath=/home/.../output/testresults-linux.xml -nodereuse:false

GeorgDangl avatar Jun 03 '19 19:06 GeorgDangl

I can confirm that this still happens on .NET Core 3.1

ubuntu@VM-0-4-ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic
root@195d912bb800:/# dotnet --version
3.1.100
Test run for ***********************(.NETCoreApp,Version=v3.1)
Microsoft (R) Test Execution Command Line Tool Version 16.3.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...

A total of 1 test files matched the specified pattern.                      <---- hangs forever

-nodereuse:false doesn't work either.

wakuflair avatar Dec 20 '19 03:12 wakuflair

I can confirm that this still happens on .NET Core 3.1

ubuntu@VM-0-4-ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic
root@195d912bb800:/# dotnet --version
3.1.100
Test run for ***********************(.NETCoreApp,Version=v3.1)
Microsoft (R) Test Execution Command Line Tool Version 16.3.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...

A total of 1 test files matched the specified pattern.                      <---- hangs forever

-nodereuse:false doesn't work either.

This the exact spot where my test run also hangs.

One interesting point I noticed is that the same code without changes runs in Visual Studio for Mac (the very first test method runs ok, and the others take 2+ minutes each - pure unit tests with mocking - so not even hitting the db). Not sure if that can help solve the issue. dotnet version: 3.1.201

vraravam avatar Apr 14 '20 12:04 vraravam

I too was facing the same issue where tests hang when executed through Visual Studio or through command line. In my case, the issue turned out to be happening because of a Nuget package I was using to generate Fake objects - AutoBogus . The library has an open issue for the same problem here AutoBogus Issue. When I switched the package to use Bogus the library which AutoBogus uses, it fixed the issue. My tests are running smoothly now. Sharing this here, just in case if it helps any one.

srihere17 avatar Apr 15 '20 06:04 srihere17

I am still seeing this behavior on Azure DevOps with .NET Core 3.1.301 on a Windows Hosted Agent with VS2019

running dotnet test within a powershell script causes it to hang at the end of the execution. willing to share build and repro details.

Eg: Get's stuck here for 40+ minutes

image

CC: @rainersigwald @peterhuene @livarcocc

rohit21agrawal avatar Jul 09 '20 23:07 rohit21agrawal

Same issue on Windows Server 2019 Standard dotnet version of 5.0.100-preview.8.20417.9

image

Passing -nodereuse:false to dotnet test does not solves this issue.

I think I might have found the culprit. dotnet test expects to be run in a terminal window

When run in an emacs shell

[bence@x250:/tmp]$ dotnet new xunit --name XUnitTiming
The template "xUnit Test Project" was created successfully.

Processing post-creation actions...
Running 'dotnet restore' on XUnitTiming/XUnitTiming.csproj...
  Restore completed in 463.81 ms for /tmp/XUnitTiming/XUnitTiming.csproj.

Restore succeeded.


[bence@x250:/tmp]$ echo $TERM
dumb

[bence@x250:/tmp]$ time dotnet test XUnitTiming/XUnitTiming.csproj 
Test run for /tmp/XUnitTiming/bin/Debug/netcoreapp3.1/XUnitTiming.dll(.NETCoreApp,Version=v3.1)
Microsoft (R) Test Execution Command Line Tool Version 16.3.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...

A total of 1 test files matched the specified pattern.

Test Run Successful.
Total tests: 1
     Passed: 1
 Total time: 1.2366 Seconds

real	0m23.992s
user	0m3.246s
sys	0m0.379s

[bence@x250:/tmp]$ 

When run in xterm:

[bence@x250:/tmp]$ echo $TERM
xterm

[bence@x250:/tmp]$ time dotnet test XUnitTiming/XUnitTiming.csproj 
Test run for /tmp/XUnitTiming/bin/Debug/netcoreapp3.1/XUnitTiming.dll(.NETCoreApp,Version=v3.1)
Microsoft (R) Test Execution Command Line Tool Version 16.3.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...

A total of 1 test files matched the specified pattern.
                                                                                                                                                                                                                                                                                                                              
Test Run Successful.
Total tests: 1
     Passed: 1
 Total time: 1.2068 Seconds

real    0m2.699s
user    0m2.961s
sys     0m0.318s

[bence@x250:/tmp]$ 

That 20 seconds in the first case is spent after the line A total of 1 test files matched the specified pattern. Where I assume it is expecting to read something from the terminal

bencef avatar Oct 15 '20 12:10 bencef

Hey folks, i'm facing this issue too, i have a project with BDD tests using specflow and mocking external api's using wiremock.net, when the pipeline run the line A total of 1 test files matched the specified pattern. stay stucked for 16 minutes, and it's just 5 tests scenarios.

Leonardo18 avatar Oct 22 '20 16:10 Leonardo18

Hello Guys, I tried all options said here, but nothing work.

I solves my problem (happening only in azure devops integration) turning off parallel test executing

In my test project > Add folder Properties > Add AssemblyInfo.cs

using Xunit;

[assembly: CollectionBehavior(DisableTestParallelization = true)]

ThiagoBarradas avatar Mar 19 '21 18:03 ThiagoBarradas

I'm experiencing some test hanging within GitLab and docker image for the dotnet SDK 5.0 and this seems a quite common issue. See https://github.com/microsoft/vstest/issues/2080

One question for those of you experiencing this. Are you using Hosted services? I can reproduce the issue only when using IHostedService, but if I disable that, all is good.

For example, I've got the following:

public class InfrastructureInitializer
    : IHostedService
{
    private readonly EventBusManager _eventBusManager;

    public InfrastructureInitializer(EventBusManager eventBusManager)
    {
        _eventBusManager = eventBusManager;
    }

    public async Task StartAsync(CancellationToken cancellationToken)
    {
        var cancellationTokenSource = new CancellationTokenSource(TimeSpan.FromSeconds(10));
        await _eventBusManager.Start(cancellationTokenSource.Token);
    }

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        var cancellationTokenSource = new CancellationTokenSource(TimeSpan.FromSeconds(10));
        await _eventBusManager.Stop(cancellationTokenSource.Token);
    }
}

and registered in DI container

services.AddSingleton<IHostedService, InfrastructureInitializer>();

Notice there is a cancellation token that should be respected (and locally it does) if for whatever reason my event bus is unreachable. It seems that a deadlock is caused by this despite of properly awaiting code and having the cancellation token.

Is it because of tests running in parallel? Is that a problem when using IHostedService? Why in Windows or WSL2 ubuntu I'm not having this issue then?

diegosasw avatar Mar 30 '21 09:03 diegosasw

Verified the deadlock issue does not happen for the IHostedService when disabling parallelization as per Thiago's suggestion.

In my functional tests project I create a Properties/AssemblyInfo.cs to disable parallelization only in Release mode, because locally in Debug I don't find the issue.

using Xunit;
#if DEBUG
[assembly: CollectionBehavior(DisableTestParallelization = false)]
#else
[assembly: CollectionBehavior(DisableTestParallelization = true)]
#endif

diegosasw avatar Mar 30 '21 10:03 diegosasw

I am seeing this issue on CircleCI with a Windows Server 2019 VM. CircleCI's no_output_timeout is causing it to report failure when it actually succeeded.

SDK Version: 5.0.302

Passed!  - Failed:     0, Passed:  1687, Skipped:     0, Total:  1687, Duration: 4 m 39 s - ProtoPromiseTests.dll (net5.0)

Too long with no output (exceeded 10m0s): context deadline exceeded

Passing -nodereuse:false seems to help, though I've only ran 1 test since I added it, so I'm not sure how stable it is.

timcassell avatar Aug 04 '21 11:08 timcassell

I can reproduce the issue only when using IHostedService, but if I disable that, all is good.

I just reported a similar bug, related to Task.Delay in IHostedService.StopAsync():

Deadlock in integration tests after upgrading to ASP.NET Core 6 https://github.com/dotnet/aspnetcore/issues/38649

angularsen avatar Nov 26 '21 14:11 angularsen

FWIW

I'm having a similar situation at github actions with setup .NET action, using XUnit + .NET6

lpeixotoo avatar Jan 07 '22 20:01 lpeixotoo

I am using Xunit and was able to fix this issue by adding the following to csproj:

<PackageReference Include="Microsoft.NET.Test.Sdk" Version="17.0.0" />

lloydstockman avatar Mar 17 '22 20:03 lloydstockman