nunit-console icon indicating copy to clipboard operation
nunit-console copied to clipboard

Running tests in parallel sporadically fails with exit code 255

Open asendimitrov opened this issue 5 years ago • 10 comments

Hi, I'm using NUnit 24/7 at work and I'm sporadically hitting an issue when running tests in parallel.

NUnit Framework version: 3.12 (NuGet) NUnit ConsoleRunner version: 3.10 (NuGet) Console command: %nunit.console.path% --trace=Debug --inprocess --result=%nunit.result.path% %tests.dlls.list% --workers=10 ParallelizableScope: Children (on assembly level)

The issue occurs around 10% of the time on our TeamCity agents and the error is Process exited with code 255. It looks to be affected by the hardware of the box the tests are ran on - there's no hard proof of this - but I've never managed to reproduce it locally (the agents are VMs with weakish dual core processors whereas mine has 8).

Lowering the workers doesn't seem to affect as I've seen the same result with both 5 and not specified at all (would default to 2 when ran on TC).

I'm always using --inprocess ever since I first hit the issue in order to see the real reason why this happens but there's nothing specific either as an exception by the console or in the logs. Omitting --inprocess results in a SocketException issue (#255 )

What I've noticed is that this (mostly) occurrs when test number X is running, where X is equal to the number of workers. Here's an example with both 5 and 10:

5 workers:

[11:38:46]MyCompany.dll (23s)
[11:38:53]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("1") (16s)
[11:38:53]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("2") (16s)
[11:38:53]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("3") (16s)
[11:38:53]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("4") (16s)
[11:38:53]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("5") (16s)
[11:39:10]Process exited with code 255
[11:39:10]Process exited with code 255 (Step: Run tests (Command Line))
[11:39:10]Step Run tests (Command Line) failed

logs: 5 workers - InternalTrace.5864.zip

10 workers:

[11:03:42]MyCompany.dll (11s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("1") (5s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("2") (5s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("3") (5s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("4") (5s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("5") (5s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("6") (5s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test2(50,111.07d,50.1d) (5s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test3("1", 50,41.5d) (5s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test3("2", 50,41.5d) (5s)
[11:03:48]MyCompany.Tests.GetBalances.GetBalancesTests.Test4(50,"Default") (5s)
[11:03:53]Process exited with code 255
[11:03:53]Process exited with code 255 (Step: Run tests (Command Line))
[11:03:53]Step Run tests (Command Line) failed

logs: 10 workers - InternalTrace.2948.zip

Please let me know if you need anything else. Thanks.

asendimitrov avatar Nov 05 '19 12:11 asendimitrov

There's nothing in the logs to help here, unfortunately. Do you get anything additional logged when you run without --inprocess - other than the SocketException, of course?

ChrisMaddock avatar Nov 05 '19 22:11 ChrisMaddock

I don't think anything unusual other than default SocketException appears in the logs in that case. SocketException - InternalTrace.2136.zip

Example is from a build running with --workers=10 and without --inprocess

TeamCity logs:

[12:08:40]MyCompany.dll (36s)
[12:08:46]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("1") (31s)
[12:08:46]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("2") (31s)
[12:08:46]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("3") (31s)
[12:08:46]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("4") (31s)
[12:08:46]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("5") (31s)
[12:08:46]MyCompany.Tests.GetBalances.GetBalancesTests.Test1("6") (31s)
[12:08:46]MyCompany.Tests.ReserveFunds.ReserveFundsTests.Test1("1","BalanceTooLow","The requested amount cannot be reserved due to insufficient funds.",BadRequest) (31s)
[12:08:46]MyCompany.Tests.ReserveFunds.ReserveFundsTests.Test1("2","BalanceTooLow","The requested amount cannot be reserved due to insufficient funds.",BadRequest) (31s)
[12:08:46]MyCompany.Tests.ReserveFunds.ReserveFundsTests.Test1("3","BalanceTooLow","The requested amount cannot be reserved due to insufficient funds.",BadRequest) (31s)
[12:08:46]MyCompany.Tests.ReserveFunds.ReserveFundsTests.Test1("4","BalanceTooLow","The requested amount cannot be reserved due to insufficient funds.",BadRequest) (31s)
[12:09:15]
[12:09:15]Unhandled Exception: NUnit.Engine.NUnitEngineException: Remote test agent exited with non-zero exit code 255
[12:09:15]   at NUnit.Engine.Services.TestAgency.OnAgentExit(Object sender, EventArgs e)
[12:09:15]   at System.Diagnostics.Process.OnExited()
[12:09:15]   at System.Diagnostics.Process.RaiseOnExited()
[12:09:15]   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
[12:09:15]   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
[12:09:15]   at System.Threading._ThreadPoolWaitOrTimerCallback.PerformWaitOrTimerCallback(Object state, Boolean timedOut)
[12:09:17]
[12:09:17]Errors, Failures and Warnings
[12:09:17]
[12:09:17]1) Error : 
[12:09:17]System.Net.Sockets.SocketException : An existing connection was forcibly closed by the remote host
[12:09:17]--SocketException
[12:09:17]An existing connection was forcibly closed by the remote host
[12:09:17]
[12:09:17]Server stack trace: 
[12:09:17]   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
[12:09:17]   at System.Runtime.Remoting.Channels.SocketStream.Read(Byte[] buffer, Int32 offset, Int32 size)
[12:09:17]   at System.Runtime.Remoting.Channels.SocketHandler.ReadFromSocket(Byte[] buffer, Int32 offset, Int32 count)
[12:09:17]   at System.Runtime.Remoting.Channels.SocketHandler.Read(Byte[] buffer, Int32 offset, Int32 count)
[12:09:17]   at System.Runtime.Remoting.Channels.SocketHandler.ReadAndMatchFourBytes(Byte[] buffer)
[12:09:17]   at System.Runtime.Remoting.Channels.Tcp.TcpSocketHandler.ReadAndMatchPreamble()
[12:09:17]   at System.Runtime.Remoting.Channels.Tcp.TcpSocketHandler.ReadVersionAndOperation(UInt16& operation)
[12:09:17]   at System.Runtime.Remoting.Channels.Tcp.TcpClientSocketHandler.ReadHeaders()
[12:09:17]   at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.ProcessMessage(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream, ITransportHeaders& responseHeaders, Stream& responseStream)
[12:09:17]   at System.Runtime.Remoting.Channels.BinaryClientFormatterSink.SyncProcessMessage(IMessage msg)
[12:09:17]
[12:09:17]Exception rethrown at [0]: 
[12:09:17]   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
[12:09:17]   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
[12:09:17]   at NUnit.Engine.ITestEngineRunner.Run(ITestEventListener listener, TestFilter filter)
[12:09:17]   at NUnit.Engine.Runners.ProcessRunner.RunTests(ITestEventListener listener, TestFilter filter)
[12:09:17]
[12:09:17]Test Run Summary
[12:09:17]  Overall result: Failed
[12:09:17]  Test Count: 0, Passed: 0, Failed: 0, Warnings: 0, Inconclusive: 0, Skipped: 0
[12:09:17]  Start time: 2019-11-06 12:08:39Z
[12:09:17]    End time: 2019-11-06 12:09:15Z
[12:09:17]    Duration: 35.610 seconds
[12:09:17]
[12:09:17]Results (nunit3) saved as src/NUnitTestResult.xml
[12:09:17]
[12:09:17]NUnit.Engine.NUnitEngineUnloadException : Agent Process was terminated successfully after error.
[12:09:17]  ----> System.Net.Sockets.SocketException : No connection could be made because the target machine actively refused it 127.0.0.1:53782
[12:09:17]Process exited with code -100
[12:09:17]Process exited with code -100 (Step: Run tests (Command Line))
[12:09:17]Step Run tests (Command Line) failed

asendimitrov avatar Nov 06 '19 12:11 asendimitrov

You're right, there's no extra clues here. I'm afraid I'm really not sure what would cause this, perhaps others on @nunit/engine-team might have some ideas?

ChrisMaddock avatar Nov 09 '19 16:11 ChrisMaddock

The SocketException means the agent process died unceremoniously. What's the error message when the agent is not in the picture (use --inprocess)?

jnm2 avatar Nov 09 '19 16:11 jnm2

Nothing, it seems - see the top post. That's the bit that's confusing me! 🙂

ChrisMaddock avatar Nov 09 '19 16:11 ChrisMaddock

Oh, sorry. This is a long shot but I wonder if it could possibly be fixed by https://github.com/nunit/nunit-console/pull/660. How about trying the latest from the prerelease feed?

jnm2 avatar Nov 09 '19 16:11 jnm2

The internal trace just stops, similar to what I was seeing with the .NET Framework race condition that #660 avoids.

(Btw, the race condition is fixed in the November .NET Framework updates.)

jnm2 avatar Nov 09 '19 16:11 jnm2

Doesn't seem to help in any way :(

Tried several times with NUnit.ConsoleRunner.3.11.0-dev-04648 and results are exactly the same: Process breaks at the 10th test (with --workers=10) and logs contain just the SocketException or nothing at all if running with --inprocess.

asendimitrov avatar Nov 11 '19 08:11 asendimitrov

@nunit/engine-team Doesn't the engine get distributed with access to source code for debugging? Maybe we should make --agent-debug available in release builds so people can step through it?

@asendimitrov Just to be sure... have you looked at the separate internal trace log produced for the agent process? Each run may give you three or four logs.

CharliePoole avatar Nov 11 '19 19:11 CharliePoole

@CharliePoole thanks for tuning in. Yes, I've checked all 3 log files and there doesn't see to be a hint in either of them. Above I've shared a zip containing the nunit agent process log, and in it the last messages are:

12:08:40.441 Info  [ 5] NUnitFrameworkDriver: Loading MyCompany.dll - see separate log file
12:08:40.661 Info  [ 5] NUnitFrameworkDriver: Loaded MyCompany.dll
12:08:40.662 Info  [ 5] NUnitFrameworkDriver: Running MyCompany.dll - see separate log file

The process breaks after those timestamps, unless I'm missing anything?

asendimitrov avatar Nov 12 '19 07:11 asendimitrov