runtime icon indicating copy to clipboard operation
runtime copied to clipboard

Intermittend failures in ThreadPoolTests on Mono Linux ARM64

Open jkoritzinsky opened this issue 2 years ago • 7 comments

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=266460 Build error leg or test failing: System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest Pull request:

Error Message

Fill the error message using known issues guidance.

{
  "ErrorMessage": "Exit code was 139 but it should have been 42",
  "BuildRetry": false,
  "ExcludeConsoleLog": true
}

Report

Build Definition Test Pull Request
533131 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
533126 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
533114 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
531704 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
531687 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
531702 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
530026 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
530031 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
530032 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
528394 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
528407 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
528396 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
527301 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
527292 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
527283 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
526648 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
526208 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
526198 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
526195 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
526197 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
526190 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
523477 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
523476 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
521431 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
521447 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
521433 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
521439 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
519187 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
517705 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
517694 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
517706 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
517695 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
517703 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
517697 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
517194 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
516789 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
516782 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
516024 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
516008 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
514563 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
513653 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.TestWithPropertyThrowing dotnet/runtime#96440
512010 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
511537 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
511524 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
511520 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
511303 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
511300 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
511296 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
510969 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
510976 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
510960 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
510443 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
509956 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
509958 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
509963 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
509409 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
509412 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
509393 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
508998 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
509008 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
508523 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
508211 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
508212 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
507916 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
507915 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
507381 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
507379 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash
507370 dotnet/runtime System.Diagnostics.Tests.DiagnosticSourceEventSourceBridgeTests.Stress_WriteConcurrently_DoesntCrash

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 16 68

Known issue validation

Build: :mag_right: Result validation: :warning: Validation could not be done without an Azure DevOps build URL on the issue. Please add it to the "Build: :mag_right:" line.

jkoritzinsky avatar May 09 '23 04:05 jkoritzinsky

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

Issue Details

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=266460 Build error leg or test failing: System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest Pull request:

Error Message

Fill the error message using known issues guidance.

{
  "ErrorMessage": "Exit code was 139 but it should have been 42",
  "BuildRetry": false,
  "ExcludeConsoleLog": true
}
Author: jkoritzinsky
Assignees: -
Labels:

area-System.Threading, blocking-clean-ci, Known Build Error

Milestone: -

ghost avatar May 09 '23 04:05 ghost

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

Issue Details

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=266460 Build error leg or test failing: System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest Pull request:

Error Message

Fill the error message using known issues guidance.

{
  "ErrorMessage": "Exit code was 139 but it should have been 42",
  "BuildRetry": false,
  "ExcludeConsoleLog": true
}

Report

Build Definition Test Pull Request
266460 dotnet/runtime System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest dotnet/runtime#85941

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
1 1 1
Author: jkoritzinsky
Assignees: -
Labels:

arch-arm64, area-System.Threading, os-linux, blocking-clean-ci, untriaged, Known Build Error, area-Threading-mono

Milestone: -

ghost avatar May 09 '23 15:05 ghost

@lambdageek - didn't we have a similar intermittent threading issue somewhere?

SamMonoRT avatar Jul 25 '23 14:07 SamMonoRT

here I have one on x64, with mono_dump_native_crash_info

=================================================================
	External Debugger Dump:
=================================================================
/tmp/mono-gdb-commands.209:1: Error in sourced command file:
ptrace: Operation not permitted.
[New LWP 210]
[New LWP 211]
[New LWP 212]
[New LWP 213]
Missing separate debuginfo for /root/helix/work/correlation/dotnet
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9a/ee12662a8f8c3b14915491ceba0663c152d89a.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /root/helix/work/correlation/host/fxr/9.0.0/libhostfxr.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/85/5812b12a0494863911e4320a48b9c4d4450817.debug
Missing separate debuginfo for /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libhostpolicy.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ea/2c76e0d159d038a7eab2f0cc9f07155aa572ef.debug
Dwarf Error: Cannot handle DW_FORM_strx1 in DWARF reader [in module /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so.dbg]
Dwarf Error: Cannot handle DW_FORM_strx1 in DWARF reader [in module /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libSystem.Native.so.dbg]
0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
  Id   Target Id                                         Frame 
* 1    Thread 0x7fe2b0ab5740 (LWP 209) "dotnet"          0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
  2    Thread 0x7fe2ae7ff700 (LWP 210) "SGen worker"     0x00007fe2b067f45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3    Thread 0x7fe2aee11700 (LWP 211) ".NET EventPipe"  0x00007fe2af89d301 in poll () from /lib64/libc.so.6
  4    Thread 0x7fe2ac9ff700 (LWP 212) "Finalizer"       0x00007fe2b0681da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  5    Thread 0x7fe2a5f94700 (LWP 213) ".NET Long Runni" 0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0

Thread 5 (Thread 0x7fe2a5f94700 (LWP 213)):
#0  0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
#1  0x00007fe2af3f2212 in mono_dump_native_crash_info () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2  0x00007fe2af39946e in mono_handle_native_crash () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3  0x00007fe2af3001e1 in mono_sigsegv_signal_handler_debug () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4  <signal handler called>
#5  0x00007fe2af52f412 in mono_profiler_set_jit_begin_callback () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#6  0x00007fe2af468a08 in EventPipeEtwCallbackDotNETRuntime () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#7  0x00007fe2af45a3e9 in provider_invoke_callback () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#8  0x00007fe2af45ab34 in disable_helper () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#9  0x00000000418b9cfa in ?? ()
#10 0x000000000000000a in ?? ()
#11 0x00007fe2af617150 in ?? ()
#12 0x00007fe2a5f93848 in ?? ()
#13 0x0000000000000001 in ?? ()
#14 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fe2ac9ff700 (LWP 212)):
#0  0x00007fe2b0681da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fe2b0681e98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fe2af55f9e8 in finalizer_thread () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3  0x00007fe2af538669 in start_wrapper () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4  0x00007fe2b06791ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fe2af7b2e73 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fe2aee11700 (LWP 211)):
#0  0x00007fe2af89d301 in poll () from /lib64/libc.so.6
#1  0x00007fe2af484ece in ds_ipc_poll () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2  0x00007fe2af481de8 in ds_ipc_stream_factory_get_next_available_stream () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3  0x00007fe2af48044f in server_thread () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4  0x00007fe2af484341 in ep_rt_thread_mono_start_func () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#5  0x00007fe2b06791ca in start_thread () from /lib64/libpthread.so.0
#6  0x00007fe2af7b2e73 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fe2ae7ff700 (LWP 210)):
#0  0x00007fe2b067f45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe2af5c7323 in thread_func () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2  0x00007fe2b06791ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe2af7b2e73 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe2b0ab5740 (LWP 209)):
#0  0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
#1  0x00007fe2af3f2212 in mono_dump_native_crash_info () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2  0x00007fe2af39946e in mono_handle_native_crash () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3  0x00007fe2af3f1981 in sigabrt_signal_handler () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4  <signal handler called>
#5  0x00007fe2af7c7acf in raise () from /lib64/libc.so.6
#6  0x00007fe2af79aea5 in abort () from /lib64/libc.so.6
#7  0x00007fe2af48fb05 in monoeg_assert_abort () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#8  0x00007fe2af4a0106 in mono_log_write_logfile () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#9  0x00007fe2af48fec6 in monoeg_g_logv () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#10 0x00007fe2af48fffe in monoeg_g_log () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#11 0x00007fe2af467b5b in ep_rt_mono_fini () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#12 0x00007fe2af304601 in mini_cleanup () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#13 0x00007fe2af4e5dbb in ves_icall_System_Environment_Exit () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#14 0x00000000418b8bba in ?? ()
#15 0x0000000000000000 in ?? ()
[Inferior 1 (process 209) detached]

Log

pavelsavara avatar Dec 06 '23 14:12 pavelsavara

@lateralusX some kind of race condition in eventpipe shutdown, looks like.

lambdageek avatar Dec 06 '23 19:12 lambdageek

Note that we don't shut down threads in Environment.Exit () so there is a possibility of races.

vargaz avatar Dec 06 '23 20:12 vargaz

ep_rt_mono_fini assumes that all threads that might run EventPipe code has been stopped, so if there are still threads that can call into EventPipe at that point, it will race with shutdown logic. In this case Thread 5 is disabling a managed EventSource provider that will use the profiler callbacks, that is reset by ep_rt_mono_fini since it assumes no other threads are running at that point, the assert in Thread 1 happens since it will clean up a mutex, that is currently held by Thread 5.

On CoreClr/NativeAot we don't have any cleanup done in ep_rt_shutdown and those runtimes will leak resources, but on Mono we do cleanup of runtime resources. We probably need to detect if the shutdown is triggered in a way where other managed threads migth still be running when calling ep_rt_shutdown, if so we would probably need to leak these resources.

lateralusX avatar Dec 07 '23 08:12 lateralusX

This should be fixed on https://github.com/dotnet/runtime/pull/99332

tommcdon avatar May 20 '24 22:05 tommcdon

Looks like the test passed on the linux-arm64 Debug Mono_MiniJIT_LibrariesTests lane on a mono related PR in the general runtime pipeline. Didn't find the testResults.xml, since the helix job only had a console, but I was able to filter the tests tab on the azdo build and found that ThreadPoolMinMaxThreadsEventTest passed on both the Debian 11 and Ubuntu helix machines https://dev.azure.com/dnceng-public/public/_build/results?buildId=682588&view=ms.vss-test-web.build-test-results-tab&runId=16939684&paneView=debug.

mdh1418 avatar May 21 '24 15:05 mdh1418