runtime
runtime copied to clipboard
Intermittend failures in ThreadPoolTests on Mono Linux ARM64
Build Information
Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=266460 Build error leg or test failing: System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest Pull request:
Error Message
Fill the error message using known issues guidance.
{
"ErrorMessage": "Exit code was 139 but it should have been 42",
"BuildRetry": false,
"ExcludeConsoleLog": true
}
Report
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 0 | 16 | 68 |
Known issue validation
Build: :mag_right: Result validation: :warning: Validation could not be done without an Azure DevOps build URL on the issue. Please add it to the "Build: :mag_right:" line.
Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.
Issue Details
Build Information
Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=266460 Build error leg or test failing: System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest Pull request:
Error Message
Fill the error message using known issues guidance.
{
"ErrorMessage": "Exit code was 139 but it should have been 42",
"BuildRetry": false,
"ExcludeConsoleLog": true
}
| Author: | jkoritzinsky |
|---|---|
| Assignees: | - |
| Labels: |
|
| Milestone: | - |
Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.
Issue Details
Build Information
Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=266460 Build error leg or test failing: System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest Pull request:
Error Message
Fill the error message using known issues guidance.
{
"ErrorMessage": "Exit code was 139 but it should have been 42",
"BuildRetry": false,
"ExcludeConsoleLog": true
}
Report
| Build | Definition | Test | Pull Request |
|---|---|---|---|
| 266460 | dotnet/runtime | System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest | dotnet/runtime#85941 |
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 1 | 1 | 1 |
| Author: | jkoritzinsky |
|---|---|
| Assignees: | - |
| Labels: |
|
| Milestone: | - |
@lambdageek - didn't we have a similar intermittent threading issue somewhere?
here I have one on x64, with mono_dump_native_crash_info
=================================================================
External Debugger Dump:
=================================================================
/tmp/mono-gdb-commands.209:1: Error in sourced command file:
ptrace: Operation not permitted.
[New LWP 210]
[New LWP 211]
[New LWP 212]
[New LWP 213]
Missing separate debuginfo for /root/helix/work/correlation/dotnet
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9a/ee12662a8f8c3b14915491ceba0663c152d89a.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /root/helix/work/correlation/host/fxr/9.0.0/libhostfxr.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/85/5812b12a0494863911e4320a48b9c4d4450817.debug
Missing separate debuginfo for /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libhostpolicy.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ea/2c76e0d159d038a7eab2f0cc9f07155aa572ef.debug
Dwarf Error: Cannot handle DW_FORM_strx1 in DWARF reader [in module /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so.dbg]
Dwarf Error: Cannot handle DW_FORM_strx1 in DWARF reader [in module /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libSystem.Native.so.dbg]
0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
Id Target Id Frame
* 1 Thread 0x7fe2b0ab5740 (LWP 209) "dotnet" 0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
2 Thread 0x7fe2ae7ff700 (LWP 210) "SGen worker" 0x00007fe2b067f45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
3 Thread 0x7fe2aee11700 (LWP 211) ".NET EventPipe" 0x00007fe2af89d301 in poll () from /lib64/libc.so.6
4 Thread 0x7fe2ac9ff700 (LWP 212) "Finalizer" 0x00007fe2b0681da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
5 Thread 0x7fe2a5f94700 (LWP 213) ".NET Long Runni" 0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
Thread 5 (Thread 0x7fe2a5f94700 (LWP 213)):
#0 0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
#1 0x00007fe2af3f2212 in mono_dump_native_crash_info () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2 0x00007fe2af39946e in mono_handle_native_crash () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3 0x00007fe2af3001e1 in mono_sigsegv_signal_handler_debug () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4 <signal handler called>
#5 0x00007fe2af52f412 in mono_profiler_set_jit_begin_callback () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#6 0x00007fe2af468a08 in EventPipeEtwCallbackDotNETRuntime () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#7 0x00007fe2af45a3e9 in provider_invoke_callback () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#8 0x00007fe2af45ab34 in disable_helper () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#9 0x00000000418b9cfa in ?? ()
#10 0x000000000000000a in ?? ()
#11 0x00007fe2af617150 in ?? ()
#12 0x00007fe2a5f93848 in ?? ()
#13 0x0000000000000001 in ?? ()
#14 0x0000000000000000 in ?? ()
Thread 4 (Thread 0x7fe2ac9ff700 (LWP 212)):
#0 0x00007fe2b0681da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1 0x00007fe2b0681e98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x00007fe2af55f9e8 in finalizer_thread () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3 0x00007fe2af538669 in start_wrapper () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4 0x00007fe2b06791ca in start_thread () from /lib64/libpthread.so.0
#5 0x00007fe2af7b2e73 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fe2aee11700 (LWP 211)):
#0 0x00007fe2af89d301 in poll () from /lib64/libc.so.6
#1 0x00007fe2af484ece in ds_ipc_poll () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2 0x00007fe2af481de8 in ds_ipc_stream_factory_get_next_available_stream () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3 0x00007fe2af48044f in server_thread () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4 0x00007fe2af484341 in ep_rt_thread_mono_start_func () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#5 0x00007fe2b06791ca in start_thread () from /lib64/libpthread.so.0
#6 0x00007fe2af7b2e73 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fe2ae7ff700 (LWP 210)):
#0 0x00007fe2b067f45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fe2af5c7323 in thread_func () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2 0x00007fe2b06791ca in start_thread () from /lib64/libpthread.so.0
#3 0x00007fe2af7b2e73 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fe2b0ab5740 (LWP 209)):
#0 0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
#1 0x00007fe2af3f2212 in mono_dump_native_crash_info () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2 0x00007fe2af39946e in mono_handle_native_crash () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3 0x00007fe2af3f1981 in sigabrt_signal_handler () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4 <signal handler called>
#5 0x00007fe2af7c7acf in raise () from /lib64/libc.so.6
#6 0x00007fe2af79aea5 in abort () from /lib64/libc.so.6
#7 0x00007fe2af48fb05 in monoeg_assert_abort () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#8 0x00007fe2af4a0106 in mono_log_write_logfile () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#9 0x00007fe2af48fec6 in monoeg_g_logv () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#10 0x00007fe2af48fffe in monoeg_g_log () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#11 0x00007fe2af467b5b in ep_rt_mono_fini () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#12 0x00007fe2af304601 in mini_cleanup () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#13 0x00007fe2af4e5dbb in ves_icall_System_Environment_Exit () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#14 0x00000000418b8bba in ?? ()
#15 0x0000000000000000 in ?? ()
[Inferior 1 (process 209) detached]
@lateralusX some kind of race condition in eventpipe shutdown, looks like.
Note that we don't shut down threads in Environment.Exit () so there is a possibility of races.
ep_rt_mono_fini assumes that all threads that might run EventPipe code has been stopped, so if there are still threads that can call into EventPipe at that point, it will race with shutdown logic. In this case Thread 5 is disabling a managed EventSource provider that will use the profiler callbacks, that is reset by ep_rt_mono_fini since it assumes no other threads are running at that point, the assert in Thread 1 happens since it will clean up a mutex, that is currently held by Thread 5.
On CoreClr/NativeAot we don't have any cleanup done in ep_rt_shutdown and those runtimes will leak resources, but on Mono we do cleanup of runtime resources. We probably need to detect if the shutdown is triggered in a way where other managed threads migth still be running when calling ep_rt_shutdown, if so we would probably need to leak these resources.
This should be fixed on https://github.com/dotnet/runtime/pull/99332
Looks like the test passed on the linux-arm64 Debug Mono_MiniJIT_LibrariesTests lane on a mono related PR in the general runtime pipeline. Didn't find the testResults.xml, since the helix job only had a console, but I was able to filter the tests tab on the azdo build and found that ThreadPoolMinMaxThreadsEventTest passed on both the Debian 11 and Ubuntu helix machines https://dev.azure.com/dnceng-public/public/_build/results?buildId=682588&view=ms.vss-test-web.build-test-results-tab&runId=16939684&paneView=debug.