mono_os_mutex_destroy: pthread_mutex_destroy failed
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=542041
Build error leg or test failing: tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd
Pull request: https://github.com/dotnet/runtime/pull/97553
Error Message
Fill the error message using step by step known issues guidance.
{
"ErrorMessage": "mono_os_mutex_destroy: pthread_mutex_destroy failed with \"Resource busy\" (16)",
"ErrorPattern": "",
"BuildRetry": false,
"ExcludeConsoleLog": false
}
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=542041
Error message validated: mono_os_mutex_destroy: pthread_mutex_destroy failed with "Resource busy" (16)
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 1/26/2024 5:28:29 PM UTC
Report
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 0 | 2 | 9 |
Stack trace:
mono_os_mutex_destroy: pthread_mutex_destroy failed with "Resource busy" (16)
=================================================================
Native Crash Reporting
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
=================================================================
Native stacktrace:
=================================================================
0x1074d0785 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_dump_native_crash_info
0x10746e2be - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_handle_native_crash
0x1076c6858 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : sigabrt_signal_handler.cold.1
0x1074d0100 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_runtime_setup_stat_profiler
0x7ff8047cadfd - /usr/lib/system/libsystem_platform.dylib : _sigtramp
0x0 - Unknown
0x7ff804700d14 - /usr/lib/system/libsystem_c.dylib : abort
0x10757d238 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : monoeg_assert_abort
0x10758b7ca - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_log_write_logfile
0x10757d6a8 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : monoeg_g_logv
0x10757d842 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : monoeg_g_log
0x10755247b - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : ep_rt_mono_fini
0x1073c32ac - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mini_cleanup
0x1074269f8 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_main
0x1074ab29d - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : monovm_execute_assembly
0x106e4e702 - /private/tmp/helix/working/C2010A91/p/corerun : _ZL3runRK13configuration
0x106e4aa72 - /private/tmp/helix/working/C2010A91/p/corerun : main
0x10c3f352e - Unknown
So it's EventPipe cleanup. I think we saw something similar recently...
Might be related to https://github.com/dotnet/runtime/issues/85960#issuecomment-1844892277
Possibly addressed with #96936 @davmason
The callstack pasted above is not related to #96936; it is crashing in ep_rt_mono_fini, which is Mono-specific cleanup. It's likely a similar issue though, just with a different resource.
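To make the failure mode concrete: the log line in this report has the shape of a checked wrapper around pthread_mutex_destroy that treats any non-zero return code (here 16, "Resource busy") as fatal. The following is only a minimal sketch of such a wrapper. The name checked_mutex_destroy and its caller parameter are hypothetical, it is not Mono's actual code, and it returns the error code instead of aborting so that the caller can decide what to do.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical checked wrapper (an assumption for illustration, not Mono's code).
 * Reports a failed destroy in the same shape as the message in this report and
 * returns the pthread error code instead of aborting the process. */
static int
checked_mutex_destroy (pthread_mutex_t *mutex, const char *caller)
{
    assert (mutex != NULL);

    int res = pthread_mutex_destroy (mutex);
    if (res != 0) {
        /* e.g. caller: pthread_mutex_destroy failed with "Resource busy" (16) */
        fprintf (stderr, "%s: pthread_mutex_destroy failed with \"%s\" (%d)\n",
                 caller, strerror (res), res);
    }
    return res;
}
```

Destroying an unlocked, initialized mutex returns 0. Destroying a mutex that another thread still holds is what can produce the "Resource busy" code seen here, which is consistent with a thread still using EventPipe during shutdown.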
@lambdageek would you mind taking a second look and/or providing pointers to @davmason for next steps?
Next steps are to do less cleanup in ep_rt_mono_fini. As Johan mentioned in https://github.com/dotnet/runtime/issues/85960#issuecomment-1844892277:
ep_rt_mono_fini assumes that all threads that might run EventPipe code have been stopped, so if there are still threads that can call into EventPipe at that point, it will race with shutdown logic.
On CoreClr/NativeAot we don't have any cleanup done in ep_rt_shutdown and those runtimes will leak resources, but on Mono we do cleanup of runtime resources. We probably need to detect if the shutdown is triggered in a way where other managed threads might still be running when calling ep_rt_shutdown; if so, we would probably need to leak these resources.
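The "leak instead of destroy" idea can be sketched for a single lock as follows. This is a hedged illustration, not a proposed patch: destroy_mutex_or_leak is a hypothetical helper, and it uses pthread_mutex_trylock only as a cheap probe for whether the mutex is currently held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical helper (not a Mono or EventPipe API): destroy the mutex only
 * if nobody holds it right now. Returns true if the mutex was destroyed and
 * false if it was deliberately leaked. */
static bool
destroy_mutex_or_leak (pthread_mutex_t *mutex)
{
    assert (mutex != NULL);

    /* trylock does not block; it returns EBUSY if the mutex is held. */
    int res = pthread_mutex_trylock (mutex);
    if (res != 0)
        return false; /* held or otherwise unusable: leak it */

    /* We own the lock now; release it before destroying it. */
    pthread_mutex_unlock (mutex);
    return pthread_mutex_destroy (mutex) == 0;
}
```

This only narrows the race, it does not close it: another thread can still take the lock between the unlock and the destroy, so a probe like this would not replace knowing whether other threads can still call into EventPipe.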
I'm not sure what Johan had in mind, but one possibility is just to call mono_runtime_is_shutting_down and, if it is FALSE, make ep_rt_mono_fini exit early without cleaning up. But I'm not 100% certain that this will account for all situations where EventPipe might be shutting down but managed threads are still running.
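A minimal sketch of that early exit, under stated assumptions: sketch_shutting_down stands in for whatever mono_runtime_is_shutting_down reports, and sketch_ep_fini stands in for ep_rt_mono_fini. All names prefixed with sketch_ are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for runtime state; none of these names exist in Mono. */
static atomic_bool sketch_shutting_down = false;
static int sketch_cleanup_runs = 0;

/* Stand-in for the runtime marking that shutdown has started. */
static void
sketch_begin_shutdown (void)
{
    atomic_store (&sketch_shutting_down, true);
}

/* Stand-in for the proposed guard in the fini path.
 * Returns true if cleanup ran, false if resources were intentionally leaked. */
static bool
sketch_ep_fini (void)
{
    if (!atomic_load (&sketch_shutting_down))
        return false; /* shutdown never started: skip cleanup, leak resources */

    assert (sketch_cleanup_runs == 0); /* cleanup is expected to run at most once */
    sketch_cleanup_runs++;             /* real code would free locks etc. here */
    return true;
}
```

The guard is best effort only. A flag like this says that shutdown has started, not that every managed thread has stopped, so the cleanup branch could still race with threads that are running EventPipe code.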
@lambdageek, I spent a little time looking at the mono code and I can't convince myself that mono_runtime_is_shutting_down () guarantees that we are not running managed code. Looking at mono_runtime_try_shutdown () it looks like we just stop creating new threads and don't have any guarantee that current threads are gone. Am I missing something?
@davmason you're probably not missing anything. I don't think we have a way to know when all the existing managed threads are really stopped/gone (we used to have mono_thread_suspend_all_other_threads in "classic" Mono, but in modern .NET we don't try to stop other threads anymore before exiting - it wouldn't work on platforms like WASM, anyway, where we don't have signals).
My suggestion was more "best effort" - if we know we haven't started shutting down at all, don't even try to do any cleanup in ep_rt_mono_fini. If, on the other hand, shutdown has started, we may try to clean up, but we could still hit the deadlock.
it wouldn't work on platforms like WASM, anyway, where we don't have signals).
Probably off-topic here, but we could kill all threads on emscripten via JavaScript, if we are on UI thread. I'm working on it here @lambdageek