runtime mono_os_mutex_destroy: pthread_mutex

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=542041
Build error leg or test failing: tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd
Pull request: https://github.com/dotnet/runtime/pull/97553

Error Message

{
  "ErrorMessage": "mono_os_mutex_destroy: pthread_mutex_destroy failed with \"Resource busy\" (16)",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=542041
Error message validated: mono_os_mutex_destroy: pthread_mutex_destroy failed with "Resource busy" (16)
Result validation: Known issue matched with the provided build.
Validation performed at: 1/26/2024 5:28:29 PM UTC

Report

Build	Definition	Test	Pull Request
561807	dotnet/runtime	tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd
558019	dotnet/runtime	tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd	dotnet/runtime#98138
548257	dotnet/runtime	tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd
548264	dotnet/runtime	tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd	dotnet/runtime#96806
544478	dotnet/runtime	System.Threading.Tests.EtwTests.WaitHandleWaitEventTest	dotnet/runtime#97644
544295	dotnet/runtime	tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd	dotnet/runtime#97637
542512	dotnet/runtime	tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd
542041	dotnet/runtime	tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd	dotnet/runtime#97553
540861	dotnet/runtime	tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.cmd	dotnet/runtime#97441

Summary

24-Hour Hit Count	7-Day Hit Count	1-Month Count
0	2	9

Jan 26 '24 17:01 lewing

Stack trace:

mono_os_mutex_destroy: pthread_mutex_destroy failed with "Resource busy" (16)

=================================================================
	Native Crash Reporting
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
	Native stacktrace:
=================================================================
	0x1074d0785 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_dump_native_crash_info
	0x10746e2be - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_handle_native_crash
	0x1076c6858 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : sigabrt_signal_handler.cold.1
	0x1074d0100 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_runtime_setup_stat_profiler
	0x7ff8047cadfd - /usr/lib/system/libsystem_platform.dylib : _sigtramp
	0x0 - Unknown
	0x7ff804700d14 - /usr/lib/system/libsystem_c.dylib : abort
	0x10757d238 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : monoeg_assert_abort
	0x10758b7ca - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_log_write_logfile
	0x10757d6a8 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : monoeg_g_logv
	0x10757d842 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : monoeg_g_log
	0x10755247b - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : ep_rt_mono_fini
	0x1073c32ac - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mini_cleanup
	0x1074269f8 - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : mono_main
	0x1074ab29d - /private/tmp/helix/working/C2010A91/p/libcoreclr.dylib : monovm_execute_assembly
	0x106e4e702 - /private/tmp/helix/working/C2010A91/p/corerun : _ZL3runRK13configuration
	0x106e4aa72 - /private/tmp/helix/working/C2010A91/p/corerun : main
	0x10c3f352e - Unknown

So it's EventPipe cleanup. I think we saw something similar recently...
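
For context on the error itself, here is a minimal stand-alone sketch of how a "Resource busy" result can come out of pthread_mutex_destroy: the mutex is destroyed while it is still locked. This is not Mono code, and the function name destroy_while_locked is hypothetical. POSIX leaves destroying a locked mutex undefined, so the sketch assumes an implementation that reports EBUSY, as the macOS run in the log above did; other implementations may simply return 0.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: locks a mutex and tries to destroy it while it is
 * still held. Returns whatever pthread_mutex_destroy reported. */
int destroy_while_locked(void)
{
    pthread_mutex_t m;
    int rc;

    rc = pthread_mutex_init(&m, NULL);
    assert(rc == 0);
    rc = pthread_mutex_lock(&m);
    assert(rc == 0);

    /* POSIX leaves this undefined; some implementations report EBUSY. */
    rc = pthread_mutex_destroy(&m);
    if (rc != 0) {
        printf("pthread_mutex_destroy failed with \"%s\" (%d)\n",
               strerror(rc), rc);
        /* The mutex is still valid here: release it, then destroy it. */
        pthread_mutex_unlock(&m);
        pthread_mutex_destroy(&m);
    }
    return rc;
}
```

In the crash above the mutex is presumably busy because another thread is still using it during shutdown, not because the destroying thread holds it; the single-threaded version only keeps the sketch deterministic.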

Jan 26 '24 17:01 lambdageek

Might be related to https://github.com/dotnet/runtime/issues/85960#issuecomment-1844892277

Jan 26 '24 17:01 lambdageek

Possibly addressed with #96936 @davmason

Feb 11 '24 20:02 tommcdon

The callstack pasted above is not related to #96936, this is crashing in ep_rt_mono_fini which is mono specific cleanup. It's likely a similar issue though, just with a different resource.

Feb 13 '24 08:02 davmason

The callstack pasted above is not related to #96936, this is crashing in ep_rt_mono_fini which is mono specific cleanup. It's likely a similar issue though, just with a different resource.

@lambdageek would you mind taking a second look and/or providing pointers to @davmason for next steps?

Feb 26 '24 16:02 tommcdon

Next steps are to do less cleanup in ep_rt_mono_fini. As Johan mentioned in https://github.com/dotnet/runtime/issues/85960#issuecomment-1844892277:

ep_rt_mono_fini assumes that all threads that might run EventPipe code have been stopped, so if there are still threads that can call into EventPipe at that point, they will race with the shutdown logic.

On CoreCLR/NativeAOT we don't do any cleanup in ep_rt_shutdown, so those runtimes leak resources, but on Mono we do clean up runtime resources. We probably need to detect whether shutdown was triggered in a way where other managed threads might still be running when ep_rt_shutdown is called; if so, we would probably need to leak these resources.

I'm not sure what Johan had in mind, but one possibility is just to call mono_runtime_is_shutting_down and, if it returns FALSE, make ep_rt_mono_fini exit early without cleaning up. But I'm not 100% certain that this will account for all situations where EventPipe might be shutting down while managed threads are still running.
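
The early-exit idea can be pictured with a minimal stand-alone sketch; this is not the actual Mono implementation. Every name prefixed with sketch_ is hypothetical: the flag stands in for whatever mono_runtime_is_shutting_down would report in the real runtime, and the counter stands in for the real resource teardown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the runtime's "shutdown has started" state. */
bool sketch_runtime_shutting_down = false;

/* Hypothetical counter recording how often teardown actually ran. */
int sketch_resources_freed = 0;

/* Hypothetical stand-in for destroying EventPipe locks and buffers. */
static void sketch_free_eventpipe_resources(void)
{
    /* Teardown is only expected once shutdown has started. */
    assert(sketch_runtime_shutting_down);
    sketch_resources_freed++;
}

/* Sketch of a guarded fini: returns true if cleanup ran, false if the
 * resources were deliberately leaked. */
bool sketch_ep_rt_fini(void)
{
    if (!sketch_runtime_shutting_down) {
        /* Shutdown has not started, so other threads may still be calling
         * into EventPipe. Leak the resources and let process exit reclaim
         * them. */
        return false;
    }
    /* Best effort only: threads may still be running here too. */
    sketch_free_eventpipe_resources();
    return true;
}
```

With the flag left at false, sketch_ep_rt_fini returns false and frees nothing; once the flag is set, it runs the teardown. As the surrounding discussion notes, a check like this narrows the race but does not close it.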

Feb 26 '24 16:02 lambdageek

@lambdageek, I spent a little time looking at the mono code and I can't convince myself that mono_runtime_is_shutting_down () guarantees that we are not running managed code. Looking at mono_runtime_try_shutdown () it looks like we just stop creating new threads and don't have any guarantee that current threads are gone. Am I missing something?

Feb 27 '24 20:02 davmason

@davmason you're probably not missing anything. I don't think we have a way to know when all the existing managed threads are really stopped/gone (we used to have mono_thread_suspend_all_other_threads in "classic" Mono, but in modern .NET we don't try to stop other threads anymore before exiting - it wouldn't work on platforms like WASM, anyway, where we don't have signals).

My suggestion was more "best effort" - if we know we haven't started shutting down at all, don't even try to do any cleanup in ep_rt_mono_fini. If, on the other hand, shutdown has started, we may possibly try to clean up, but we could still hit the deadlock.

Feb 27 '24 20:02 lambdageek

it wouldn't work on platforms like WASM, anyway, where we don't have signals).

Probably off-topic here, but we could kill all threads on emscripten via JavaScript if we are on the UI thread. I'm working on it here @lambdageek

Apr 09 '24 10:04 pavelsavara