libdatadog icon indicating copy to clipboard operation
libdatadog copied to clipboard

Add `initialize_before_fork` functions

Open morrisonlevi opened this issue 3 years ago • 8 comments
trafficstars

What does this PR do?

Adds initialize_before_fork() functions.

Also re-organizes the includes and mods section a bit to group things a bit more.

Motivation

From one of the function's documentation:

/// On Apple platforms in particular, some things need to be initialized
/// before a fork, and ideally before threads are created.
/// Users may run into errors like the following if they do not:
/// > objc[25938]: +[__NSCFConstantString initialize] may have been in
/// > progress in another thread when fork() was called. We cannot safely
/// > call it or ignore it in the fork() child process. Crashing instead. Set
/// > a breakpoint on objc_initializeAfterForkError to debug.

Additional Notes

If the codebase using libdatadog might fork, then this should be called before forking, at least on Apple platforms.

The profiling::exporter module re-exports the ddcommon::connector's function, and the profiling-ffi::exporter uses the profiling::exporter version. This organization made sense to me at the time but I am having trouble defending it so I'm happy to hear your thoughts.

How to test the change?

Call one of the initialize_before_fork functions on an Apple platform, then fork, and note that you don't crash, whereas on older versions of libdatadog you might (depends potentially on if anything else has fully initialized the Apple frameworks). Note that you don't need to use the return value for this use case, though you could if you were interested in logging failures.

morrisonlevi avatar Sep 27 '22 16:09 morrisonlevi

Ahh... This is interesting, since I've seen it as well before: https://github.com/DataDog/libddprof/issues/61 .

I do have a small suggestion. Since the intention of the libdatadog callers is not to load_root_certs (that's an implementation detail for libdatadog), do you think we could expose a safely_initialize_macos_resources() (safely_initialize_before_fork()?) or something similar that would then call into this function?

This way we would abstract what's going on exactly, and could change the details whenever, without having such an internal detail of libdatadog need to be exposed directly.

(Also, it would be really nice if this was exposed via the ffi! :P)

ivoanjo avatar Sep 28 '22 07:09 ivoanjo

Interestingly, Ruby recently added a workaround that sounds similar: https://github.com/ruby/ruby/commit/e3cc1a6cae0e6c88c04cd54c3afa3c022bb6772c

ivoanjo avatar Sep 28 '22 07:09 ivoanjo

Indeed, there is a whole class of types out there that can trigger these issues. I don't know the Apple ecosystem well enough to say, "yes, definitely, we've fixed all the things" so I guess we'll be playing whack-a-mole. Exposing it through FFI is important too, good call.

morrisonlevi avatar Sep 28 '22 15:09 morrisonlevi

I've updated this to your suggestions, @ivoanjo, and also to resolve a conflict with the main branch 👍🏻

morrisonlevi avatar Sep 28 '22 21:09 morrisonlevi

I was able to re-test this with Ruby and I still get crashes in the child process. Here's the relevant part of the stack trace for the crash:

Thread 0 Crashed:: Dispatch queue: com.apple.security.keychain-cache-queue
0   libsystem_kernel.dylib        	0x00007fff2049791e __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff204c65bd pthread_kill + 263
2   libsystem_c.dylib             	0x00007fff2041b406 abort + 125
3   libruby.2.7.dylib             	0x00000001043d5359 die + 9
4   libruby.2.7.dylib             	0x00000001043d54f3 rb_bug_for_fatal_signal + 403
5   libruby.2.7.dylib             	0x00000001044eac12 sigsegv + 82
6   libsystem_platform.dylib      	0x00007fff2050bd7d _sigtramp + 29
7   ???                           	0x00007f89ed526190 0 + 140230368846224
8   com.apple.security            	0x00007fff225b9b79 invocation function for block in Security::KeychainCore::StorageManager::tickleKeychain(Security::KeychainCore::KeychainImpl*) + 76
9   libdispatch.dylib             	0x00007fff2031c806 _dispatch_client_callout + 8
10  libdispatch.dylib             	0x00007fff203295ce _dispatch_lane_barrier_sync_invoke_and_complete + 60
11  com.apple.security            	0x00007fff225b9a31 Security::KeychainCore::StorageManager::tickleKeychain(Security::KeychainCore::KeychainImpl*) + 535
12  com.apple.security            	0x00007fff2237c14e Security::KeychainCore::KCCursorImpl::next(Security::KeychainCore::Item&) + 302
13  com.apple.security            	0x00007fff225c663a Security::KeychainCore::TrustSettings::findQualifiedCerts(std::__1::vector<Security::KeychainCore::Keychain, std::__1::allocator<Security::KeychainCore::Keychain> >&, bool, bool, cssm_data const*, char const*, unsigned int, __CFArray*) + 252
14  com.apple.security            	0x00007fff225b560c SecTrustSettingsCopyCertificates + 505
15  libdatadog_profiling.dylib    	0x0000000106217dae security_framework::trust_settings::TrustSettings::iter::h63e45bed18fb1639 + 25 (trust_settings.rs:104) [inlined]
16  libdatadog_profiling.dylib    	0x0000000106217dae rustls_native_certs::macos::load_native_certs::h13356bc16655a82b + 136 (macos.rs:24) [inlined]
17  libdatadog_profiling.dylib    	0x0000000106217dae core::ops::function::FnOnce::call_once::hec68719506ba1e77 + 136 (function.rs:248) [inlined]
18  libdatadog_profiling.dylib    	0x0000000106217dae core::option::Option$LT$T$GT$::unwrap_or_else::h5e0faed1e3ce7d22 + 136 (option.rs:825) [inlined]
19  libdatadog_profiling.dylib    	0x0000000106217dae rustls_native_certs::load_native_certs::h90685e5da29cf340 + 314 (lib.rs:50)
20  libdatadog_profiling.dylib    	0x00000001061811c8 ddcommon::connector::load_root_certs::h85c62c2c73d6cf4d + 57 (mod.rs:84)
21  libdatadog_profiling.dylib    	0x00000001061235fd ddcommon::connector::build_https_connector::he7e610d954436c28 + 5 (mod.rs:69) [inlined]
22  libdatadog_profiling.dylib    	0x00000001061235fd ddcommon::connector::Connector::new::ha077e2da0c75f5e3 + 5 (mod.rs:38) [inlined]
23  libdatadog_profiling.dylib    	0x00000001061235fd datadog_profiling::exporter::Exporter::new::h83d1b824a33f1a6f + 110 (mod.rs:254) [inlined]
24  libdatadog_profiling.dylib    	0x00000001061235fd datadog_profiling::exporter::ProfileExporter::new::h7270b40f967a9bd3 + 110 (mod.rs:128) [inlined]
25  libdatadog_profiling.dylib    	0x00000001061235fd datadog_profiling_ffi::exporter::profile_exporter_new::_$u7b$$u7b$closure$u7d$$u7d$::hbe34ceef863b4739 + 2580 (exporter.rs:150) [inlined]
26  libdatadog_profiling.dylib    	0x00000001061235fd ddog_ProfileExporter_new + 4133 (exporter.rs:144)
27  ddtrace_profiling_native_extension.2.7.6_x86_64-darwin20.bundle	0x0000000106115791 create_exporter + 1457 (http_transport.c:102)
28  ddtrace_profiling_native_extension.2.7.6_x86_64-darwin20.bundle	0x0000000106114926 _native_validate_exporter + 70 (http_transport.c:79)

I did make sure to call the new method, and the result is OK. Maybe the issue is that we're still calling load_root_certs in the fork, and perhaps we should just cache that information?

ivoanjo avatar Sep 30 '22 09:09 ivoanjo

I think in osx you can not use keychains after fork

  • man page of fork: https://www.unix.com/man-page/osx/2/fork/
     There are limits to what you can do in the child process. To be totally safe you should restrict yourself to only executing async-signal safe operations until such time as one of the exec functions is called.  All APIs, including global data symbols, in any framework or library should be assumed to be unsafe after a fork() unless explicitly documented to be safe or async-signal safe.  If you need to use these frameworks in the child process, you must exec.  In this situation it is reasonable to exec yourself.
  • The idea would be to drop everything after fork ?
  • Similar discussions for python: https://github.com/jaraco/keyring/issues/281

r1viollet avatar Oct 04 '22 12:10 r1viollet

Quoted differently so it's easier to read:

There are limits to what you can do in the child process. To be totally safe you should restrict yourself to only executing async-signal safe operations until such time as one of the exec functions is called. All APIs, including global data symbols, in any framework or library should be assumed to be unsafe after a fork() unless explicitly documented to be safe or async-signal safe. If you need to use these frameworks in the child process, you must exec. In this situation it is reasonable to exec yourself.

I believe this is why Mac wants you to initialize these things before you fork, and detects such issues and crashes otherwise. We should be fine to initialize the things we need, though since we're not experts it may take some time to iron all of them out.

I'm also fine to not support HTTPS on Mac specifically if it's feasible to implement.

Lastly, fork is terrible and I pray for the days that it can be completely deprecated in favor of newer things like posix_spawn or io_uring_spawn.

morrisonlevi avatar Oct 06 '22 15:10 morrisonlevi

I think we can close until we go back to implementing macOS support.

r1viollet avatar Mar 09 '23 15:03 r1viollet