ipmpsc
ipmpsc copied to clipboard
pthread_cond_timedwait error with EINVAL on macos
This happens when I run tests::bench_ipmpsc.
$ cd ipc-benchmarks
$ RUST_BACKTRACE=1 cargo bench tests::bench_ipmpsc -- --nocapture
...
running 1 test
thread 'main' panicked at 'error receiving: Runtime("timeout_ok(libc::pthread_cond_timedwait(self.0.condition.get(),\n self.0.mutex.get(), &then)) failed: 22")', src/lib.rs:107:27
stack backtrace:
0: rust_begin_unwind
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:493:5
1: std::panicking::begin_panic_fmt
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:435:5
2: test::bench::ns_iter_inner
3: test::bench::Bencher::iter
4: ipc_benchmarks::tests::bench_ipmpsc
5: core::ops::function::FnOnce::call_once
6: core::ops::function::FnMut::call_mut
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/core/src/ops/function.rs:150:5
7: test::bench::Bencher::bench
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/bench.rs:45:9
8: test::bench::benchmark::{{closure}}
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/bench.rs:192:51
9: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panic.rs:322:9
10: std::panicking::try::do_call
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:379:40
11: std::panicking::try
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:343:19
12: std::panic::catch_unwind
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panic.rs:396:14
13: test::bench::benchmark
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/bench.rs:192:18
14: test::run_test
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:489:13
15: test::run_tests
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:339:13
16: test::console::run_tests_console
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/console.rs:289:5
17: test::test_main
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:120:15
18: test::test_main_static
at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:139:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test tests::bench_ipmpsc ... FAILED
The error originates at https://github.com/dicej/ipmpsc/blob/2b4357e57c2c3fbd79159512f19ddf14ed360565/src/lib.rs#L160-L164
A quick search suggests that EINVAL is returned when the tv_nsec is invalid, but it doesn't seem to be the case. I added a println right before the call, and I got:
tv_sec=1922927639
tv_nsec=612420000
Note the error happens non-deterministically. Some iterations succeed. The error does not happen on a Ubuntu machine with the same version of rustc.
Environment: OS: macOS Catalina 10.15.7 (19H15) rustc: nightly-2020-12-09-x86_64-apple-darwin
@dicej I tried to investigate further, but I don't have a clue. I changed the code to give a constant abstime and it still happens. I'm more than happy to help investigate the issue if you could help shed a bit more light.
@skywhale Thanks so much for reporting this, and sorry for the delayed response. I'll take a look at this when I get a chance, probably this weekend.
I've refactored the code a bit so it's using pthread_cond_wait instead of pthread_cond_timedwait, so that at least eliminates the timespec as a factor. However, it's still failing occasionally:
thread 'main' panicked at 'error receiving: Runtime("libc::pthread_cond_wait(self.0.header().condition.get(),\n self.0.header().mutex.get()) failed: 22")', src/lib.rs:106:27
Seems like it doesn't like either the condition or the mutex (or both). And it's a race condition, because sometimes it runs to completion without any error. Will continue studying it.
I've also noticed that tests::aribitrary_case in the main test suite frequently deadlocks forever (with all threads waiting forever and never being notified). I'm guessing that's related even if the symptom is different.
Here's my current theory: unlike on Linux, Android, and Windows, using the same interprocess mutex and/or condition variable from different memory mappings within the same process is not safe on MacOS (and maybe not on the BSDs in general). That's what the tests and benchmarks do: spawn separate threads for each receiver and sender as if they were other processes. I'm guessing there's something in the MacOS/BSD pthread implementation that assumes PTHREAD_PROCESS_SHARED mutexes and condition variables are not aliased in different memory locations within the same process, and although it usually works anyway, it doesn't always.
When I change the code to make the senders and receiver all use the same memory mapping, everything seems to work reliably. That defeats the purpose of the tests, though, since there's no way that can happen when using separate processes.
Next, I'm going to change tests to actually spawn separate processes instead of separate threads and see what happens.
Update: I went ahead and modified the tests and benchmarks to fork processes instead of spawning threads. Unfortunately, that didn't address the issue -- pthread_cond_wait still returns EINVAL unpredictably.
I don't see any way forward from here, unfortunately. I'll update the README to indicate that MacOS is not currently supported until someone figures out how to fix this.
This comment indicates that PTHREAD_PROCESS_SHARED has been broken in MacOS since Lion: https://github.com/bitcoin/bitcoin/issues/19411#issuecomment-712153417.
Thank you so much for looking into this @dicej. It's unfortunate that macOS Posix compliance is partial.
I did a bit more research, and it seems pthread shared-process read/write lock has never been supported on macOS. Their pthread implementation seems to be based on BSD 7.4, which lacks the support of PTHREAD_PROCESS_SHARED attribute as stated in the BUGS section (due to the use of pointers in their structs). It was resolved in BSD 11.0, but it hasn't been picked up by macOS sdk yet.
Hi,
out of curiosity I tried the repro-steps on my mac and it is still failing with the same error - MacOs X 12.4.
I went down the internet rabbit hole and searched about similar pthread errors and see if there are alternatives.
ipc-channel seems to rely on mach ports to pass data from one process to another one.
Would it be possible to use mach port to implement an IP-Mutex? I am not familiar with this, so this may be a silly qestion.
Interestingly, boost has an interprocess_mutex class, which seems to work with MacOs.
- https://github.com/dials/boost/blob/master/boost/interprocess/sync/interprocess_mutex.hpp
However, digging a bit, it seems that on Mac, the implementation falls back to using a spin mutex.
- https://github.com/dials/boost/blob/master/boost/interprocess/detail/workaround.hpp#L52
- https://github.com/dials/boost/blob/master/boost/interprocess/sync/interprocess_mutex.hpp#L42
so... perhaps the same thing should be done in ipmpsc?
Yes, I was wondering about some Mach-specific way to do locking, or else using named pipes just for locking (but still using the ring buffer for actually moving data around).
The spin lock approach could also work. We'd either need to port the Boost code to Rust or create a Rust wrapper for a C++ library that exports the Boost functionality using a C API Rust can talk to. The latter would be easiest, I imagine. Ideally this would be its own crate so others could reuse it.
Note that ipmpsc needs both a mutex and a condition variable implementation; Boost appears to support both of those in its interprocess spin lock suite.
I don't expect I'll have time to work on this myself any time soon, but I'd be happy to review a PR.
I'll see what I can do... no promises :-)