ipmpsc pthread_cond_timedwait error with EINVAL on macos

This happens when I run tests::bench_ipmpsc.

$ cd ipc-benchmarks
$ RUST_BACKTRACE=1 cargo bench tests::bench_ipmpsc -- --nocapture
...
running 1 test
thread 'main' panicked at 'error receiving: Runtime("timeout_ok(libc::pthread_cond_timedwait(self.0.condition.get(),\n                                        self.0.mutex.get(), &then)) failed: 22")', src/lib.rs:107:27
stack backtrace:
   0: rust_begin_unwind
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:435:5
   2: test::bench::ns_iter_inner
   3: test::bench::Bencher::iter
   4: ipc_benchmarks::tests::bench_ipmpsc
   5: core::ops::function::FnOnce::call_once
   6: core::ops::function::FnMut::call_mut
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/core/src/ops/function.rs:150:5
   7: test::bench::Bencher::bench
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/bench.rs:45:9
   8: test::bench::benchmark::{{closure}}
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/bench.rs:192:51
   9: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panic.rs:322:9
  10: std::panicking::try::do_call
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:379:40
  11: std::panicking::try
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:343:19
  12: std::panic::catch_unwind
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panic.rs:396:14
  13: test::bench::benchmark
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/bench.rs:192:18
  14: test::run_test
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:489:13
  15: test::run_tests
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:339:13
  16: test::console::run_tests_console
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/console.rs:289:5
  17: test::test_main
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:120:15
  18: test::test_main_static
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:139:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test tests::bench_ipmpsc            ... FAILED

The error originates at https://github.com/dicej/ipmpsc/blob/2b4357e57c2c3fbd79159512f19ddf14ed360565/src/lib.rs#L160-L164

A quick search suggests that EINVAL is returned when the tv_nsec is invalid, but it doesn't seem to be the case. I added a println right before the call, and I got:

tv_sec=1922927639
tv_nsec=612420000

Note the error happens non-deterministically. Some iterations succeed. The error does not happen on a Ubuntu machine with the same version of rustc.

Environment: OS: macOS Catalina 10.15.7 (19H15) rustc: nightly-2020-12-09-x86_64-apple-darwin

Dec 10 '20 02:12 skywhale

@dicej I tried to investigate further, but I don't have a clue. I changed the code to give a constant abstime and it still happens. I'm more than happy to help investigate the issue if you could help shed a bit more light.

Dec 10 '20 02:12 skywhale

@skywhale Thanks so much for reporting this, and sorry for the delayed response. I'll take a look at this when I get a chance, probably this weekend.

Dec 15 '20 16:12 dicej

I've refactored the code a bit so it's using pthread_cond_wait instead of pthread_cond_timedwait, so that at least eliminates the timespec as a factor. However, it's still failing occasionally:

thread 'main' panicked at 'error receiving: Runtime("libc::pthread_cond_wait(self.0.header().condition.get(),\n                        self.0.header().mutex.get()) failed: 22")', src/lib.rs:106:27

Seems like it doesn't like either the condition or the mutex (or both). And it's a race condition, because sometimes it runs to completion without any error. Will continue studying it.

Dec 19 '20 18:12 dicej

I've also noticed that tests::aribitrary_case in the main test suite frequently deadlocks forever (with all threads waiting forever and never being notified). I'm guessing that's related even if the symptom is different.

Here's my current theory: unlike on Linux, Android, and Windows, using the same interprocess mutex and/or condition variable from different memory mappings within the same process is not safe on MacOS (and maybe not on the BSDs in general). That's what the tests and benchmarks do: spawn separate threads for each receiver and sender as if they were other processes. I'm guessing there's something in the MacOS/BSD pthread implementation that assumes PTHREAD_PROCESS_SHARED mutexes and condition variables are not aliased in different memory locations within the same process, and although it usually works anyway, it doesn't always.

When I change the code to make the senders and receiver all use the same memory mapping, everything seems to work reliably. That defeats the purpose of the tests, though, since there's no way that can happen when using separate processes.

Next, I'm going to change tests to actually spawn separate processes instead of separate threads and see what happens.

Dec 19 '20 20:12 dicej

Update: I went ahead and modified the tests and benchmarks to fork processes instead of spawning threads. Unfortunately, that didn't address the issue -- pthread_cond_wait still returns EINVAL unpredictably.

I don't see any way forward from here, unfortunately. I'll update the README to indicate that MacOS is not currently supported until someone figures out how to fix this.

Dec 23 '20 05:12 dicej

This comment indicates that PTHREAD_PROCESS_SHARED has been broken in MacOS since Lion: https://github.com/bitcoin/bitcoin/issues/19411#issuecomment-712153417.

Dec 23 '20 05:12 dicej

Thank you so much for looking into this @dicej. It's unfortunate that macOS Posix compliance is partial.

I did a bit more research, and it seems pthread shared-process read/write lock has never been supported on macOS. Their pthread implementation seems to be based on BSD 7.4, which lacks the support of PTHREAD_PROCESS_SHARED attribute as stated in the BUGS section (due to the use of pointers in their structs). It was resolved in BSD 11.0, but it hasn't been picked up by macOS sdk yet.

Dec 28 '20 02:12 skywhale

Hi, out of curiosity I tried the repro-steps on my mac and it is still failing with the same error - MacOs X 12.4.

I went down the internet rabbit hole and searched about similar pthread errors and see if there are alternatives. ipc-channel seems to rely on mach ports to pass data from one process to another one. Would it be possible to use mach port to implement an IP-Mutex? I am not familiar with this, so this may be a silly qestion.

Interestingly, boost has an interprocess_mutex class, which seems to work with MacOs.

https://github.com/dials/boost/blob/master/boost/interprocess/sync/interprocess_mutex.hpp

However, digging a bit, it seems that on Mac, the implementation falls back to using a spin mutex.

https://github.com/dials/boost/blob/master/boost/interprocess/detail/workaround.hpp#L52
https://github.com/dials/boost/blob/master/boost/interprocess/sync/interprocess_mutex.hpp#L42

so... perhaps the same thing should be done in ipmpsc?

Jun 30 '22 21:06 remifontan

Yes, I was wondering about some Mach-specific way to do locking, or else using named pipes just for locking (but still using the ring buffer for actually moving data around).

The spin lock approach could also work. We'd either need to port the Boost code to Rust or create a Rust wrapper for a C++ library that exports the Boost functionality using a C API Rust can talk to. The latter would be easiest, I imagine. Ideally this would be its own crate so others could reuse it.

Note that ipmpsc needs both a mutex and a condition variable implementation; Boost appears to support both of those in its interprocess spin lock suite.

I don't expect I'll have time to work on this myself any time soon, but I'd be happy to review a PR.

Jun 30 '22 22:06 dicej

I'll see what I can do... no promises :-)

Jul 02 '22 04:07 remifontan

ipmpsc ipmpsc copied to clipboard

pthread_cond_timedwait error with EINVAL on macos

ipmpsc
ipmpsc copied to clipboard