Spurious Thread_local_storage.Not_set exception
I've been getting spurious Thread_local_storage.Not_set exceptions coming from dpool when doing IO with moonpool-io inside child fibers running on the same Moonpool_fib.main runner as the parent. My tests work fine most of the time, except for the occasional run that fails.
When it fails, however, I always get the same trace (on moonpool 0.7):
Thread 1 killed on uncaught exception Thread_local_storage.Not_set
Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14
That's unfortunate (and a bit weird, afaict literally all calls to TLS.get_exn are properly guarded!). Is there a way to get some repro/regression test? :-)
Also: try to opam pin https://github.com/c-cube/thread-local-storage.git/#feat/debug-tls-get, it should give you a better backtrace.
I've got a repro: I've learned that it happens even on a single fiber, and the exception isn't even being raised from main, just the backtrace being printed, which was enough to fail my cram tests.
let () =
  Printexc.record_backtrace true;
  let null = Moonpool_io.Unix.openfile "/dev/null" [ O_RDWR; O_CLOEXEC ] 0 in
  Fun.protect
    ~finally:(fun () -> Moonpool_io.Unix.close null)
    (fun () ->
      Moonpool_fib.main (fun _ ->
          let rec loop () =
            let buf = Bytes.create 1024 in
            match Moonpool_io.Unix.read null buf 0 (Bytes.length buf) with
            | 0 -> ()
            | read_n ->
              ignore (Moonpool_io.Unix.write null buf 0 read_n : int);
              loop ()
          in
          loop ()
        )
    )
Making a cram test for this file and running while dune test --force ; do :; done in bash eventually fails the test.
Doesn't print an exception:
$ ./main.exe
+ Thread 1 killed on uncaught exception Thread_local_storage.Not_set
+ Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
+ Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14
Also: try to opam pin https://github.com/c-cube/thread-local-storage.git/#feat/debug-tls-get, it should give you a better backtrace.
My opam 2.1.5 didn't like that:
[ERROR] Could not retrieve Could not extract archive:
Unknown archive type: /tmp/opam-259024-2a235b/thread-local-storage.git
Could you try adding a call / top-level side-effect
let () = Picos_io_select.configure () (* Can only be called once *)
or
let () = Picos_io_select.check_configured () (* Can be called multiple times *)
to the program such that it will be executed on the main thread before any threads or domains are spawned (and before using Moonpool_io)?
Either call will (among other things) configure signal masks so that the signal used by the Intr mechanism of Picos_io_select (SIGUSR2 by default) and SIGCHLD are not delivered to the current thread, nor, via signal-mask inheritance, to any other thread spawned later (the threads spawned by Picos_io_select will then handle those signals and manage the signal masks for them).
In the latest main version of Picos_io_select the program would give the error message:
not configured; please configure before spawning any threads or domains
instead of letting the Not_set exception propagate as in 0.5.0. Apologies for not handling that in 0.5.0! (The documentation of 0.5.0 says that the library will automatically configure with default arguments, but unfortunately that does not happen when using the Intr mechanism.)
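For concreteness, a minimal sketch of where such a call would go relative to the repro above; the placement, not the exact code, is the point here:
let () =
  (* very first top-level side effect: runs on the main thread before any
     threads or domains are spawned and before Moonpool_io is used *)
  Picos_io_select.check_configured ()

(* ... the repro above follows, unchanged ... *)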
BTW, for any new file descriptors that you create (and don't pass to other processes) it is recommended to put them into non-blocking mode. This should do it:
let null = Moonpool_io.Unix.openfile "/dev/null" [ O_RDWR; O_CLOEXEC ] 0 in
+ Moonpool_io.Unix.set_nonblock null;
Fun.protect
The underlying Picos_io.Unix library automatically handles EINTR.
Checking my tests again, I find it suspicious that the exception doesn't leak or crash any IO; it just prints that backtrace alongside the actual output and the tests otherwise terminate successfully. To me, this feels more like a stray print.
Could you try adding a call / top-level side-effect [...]
I just tried both calls with the above repro, but neither fixed the spurious backtraces.
BTW, for any new file descriptors that you create (and don't pass to other processes) it is recommended to put them into non-blocking mode.
Noted, although the software I was testing moonpool and picos with only uses the standard streams for now.
Hmm... I can't seem to reproduce the issue.
Here is the program I tried: moonpool_issue_35.ml
It runs fine on Ubuntu GitHub action:
(cd _build/default/test && ./moonpool_issue_35.exe)
Read...
Got 0
OK
It also produces the same output on my macOS laptop.
You should be able to clone the repo and run dune test.
It also runs just fine without the Picos_io_select.configure () call. Reading from /dev/null seems to just immediately return 0 and the read does not block. (Can you read from /dev/null? 🤔) However, you should call Picos_io_select.configure () before spawning any threads or domains.
Note that you should have your entire moonpool application inside the Moonpool_fib.main call. That call starts the scheduler. If you call functions that need a scheduler before that it can cause issues.
Also, Fun.protect should be avoided or used with care in programs using fibers, because it doesn't protect against cancelation. In this particular case Fun.protect should work fine, because Unix.close is not a cancelable operation, but I want to note this issue as it could cause surprises at some point.
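Applied to the earlier repro, a rough sketch of that restructuring might look like this (assuming the same Moonpool_io.Unix calls; this is not the exact code from moonpool_issue_35.ml):
let () =
  Printexc.record_backtrace true;
  Moonpool_fib.main (fun _ ->
      (* the descriptor is opened and closed inside the scheduler *)
      let null = Moonpool_io.Unix.openfile "/dev/null" [ O_RDWR; O_CLOEXEC ] 0 in
      Moonpool_io.Unix.set_nonblock null;
      Fun.protect
        ~finally:(fun () -> Moonpool_io.Unix.close null)
        (fun () ->
          let buf = Bytes.create 1024 in
          let rec loop () =
            match Moonpool_io.Unix.read null buf 0 (Bytes.length buf) with
            | 0 -> ()
            | read_n ->
              ignore (Moonpool_io.Unix.write null buf 0 read_n : int);
              loop ()
          in
          loop ()))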
@polytypic I just cloned your repo: if I run while dune test --force ; do :; done and wait past 5-10 seconds of expected output I get again:
Read...
Got 0
OK
Thread 1 killed on uncaught exception Thread_local_storage.Not_set
Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14
Sadly the loop doesn't stop on its own because the exit code is still 0; a cram test makes it easier to find.
It runs fine on Ubuntu GitHub action
This is why I mentioned these are spurious and ran these tests in a loop: the trigger isn't deterministic; sometimes it refuses to show up for a minute and then it shows up several times in a row.
@polytypic I've forked your repo and changed the test suite to make it easier. Depending on luck, it takes from a few seconds to a few minutes to reproduce:
$ time while dune test --force ; do :; done
File "test/test.t", line 1, characters 0-0:
diff --git a/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t b/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t.corrected
index e4b0053..4adc3e0 100644
--- a/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t
+++ b/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t.corrected
@@ -4,3 +4,6 @@ No backtraces:
Read...
Got 0
OK
+ Thread 1 killed on uncaught exception Thread_local_storage.Not_set
+ Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
+ Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14
real 0m12.307s
user 0m4.394s
sys 0m3.437s
Yes, it seems a thread is created in moonpool_dpool.ml in a static initializer:
(* special case for main domain: we start a worker immediately *)
let () =
  assert (Domain_.is_main_domain ());
  let w = { th_count = Atomic_.make 1; q = Bb_queue.create () } in
  (* thread that stays alive *)
  print_endline "moonpool_dpool.ml";
  ignore (Thread.create (fun () -> work_ 0 w) () : Thread.t);
  domains_.(0) <- Lock.create (Some w, None)
@c-cube Would it be possible to delay that to happen only when you call something from Moonpool? (Personally I think one should avoid static initializers like the plague. They always come back to bite you at some point.)
Alternatively, do you really want that thread to receive POSIX signals? Maybe you could just block all the signals?
modified src/dpool/moonpool_dpool.ml
@@ -148,7 +148,14 @@ let () =
assert (Domain_.is_main_domain ());
let w = { th_count = Atomic_.make 1; q = Bb_queue.create () } in
(* thread that stays alive *)
- ignore (Thread.create (fun () -> work_ 0 w) () : Thread.t);
+ ignore
+ (Thread.create
+ (fun () ->
+ Thread.sigmask SIG_BLOCK [ Sys.sigusr2 ] |> ignore;
+ work_ 0 w)
+ ()
+ : Thread.t);
domains_.(0) <- Lock.create (Some w, None)
let[@inline] max_number_of_domains () : int = Array.length domains_
The above prevents the issue, but blocking just SIGUSR2 is silly. You could probably just block all the signals you are not specifically interested in. Which signals would you like to receive in that thread?
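As a rough illustration of that idea, here is a self-contained sketch of a helper that spawns a thread with a broad signal mask; the list of blocked signals is an assumption for the sketch, not what moonpool ends up blocking:
(* Spawn a worker thread with a broad signal mask; only signals the worker
   genuinely needs to receive should be left out of [blocked]. *)
let spawn_with_signals_blocked f x : Thread.t =
  Thread.create
    (fun () ->
      let blocked =
        [ Sys.sigusr1; Sys.sigusr2; Sys.sigchld; Sys.sigterm; Sys.sigint;
          Sys.sighup; Sys.sigalrm; Sys.sigpipe ]
      in
      ignore (Thread.sigmask Unix.SIG_BLOCK blocked : int list);
      f x)
    ()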
I don't know how to run this static initializer from the main domain in a lazy way, alas. If I make it into some sort of lazy block it might be called from the wrong domain :-(. I fully blame the design of Domain for this, I'd rather not have to write the pool at all honestly.
Blocking common signals from these threads seems a lot easier.
Blocking signals seems to work for me, can you confirm @amongonz ?
@c-cube Not entirely. I pinned moonpool to simon/fix-35 and retried my tests, but the proposed fix kinda turns it into a heisenbug: while the above repro has looped for almost an hour without triggering, the test suite which originally revealed it still triggers it within 10-50 seconds of looping (it used to take 1-10 s). So the fix made the error harder to trigger, but it's still there.
I'm not sure how to make a stronger repro though, maybe giving it an actual workload. [^1] Luckily I was just trying out moonpool/picos, so I'm not personally in a hurry to find a fix.
[^1]: The repro originally used stdin/stdout, then I switched both to /dev/null in an attempt to make the cram test simpler, but I realised too late that this leaves the read_n branch unused. Whoops.
Fun fact: while trying to fence in the source of our mystery exception, I realised that if I modify moonpool_dpool.ml here I can consistently catch it when it happens, despite loop not using TLS at all (I think):
- loop ()
+ try loop ()
+ with Thread_local_storage.Not_set as exn ->
+ Format.eprintf "MYSTERY: %s\n%s@." (Printexc.to_string exn)
+ (Printexc.get_backtrace ());
+ raise exn
But wrapping the body of loop instead fails to catch it before it kills the thread [^2]. Is it coming from a safe point? I'm not familiar with the codebase, but could some signal handler be making an unguarded access to a TLS key at the wrong time, e.g. a use-after-free?
[^2]: Depending on where I place the try/with I've also seen it blamed on unix.ml.
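For what it's worth, here is a standalone illustration (not moonpool or picos code, just the plain OCaml runtime plus the unix library) of the mechanism being hypothesized: an exception raised from an OCaml signal handler surfaces at a safe point inside whatever code happens to be running, so a try/with around the call site catches it even though none of the called code raises it explicitly.
exception From_handler

let () =
  Sys.set_signal Sys.sigusr1 (Sys.Signal_handle (fun _ -> raise From_handler));
  let rec loop n acc = if n <= 0 then acc else loop (n - 1) (n :: acc) in
  match
    (* queue a signal for ourselves; the OCaml-level handler only runs at a
       later safe point, somewhere in the code below *)
    Unix.kill (Unix.getpid ()) Sys.sigusr1;
    ignore (loop 1_000_000 [] : int list);
    "finished without interruption"
  with
  | msg -> prerr_endline msg
  | exception From_handler ->
    prerr_endline "the handler's exception surfaced asynchronously"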
I'm not aware of any signal handling in moonpool itself (besides the new one). Picos does have some signal handling though.
I have absolutely no idea where this is coming from :grimacing: . Async exceptions are a possibility but even then, where does the TLS access come from?
One unguarded TLS access, which slipped into Picos 0.5.0, is in a signal handler. Have you also pinned to use the latest main of Picos? The latest main will raise with an error message instead of letting the Not_set propagate.
If it is the signal handler, then it means that some thread / domain has been spawned before the signal mask has been set. At what point is the thread / domain running the loop in moonpool_dpool.ml spawned?
@polytypic I had not pinned picos. To check this isn't fixed already to some degree, I've just pinned all picos and moonpool packages to the main branch, reinstalled them on opam, rebuilt my project from a clean state and tested again: I still hit this same output, possibly even sooner now (which is great for debugging!)
So, the way I found the earlier Thread.create call happening before the signal mask was set is that I simply added print_endline calls to all the places where threads were spawned:
Unstaged changes (6)
modified src/core/fifo_pool.ml
@@ -164,6 +164,7 @@ let create ?on_init_thread ?on_exit_thread ?on_exn ?around_task ?num_threads
(* function called in domain with index [i], to
create the thread and push it into [receive_threads] *)
let create_thread_in_domain () =
+ print_endline "fifo_pool.ml";
let st = { idx = i; dom_idx; st = pool } in
let thread = Thread.create (WL.worker_loop ~ops:worker_ops) st in
(* send the thread from the domain back to us *)
modified src/core/moonpool.ml
@@ -3,6 +3,7 @@ open Types_
exception Shutdown = Runner.Shutdown
let start_thread_on_some_domain f x =
+ print_endline "moonpool.ml";
let did = Random.int (Domain_pool_.max_number_of_domains ()) in
Domain_pool_.run_on_and_wait did (fun () -> Thread.create f x)
modified src/core/ws_pool.ml
@@ -310,6 +310,7 @@ let create ?(on_init_thread = default_thread_init_exit_)
(* function called in domain with index [i], to
create the thread and push it into [receive_threads] *)
let create_thread_in_domain () =
+ print_endline "ws_pool.ml";
let thread = Thread.create (WL.worker_loop ~ops:worker_ops) st in
(* send the thread from the domain back to us *)
Bb_queue.push receive_threads (idx, thread)
modified src/dpool/moonpool_dpool.ml
@@ -148,7 +148,14 @@ let () =
assert (Domain_.is_main_domain ());
let w = { th_count = Atomic_.make 1; q = Bb_queue.create () } in
(* thread that stays alive *)
- ignore (Thread.create (fun () -> work_ 0 w) () : Thread.t);
+ print_endline "moonpool_dpool.ml";
+ ignore
+ (Thread.create
+ (fun () ->
+ Thread.sigmask SIG_BLOCK [ Sys.sigusr2 ] |> ignore;
+ work_ 0 w)
+ ()
+ : Thread.t);
domains_.(0) <- Lock.create (Some w, None)
let[@inline] max_number_of_domains () : int = Array.length domains_
modified src/fib/dune
@@ -2,7 +2,7 @@
(name moonpool_fib)
(public_name moonpool.fib)
(synopsis "Fibers and structured concurrency for Moonpool")
- (libraries moonpool picos)
+ (libraries moonpool picos picos_io.select)
(enabled_if
(>= %{ocaml_version} 5.0))
(flags :standard -open Moonpool_private -open Moonpool)
modified src/private/domain_.ml
@@ -19,7 +19,11 @@ let recommended_number () = 1
type t = Thread.t
let get_id (self : t) : int = Thread.id self
-let spawn f : t = Thread.create f ()
+
+let spawn f : t =
+ print_endline "domain.ml";
+ Thread.create f ()
+
let relax () = Thread.yield ()
let join = Thread.join
let is_main_domain () = true
And then I also added a print_endline call to the point where Picos_io_select.configure () was called. You could also add a print_endline to the point where moonpool now blocks signals. You want to ensure that neither threads nor domains are spawned before that point.
I've just pinned all picos and moonpool packages to the main branch [...] I still hit this same output
So, if you still get the Not_set exception, that would seem to indicate that the source of TLS.get_exn is not in Picos, but somewhere inside Moonpool, perhaps.
the test suite which originally revealed it still triggers it within 10-50 seconds of looping
Would it be possible to get access to that original test suite or something trimmed from it that triggers the issue? I could try to find the place where a thread/domain is spawned before the signal mask is set and/or identify the place where TLS.get_exn is being called.
Would it be possible to get access to that original test suite or something trimmed from it that triggers the issue?
I'll try to share a more faithful repro shortly, but I suspect my project tests fail sooner simply because they do more work per test and run more tests per dune test invocation.
The "test suite" is just a few cram tests for a barebones LSP server I'm working on: your classic read loop reading bytes from stdin into a buffer until it can parse a message, then replying on stdout. I'm currently trying out fiber libraries to implement more complex upcoming features, but for now the IO and tasking is really that simple.
@amongonz Any news on a potential repro case? I could take a look.
@polytypic I've fixed the repro in https://github.com/amongonz/picos-issue to perform some "actual work" (copy 1 MiB from /dev/zero to /dev/null in chunks of up to 1 KiB) and added a script to easily run the above test loop.
Once the test executable is built, running run-test-loop.sh consistently triggers the cram test failure within seconds; all moonpool and picos packages are pinned to their main branches.
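For reference, a rough sketch of that kind of workload, following the shape of the earlier repro (illustrative only, not the exact code in the linked repo):
let () =
  Moonpool_fib.main (fun _ ->
      let src = Moonpool_io.Unix.openfile "/dev/zero" [ O_RDONLY; O_CLOEXEC ] 0 in
      let dst = Moonpool_io.Unix.openfile "/dev/null" [ O_WRONLY; O_CLOEXEC ] 0 in
      let buf = Bytes.create 1024 in
      let rec copy remaining =
        if remaining > 0 then (
          let n =
            Moonpool_io.Unix.read src buf 0 (min remaining (Bytes.length buf))
          in
          if n > 0 then (
            ignore (Moonpool_io.Unix.write dst buf 0 n : int);
            copy (remaining - n)))
      in
      (* copy 1 MiB in chunks of up to 1 KiB *)
      copy (1024 * 1024);
      Moonpool_io.Unix.close src;
      Moonpool_io.Unix.close dst)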
I'm killing moonpool_io, so closing this, sorry.