domainslib
Prominently document that `Mutex`, `Condition`, ... might not behave as expected with Domainslib
Thank you for this nice project. We found it quite helpful in our ongoing efforts to parallelize a fixpoint algorithm in OCaml.
A quick suggestion: It might be a good idea to prominently document that `Mutex` and `Condition` will not work out of the box as one might expect when combined with Domainslib. This will help people new to Multicore avoid going down a potentially time-consuming rabbit hole. (Apologies if there is such a remark somewhere, I re-checked and still did not find any).
Details (can be skipped by people familiar with the difference in behavior)
It took us quite a while to understand why our algorithm was not terminating and was sometimes throwing exceptions. We managed to extract this example:
    open Domainslib
    (* Alias so that both the T.* and Task.* calls below resolve. *)
    module T = Task

    let main () =
      let mutex = Mutex.create () in
      let pool = T.setup_pool ~num_domains:2 () in
      let task () =
        for _i = 0 to 1000 do
          Mutex.lock mutex;
          (* The mutex is held across the await; this is the problematic part. *)
          let work = T.async pool (fun () -> ()) in
          Task.await pool work;
          Mutex.unlock mutex
        done
      in
      Domainslib.Task.run pool (fun () ->
        let p = T.async pool (fun () -> task ()) in
        let p1 = T.async pool (fun () -> task ()) in
        let p2 = T.async pool (fun () -> task ()) in
        let p3 = T.async pool (fun () -> task ()) in
        Task.await pool p;
        Task.await pool p1;
        Task.await pool p2;
        Task.await pool p3);
      ()

    let _ = main ()
which will either crash with
michael@michael-XPS-13-9360:~/Documents/td-parallel$ _build/default/mutexproblem.exe
Locking thread different from unlocking thread
Fatal error: exception Sys_error("Mutex.unlock: Operation not permitted")
or deadlock.
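A likely explanation for both outcomes: `Task.await` can suspend the current task, after which the worker domain picks up another task, and the suspended task may later be resumed on a different domain. The `Mutex.unlock` can then run on a domain other than the one that locked, or a second task on the same domain can try to lock a mutex that its own domain already holds. The following is only a minimal sketch of one way to sidestep this, assuming the work can be restructured so that no Domainslib operation happens inside the critical section. The function name `run_counter` and the shared counter are hypothetical and only serve to make the result observable.

```ocaml
module T = Domainslib.Task

(* Hypothetical helper: same shape as the failing example, but the mutex is
   never held across Task.await. *)
let run_counter () =
  let mutex = Mutex.create () in
  let counter = ref 0 in
  let pool = T.setup_pool ~num_domains:2 () in
  let task () =
    for _i = 1 to 1000 do
      (* Await outside the critical section: the task may be suspended here
         and resumed on another domain without any lock being held. *)
      let work = T.async pool (fun () -> 1) in
      let n = T.await pool work in
      (* No Domainslib call between lock and unlock, so both run on the
         same domain and the holder never yields to another task. *)
      Mutex.lock mutex;
      counter := !counter + n;
      Mutex.unlock mutex
    done
  in
  let total =
    T.run pool (fun () ->
      let promises = List.init 4 (fun _ -> T.async pool task) in
      List.iter (T.await pool) promises;
      !counter)
  in
  T.teardown_pool pool;
  total
```

This does not make `Mutex` scheduler-aware; a contended `Mutex.lock` still blocks the whole domain for a short time. It only avoids holding the lock at a point where the task can be suspended.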
We also had a similar problem when we tried using a condition variable to wait until a certain number of tasks had reached a certain point: with n domains, it deadlocked as soon as n tasks had reached that point.
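The condition-variable deadlock fits the same picture: `Condition.wait` blocks the whole worker domain rather than just the task, so once every domain is blocked, no remaining task can run to reach the point and signal. As a rough sketch only, and assuming the computation can be split at the synchronization point, the barrier can be replaced by awaiting all promises of the first phase before spawning the second. The names `two_phase` and `num_tasks` and the arithmetic are made up for illustration.

```ocaml
module T = Domainslib.Task

(* Hypothetical helper: returns how many tasks completed phase 1 and the sum
   of the phase 2 results. *)
let two_phase ~num_tasks =
  let pool = T.setup_pool ~num_domains:2 () in
  let reached = Atomic.make 0 in
  let result =
    T.run pool (fun () ->
      (* Phase 1: every task runs up to the synchronization point. *)
      let phase1 =
        List.init num_tasks (fun i ->
          T.async pool (fun () -> Atomic.incr reached; i))
      in
      (* Awaiting all promises plays the role of the barrier. Task.await
         suspends only the awaiting task, so workers stay available. *)
      let xs = List.map (T.await pool) phase1 in
      let seen = Atomic.get reached in
      (* Phase 2: starts only after all phase 1 tasks have finished. *)
      let phase2 = List.map (fun x -> T.async pool (fun () -> x * 2)) xs in
      let ys = List.map (T.await pool) phase2 in
      (seen, List.fold_left ( + ) 0 ys))
  in
  T.teardown_pool pool;
  result
```

Because this works with any number of tasks relative to the number of domains, it avoids the case where n blocked tasks occupy all n domains.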
After looking into how Domainslib works, it of course becomes clear that one would have to use something akin to, e.g., https://github.com/ocaml-multicore/domain-local-await.
Yes, this has been a known issue for a long time. See issue #126 here and remark here, for example.
You mentioned domain-local-await. Yes, that currently works with Domainslib and Eio. I'm currently working on Picos, which aims to provide a more comprehensive and more widely accepted solution to interoperability and replace domain-local-await and domain-local-timeout. Picos already provides replacements for the Stdlib Mutex and Condition. Unfortunately, no existing scheduler (aside from the sample schedulers in the Picos package) currently provides full compatibility with Picos. Hopefully we'll get a chance at some point to rewrite the internals of Domainslib to use Picos.