Parallel Sys calls may cause a hang
While extending the Sys tests in multicoretests, I found that running parallel tests can cause them to hang indefinitely on macOS.
I've confirmed this on GHA CI (on both ARM64 and Intel runners) and locally on my old MacBook (its battery later physically gave up, so now I'm left to CI-golfing ⛳). Furthermore, I can trigger it on 5.3, 5.4, and trunk on either GHA macOS runner (I haven't tried 5.2 or lower at this point).
On GHA CI I have to press Cancel workflow to get it to stop. Locally however, my machine would become unresponsive and require a reboot, so consider yourself warned... ⚠️
Below follows a stand-alone reproducer. As usual it performs a "sequential prefix" (creating 2 dirs and 1 file), and then continues to run a combination of Sys.{mkdir,rename,rmdir} in two parallel domains, each of which should fail and raise an exception. This pattern is then repeated in outer loops to consistently trigger the issue.
let sandbox_root = "_sandbox"

let init_sut () =
  try Sys.mkdir sandbox_root 0o700
  with Sys_error msg when msg = sandbox_root ^ ": File exists" -> ()

let cleanup _ =
  Sys.command ("rm -r " ^ Filename.quote sandbox_root) |> ignore

let mkfile filepath =
  let flags = [Open_wronly; Open_creat; Open_excl] in
  Out_channel.with_open_gen flags 0o666 filepath (fun _ -> ())

let stress_prop_par () =
  let sut = init_sut () in
  Sys.mkdir "_sandbox/hhh" 0o755;
  mkfile "_sandbox/hhh/iii";
  Sys.mkdir "_sandbox/hhh/hhh" 0o755;
  let barrier = Atomic.make 2 in
  let dom1 () =
    Atomic.decr barrier;
    while Atomic.get barrier <> 0 do Domain.cpu_relax() done;
    (try Sys.rename "_sandbox/hhh/hhh" "_sandbox"; assert false with _ -> ());
    (try Sys.rename "_sandbox/bbb" "_sandbox"; assert false with _ -> ());
    (try Sys.mkdir "_sandbox/hhh/iii/eee" 0o755; assert false with _ -> ());
  in
  let dom2 () =
    Atomic.decr barrier;
    while Atomic.get barrier <> 0 do Domain.cpu_relax() done;
    (try Sys.rename "_sandbox/hhh/iii" "_sandbox/iii/ccc"; assert false with _ -> ());
    (try Sys.mkdir "_sandbox/hhh/iii" 0o755; assert false with _ -> ());
    (try Sys.rmdir "_sandbox/hhh"; assert false with _ -> ());
    (try Sys.rmdir "_sandbox/hhh"; assert false with _ -> ());
  in
  let dom1 = Domain.spawn dom1 in
  let dom2 = Domain.spawn dom2 in
  Domain.join dom1;
  Domain.join dom2;
  cleanup sut

let rec repeat n prop input =
  if n=0 then () else (prop input; Printf.printf "#%!"; repeat (n-1) prop input)

let _ =
  let rep_count = 50 in (* No. of inner repetitions of the non-deterministic property *)
  for i=1 to 1000 do
    Printf.printf "\nIteration %i %!" i;
    repeat rep_count stress_prop_par () (* 50 times each *)
  done
On my Linux box this program runs and prints as expected:
$ dune exec src/sys/stm_tests.exe -- -v
Iteration 1 ##################################################
Iteration 2 ##################################################
Iteration 3 ##################################################
Iteration 4 ##################################################
Iteration 5 ##################################################
Iteration 6 ##################################################
Iteration 7 ##################################################
Iteration 8 ##################################################
Iteration 9 ##################################################
Iteration 10 ######################## [...]
Not so on macOS, unfortunately. Here are a couple of example links to macOS CI runs that hang instead:
- on macOS Intel with trunk https://github.com/ocaml-multicore/multicoretests/actions/runs/18198277138/job/51810241379
- on macOS ARM64 with trunk https://github.com/ocaml-multicore/multicoretests/actions/runs/18198277120/job/51810241257
I wonder if it would be possible to reproduce the issue without OCaml, in C, using pthreads and the standard POSIX calls for filesystem operations. That would indicate that the problem comes from macOS rather than OCaml.
This seems like a macOS kernel bug to me. FYI the call to Filename.quote isn’t sufficient if the filename is untrusted. You need a -- before the filename to prevent option injection.
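For illustration, here is a minimal sketch of two ways the cleanup step could sidestep option injection. Both function names (`cleanup_with_rm` and `remove_tree`) are hypothetical and not part of the reproducer above. The first keeps the shell call but adds `--`; the second avoids the shell entirely using only `Sys` functions. Note that `Sys.is_directory` follows symlinks, so the second variant assumes the tree contains no symlinks to directories.

```ocaml
(* Hypothetical variant of [cleanup]: still shells out, but "--" stops rm
   from treating a name that starts with '-' as an option. *)
let cleanup_with_rm dir =
  Sys.command ("rm -r -- " ^ Filename.quote dir) |> ignore

(* Hypothetical shell-free alternative: remove a directory tree recursively.
   Assumes no symlinks to directories inside [path]. *)
let rec remove_tree path =
  if Sys.is_directory path then begin
    Array.iter
      (fun entry -> remove_tree (Filename.concat path entry))
      (Sys.readdir path);
    Sys.rmdir path
  end else
    Sys.remove path
```

In the reproducer, either one could stand in for the body of `cleanup`, e.g. `let cleanup _ = remove_tree sandbox_root`. The shell-free version also removes one external process from the picture, which might help when narrowing down the hang.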