Parallel Sys calls may cause a hang

Open jmid opened this issue 2 months ago • 2 comments

When extending the Sys tests in multicoretests, I found that parallel tests can hang indefinitely on macOS.

I've confirmed this both on GHA CI (on ARM64 and Intel runners) and locally on my old MacBook (its battery has since physically given up, so now I'm left to CI-golfing ⛳). I can furthermore trigger it on 5.3, 5.4, and trunk on either GHA macOS runner (I haven't tried 5.2 or lower at this point).

On GHA CI I have to press Cancel workflow to get it to stop. Locally, however, my machine became unresponsive and required a reboot, so consider yourself warned... ⚠️

Below follows a stand-alone reproducer. As usual it performs a "sequential prefix" (creating 2 dirs and 1 file), and then continues to run a combination of Sys.{mkdir,rename,rmdir} in two parallel domains, each of which should fail and raise an exception. This pattern is then repeated in outer loops to consistently trigger the issue.

let sandbox_root = "_sandbox"

let init_sut () =
  try Sys.mkdir sandbox_root 0o700
  with Sys_error msg when msg = sandbox_root ^ ": File exists" -> ()

let cleanup _ =
  Sys.command ("rm -r " ^ Filename.quote sandbox_root) |> ignore

let mkfile filepath =
  let flags = [Open_wronly; Open_creat; Open_excl] in
  Out_channel.with_open_gen flags 0o666 filepath (fun _ -> ())

let stress_prop_par () =
  let sut = init_sut () in

  Sys.mkdir "_sandbox/hhh" 0o755;
  mkfile "_sandbox/hhh/iii";
  Sys.mkdir "_sandbox/hhh/hhh" 0o755;

  let barrier = Atomic.make 2 in
  let dom1 () =
    Atomic.decr barrier;
    while Atomic.get barrier <> 0 do Domain.cpu_relax() done;
    (try Sys.rename "_sandbox/hhh/hhh" "_sandbox"; assert false with _ -> ());
    (try Sys.rename "_sandbox/bbb" "_sandbox"; assert false with _ -> ());
    (try Sys.mkdir "_sandbox/hhh/iii/eee" 0o755; assert false with _ -> ());
  in
  let dom2 () =
    Atomic.decr barrier;
    while Atomic.get barrier <> 0 do Domain.cpu_relax() done;
    (try Sys.rename "_sandbox/hhh/iii" "_sandbox/iii/ccc"; assert false with _ -> ());
    (try Sys.mkdir "_sandbox/hhh/iii" 0o755; assert false with _ -> ());
    (try Sys.rmdir "_sandbox/hhh"; assert false with _ -> ());
    (try Sys.rmdir "_sandbox/hhh"; assert false with _ -> ());
  in
  let dom1 = Domain.spawn dom1 in
  let dom2 = Domain.spawn dom2 in
  Domain.join dom1;
  Domain.join dom2;
  cleanup sut

let rec repeat n prop input =
  if n=0 then () else (prop input; Printf.printf "#%!"; repeat (n-1) prop input)

let _ =
  let rep_count = 50 in (* No. of inner repetitions of the non-deterministic property *)
  for i=1 to 1000 do
    Printf.printf "\nIteration %i %!" i;
    repeat rep_count stress_prop_par () (* 50 times each *)
  done

On my Linux box this program runs and prints as expected:

$ dune exec src/sys/stm_tests.exe -- -v
                                    
Iteration 1 ##################################################
Iteration 2 ##################################################
Iteration 3 ##################################################
Iteration 4 ##################################################
Iteration 5 ##################################################
Iteration 6 ##################################################
Iteration 7 ##################################################
Iteration 8 ##################################################
Iteration 9 ##################################################
Iteration 10 ######################## [...]

Not so on macOS, unfortunately. Here are a couple of example links to macOS CI runs that hang instead:

  • on macOS Intel with trunk https://github.com/ocaml-multicore/multicoretests/actions/runs/18198277138/job/51810241379
  • on macOS ARM64 with trunk https://github.com/ocaml-multicore/multicoretests/actions/runs/18198277120/job/51810241257

jmid Oct 02 '25 16:10

I wonder if it would be possible to reproduce the issue without OCaml, in C, using pthreads and standard POSIX calls for the filesystem operations. That would indicate that the problem comes from macOS rather than OCaml.

gasche Oct 02 '25 17:10

This seems like a macOS kernel bug to me. FYI the call to Filename.quote isn’t sufficient if the filename is untrusted. You need a -- before the filename to prevent option injection.

DemiMarie Oct 21 '25 23:10
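
For reference, the `--` suggestion could look roughly like the sketch below. It is only an illustration, not a change to the reproducer itself: `rm_command` and `cleanup_safe` are hypothetical names that do not appear in the original program, and the sketch assumes a Unix-like system whose `rm` follows the usual convention that `--` ends option parsing.

```ocaml
(* Hypothetical helper (not in the original reproducer): builds the shell
   command used for cleanup. The "--" marks the end of options, so a path
   that starts with '-' is treated as an operand rather than as a flag. *)
let rm_command path =
  "rm -r -- " ^ Filename.quote path

(* Hypothetical variant of [cleanup] that takes the path explicitly. *)
let cleanup_safe path =
  Sys.command (rm_command path) |> ignore

let () =
  (* Show the command that would be run for a hostile-looking path. *)
  print_endline (rm_command "-rf")
```

`Filename.quote` still handles spaces and shell metacharacters. The `--` covers the separate case where the quoted name would otherwise be read by `rm` as an option. With the fixed `sandbox_root` used in the reproducer neither problem arises, so this only matters once the path comes from elsewhere.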