multicoretests
multicoretests copied to clipboard
[ocaml5-issue] `Out_channel.flush` can cause `Syserror` when used in parallel
A Cygwin 5.2 run triggered when merging #443 to main found an unexpected counterexample to STM Out_channel parallel:
https://github.com/ocaml-multicore/multicoretests/actions/runs/8361686850/job/22890359851
random seed: 292949458
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Out_channel test sequential
[ ] 0 0 0 0 / 1000 0.0s STM Out_channel test sequential (generating)
[✓] 1000 0 0 1000 / 1000 2.9s STM Out_channel test sequential
[ ] 0 0 0 0 / 1000 0.0s STM Out_channel test parallel
[ ] 620 0 0 542 / 1000 57.1s STM Out_channel test parallel
[ ] 1116 0 0 966 / 1000 117.5s STM Out_channel test parallel (shrinking: 1.0011)
[✗] 1117 0 1 966 / 1000 129.8s STM Out_channel test parallel
--- Failure --------------------------------------------------------------------
Test STM Out_channel test parallel failed (1 shrink steps):
|
Output_substring ("\225\185!\188Q\142\138", 0, 96)
Output_byte 36
Output_bytes "\158\199e\220\253\148\... (truncated)
Flush
Close
Output ("Z\197j", 3, 3)
Open_text
Is_buffered
Set_binary_mode false
Output_substring ("{8)g", 18, 1)
Length
Close_noerr
Open_text
Set_buffered true
|
.----------------------------------------------------.
| |
Close_noerr Set_binary_mode true
Open_text Pos
Output_substring ("\to\128N\186", 7, 6) Set_binary_mode true
Output_char '9' Output_byte 54
Set_binary_mode true Length
Output_bytes "\182U" Flush
Output_char '5' Output_substring ("\014\236\187\146^... (truncated)
Output ("(|$\217\203\206\162t\235", 18, 4) Set_binary_mode true
Pos Is_buffered
Set_binary_mode false Output_byte 1
+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Messages for test STM Out_channel test parallel:
Results incompatible with linearized model
|
Output_substring ("\225\185!\188Q\142\138", 0, 96) : Error (Invalid_argument("output_substring"))
Output_byte 36 : Ok (())
Output_bytes "\158\199e\220\253\148\... (truncated) : Ok (())
Flush : Ok (())
Close : Ok (())
Output ("Z\197j", 3, 3) : Error (Invalid_argument("output"))
Open_text : Ok (())
Is_buffered : Ok (true)
Set_binary_mode false : Ok (())
Output_substring ("{8)g", 18, 1) : Error (Invalid_argument("output_substring"))
Length : Ok (0)
Close_noerr : Ok (())
Open_text : Ok (())
Set_buffered true : Ok (())
|
.--------------------------------------------------------------------------------------------------.
| |
Close_noerr : Ok (()) Set_binary_mode true : Ok (())
Open_text : Ok (()) Pos : Ok (0)
Output_substring ("\to\128N\186", 7, 6) : Error (Invalid_argument("output_substring")) Set_binary_mode true : Ok (())
Output_char '9' : Ok (()) Output_byte 54 : Ok (())
Set_binary_mode true : Ok (()) Length : Ok (0)
Output_bytes "\182U" : Ok (()) Flush : Error (Sys_error("Bad file descriptor"))
Output_char '5' : Ok (()) Output_substring ("\014\236\187\146^... (truncated) : Ok (())
Output ("(|$\217\203\206\162t\235", 18, 4) : Error (Invalid_argument("output")) Set_binary_mode true : Error (Sys_error("Bad file descriptor"))
Pos : Ok (4) Is_buffered : Ok (true)
Set_binary_mode false : Ok (()) Output_byte 1 : Ok (())
================================================================================
failure (1 tests failed, 0 tests errored, ran 2 tests)
File "src/io/dune", line 40, characters 7-16:
40 | (name stm_tests)
^^^^^^^^^
(cd _build/default/src/io && ./stm_tests.exe --verbose)
Command exited with code 1.
AFAICS, the failure can be explained by Flush in the "right leg" causing Sys_error("Bad file descriptor").
This goes against the specification, which says:
val close : t -> unit
Close the given channel, flushing all buffered write operations. Output functions raise a Sys_error exception when they are applied to a closed output
channel, except Out_channel.close and Out_channel.flush , which do nothing when applied to an already closed channel. [...]
This is currently captured in the STM Out_channel test as:
| Flush, Res ((Result (Unit,Exn),_), r) ->
(match s,r with
| Closed, Error (Sys_error _) -> false (* should not generate an error *)
| Closed, Ok () -> true
| Open _, Ok () -> true
| _ -> false)
Just observed this again - now on FreeBSD 5.1 trying out the latest QCheck memory usage improvement https://github.com/ocaml-multicore/multicoretests/tree/try-qcheck-mem-improvement https://ocaml-multicoretests.ci.dev:8100/job/2024-06-28/125000-ci-ocluster-build-caa2d4
random seed: 362318939
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Out_channel test sequential
[ ] 0 0 0 0 / 1000 0.0s STM Out_channel test sequential (generating)
[✓] 1000 0 0 1000 / 1000 1.2s STM Out_channel test sequential
[ ] 0 0 0 0 / 1000 0.0s STM Out_channel test parallel
[✗] 103 0 1 90 / 1000 4.7s STM Out_channel test parallel
--- Failure --------------------------------------------------------------------
Test STM Out_channel test parallel failed (1 shrink steps):
|
Output ("\011\230\197\204\173\164'\024", 9, 1)
Output_bytes "\159\238+\152]\006*\24... (truncated)
Close_noerr
Open_text
Flush
Flush
|
.----------------------------------------------------.
| |
Length Close_noerr
Output_char 'i' Output ("Dc\015\194*x\159YlA\b\247\1... (truncated)
Length Open_text
Output_byte 8 Output_char 'G'
Flush Set_binary_mode false
+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Messages for test STM Out_channel test parallel:
Results incompatible with linearized model
|
Output ("\011\230\197\204\173\164'\024", 9, 1) : Error (Invalid_argument("output"))
Output_bytes "\159\238+\152]\006*\24... (truncated) : Ok (())
Close_noerr : Ok (())
Open_text : Ok (())
Flush : Ok (())
Flush : Ok (())
|
.-----------------------------------------------------------------------------------------------.
| |
Length : Ok (0) Close_noerr : Ok (())
Output_char 'i' : Ok (()) Output ("Dc\015\194*x\159YlA\b\247\1... (truncated) : Error (Sys_error("Bad file descriptor"))
Length : Ok (0) Open_text : Ok (())
Output_byte 8 : Ok (()) Output_char 'G' : Ok (())
Flush : Error (Sys_error("Bad file descriptor")) Set_binary_mode false : Ok (())
================================================================================
failure (1 tests failed, 0 tests errored, ran 2 tests)
File "src/io/dune", line 40, characters 7-16:
40 | (name stm_tests)
^^^^^^^^^
(cd _build/default/src/io && ./stm_tests.exe --verbose)
Command exited with code 1.
I've created a standalone reproducer for this one:
let test () =
let path = Filename.temp_file "stm-" "" in
let channel = Atomic.make (Out_channel.open_text path) in
(* First, a bit of one-domain channel activity *)
Out_channel.output_byte (Atomic.get channel) 1;
(try Out_channel.flush (Atomic.get channel) with (Sys_error _) -> assert false);
let wait = Atomic.make true in
(* Domain 1 closes-opens-outputs repeatedly *)
let d1 = Domain.spawn (fun () ->
while Atomic.get wait do Domain.cpu_relax() done;
for _ = 1 to 50 do
Out_channel.close (Atomic.get channel);
Atomic.set channel (Out_channel.open_text path);
Out_channel.output_byte (Atomic.get channel) 1;
done;
) in
(* Domain 2 calls flush and pos repeatedly *)
let d2 = Domain.spawn (fun () ->
Atomic.set wait false;
(*
"Output functions raise a Sys_error exception when they are applied to a closed output channel,
except Out_channel.close and Out_channel.flush , which do nothing when applied to an already closed channel."
*)
for _ = 1 to 50 do
(try Out_channel.flush (Atomic.get channel)
with (Sys_error msg) -> Printf.printf "Out_channel.flush raised Sys_error %S\n%!" msg; assert false);
ignore (Out_channel.pos (Atomic.get channel));
done;
) in
let () = Domain.join d1 in
let () = Domain.join d2 in
(* Please leave the torture chamber nice and clean as you found it *)
(try Out_channel.close (Atomic.get channel) with Sys_error _ -> ());
Sys.remove path
let _ =
for i = 1 to 5_000 do
if i mod 250 = 0 then Printf.printf "#%!";
test ()
done
With it, locally on Linux I am able to trigger the assertion failure on 5.1.0+fp and 5.2.0+fp switches.
The branch https://github.com/ocaml-multicore/multicoretests/tree/flushtest-repro-focus triggers a focused test of it on the CI.
Across all runs, this reproducer only triggered in multicoretests-ci where flush may trigger a Sys_error in
- linux-arm64-5.1 https://ocaml-multicoretests.ci.dev:8100/job/2024-08-09/141442-ci-ocluster-build-d1fac4
- linux-s390x-5.1 https://ocaml-multicoretests.ci.dev:8100/job/2024-08-09/141442-ci-ocluster-build-96b82e
In trying to understand this better, I ran dune exec src/io/stm_tests.exe -- -v locally under a 5.2.0+fp switch, where it fails spectacularly - even the sequential one:
$ dune exec src/io/stm_tests.exe -- -v
random seed: 242921126
generated error fail pass / total time test name
[✗] 20 0 1 19 / 1000 0.1s STM Out_channel test sequential
[✗] 8 0 1 7 / 1000 5.9s STM Out_channel test parallel
--- Failure --------------------------------------------------------------------
Test STM Out_channel test sequential failed (19 shrink steps):
Close_noerr
Output_char 'H'
Output_char ']'
+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Messages for test STM Out_channel test sequential:
Results incompatible with model
Close_noerr : Ok (())
Output_char 'H' : Error (Sys_error("Bad file descriptor"))
Output_char ']' : Ok (())
--- Failure --------------------------------------------------------------------
Test STM Out_channel test parallel failed (45 shrink steps):
|
Close
Output_byte 4
Output_char 'E'
|
.---------------------.
| |
+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Messages for test STM Out_channel test parallel:
Results incompatible with linearized model
|
Close : Ok (())
Output_byte 4 : Error (Sys_error("Bad file descriptor"))
Output_char 'E' : Ok (())
|
.---------------------------------------------------------.
| |
================================================================================
failure (2 tests failed, 0 tests errored, ran 2 tests)
The same test however completes under the following two which seems to indicate a fp-related bug that may have since been fixed on trunk:
- 5.2.0 (no fp)
- 5.3.0+fp
A standalone reproducer:
let test () =
let path = Filename.temp_file "stm-" "" in
let channel = Out_channel.open_text path in
Out_channel.close_noerr channel;
(try Out_channel.output_char channel 'A'; assert false with (Sys_error _) -> ()); (* Error (Sys_error("Bad file descriptor")) *)
(try Out_channel.output_char channel 'B'; assert false with (Sys_error _) -> ()); (* Succeeds *)
Sys.remove path
let _ =
for i = 1 to 10_000 do
if i mod 250 = 0 then Printf.printf "#%!";
test ()
done
To summarize
- the Cygwin
5.2.0failure was actually in5.2.0~alpha1. It hasn't shown up since #12678 was merged (into 5.2.0~beta1) - the FreeBSD 5.1 failure was in
5.1.1- not including #12678 - linux-arm64-5.1 - same
- linux-s390x-5.1 - same
This means that we are only missing an explanation of the above 5.2.0+fp standalone reproducers to close this issue.
I ran into this too when running the test suite with Jane Street's branch of the OCaml compiler. My explanation seems to apply equally to the most recent head of ocaml/ocaml, so I thought I'd share it here.
Explanation: the STM model expects that Out_channel.flush will never raise if interleaved with Out_channel.close. However, Out_channel.flush can raise if you get an unlucky concurrent interleaving with Out_channel.close. In particular:
Out_channel.flushis called before the close happens, and this guard is satisfied (given the channel is still open): https://github.com/ocaml/ocaml/blob/fb5e30cb40603c75b5d19987fcd7911cc003d077/runtime/io.c#L822-L823check_pendingincaml_flush_partialrelinquishesOut_channel.flush’s lock on the channel: https://github.com/ocaml/ocaml/blob/fb5e30cb40603c75b5d19987fcd7911cc003d077/runtime/io.c#L118-L132Out_channel.closeis called and completes, settingchannel->buffer + 1 = channel->curr(https://github.com/ocaml/ocaml/blob/fb5e30cb40603c75b5d19987fcd7911cc003d077/runtime/io.c#L721-L730)- The execution of flush continues.
caml_flush_partialis called, and this check indicates that there is exactly 1 byte to write (given the previous bullet point). - The write fails, and so this code raises
Sys_error
If my explanation is correct, then maybe caml_flush_partial should check if the channel is closed before attempting to write to it.
(The issue is probably more widespread than suggested by my last message: I see that caml_flush is called by lots of functions, not just caml_ml_flush. e.g. outputting a string. All of these operations can thus relinquish the channel lock midoperation.)
Thanks for sharing! :pray: Your explanation makes sense to me. I can't reproduce the error ATM though... I'll need to look more into it.
I've now created a reproducer and filed this upstream: https://github.com/ocaml/ocaml/issues/13586