multicoretests
[ocaml5-issue] Deadlock in Dynlink test on Cygwin+MinGW+MSVC
Deadlock observed in a run on trunk Cygwin: https://github.com/shym/multicoretests/actions/runs/4367430739/jobs/7638729550#step:21:764
Wed, 08 Mar 2023 22:46:19 GMT random seed: 366632243
Wed, 08 Mar 2023 22:46:19 GMT generated error fail pass / total time test name
Wed, 08 Mar 2023 22:46:19 GMT
Wed, 08 Mar 2023 22:46:19 GMT [ ] 0 0 0 0 / 100 0.0s negative Lin DSL Dynlink test with Domain
Wed, 08 Mar 2023 22:47:55 GMT [ ] 0 0 0 0 / 100 0.0s negative Lin DSL Dynlink test with Domain (generating)
Wed, 08 Mar 2023 22:49:08 GMT [ ] 0 0 0 0 / 100 96.0s negative Lin DSL Dynlink test with Domain (shrinking: 1)
Thu, 09 Mar 2023 00:38:34 GMT [ ] 0 0 0 0 / 100 168.9s negative Lin DSL Dynlink test with Domain (shrinking: 3)
Thu, 09 Mar 2023 00:38:34 GMT Error: The operation was canceled.
As the code paths are completely different between Cygwin (which provides a dlopen()) and Windows (which uses flexdll), this is probably not related to #290.
Ah - actually, we do still use flexdll on Cygwin, so I expect this is related.
Another point of interest, related but maybe involving something else: the Dynlink test on Cygwin can end abruptly (as if it had segfaulted) while reporting no error ($? is 0). There might be another issue there, in the way some errors get dropped? :thinking:
We are seeing several Cygwin timeouts during the Dynlink test, which may well be this bug being triggered:
- https://github.com/ocaml-multicore/multicoretests/actions/runs/6049285277/job/16416279067 5.1
- https://github.com/ocaml-multicore/multicoretests/actions/runs/5950931125/job/16139719760 5.1
- https://github.com/ocaml-multicore/multicoretests/actions/runs/5942185755/job/16114588868 5.1
- https://github.com/ocaml-multicore/multicoretests/actions/runs/6042013562/job/16396215009 trunk
Seen again on Cygwin 5.1.0~rc2 when merging #389 into main
https://github.com/ocaml-multicore/multicoretests/actions/runs/6077015167/job/16485985404
random seed: 153981
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s negative Lin DSL Dynlink test with Domain
[ ] 0 0 0 0 / 100 0.0s negative Lin DSL Dynlink test with Domain (generating)
Error: The operation was canceled.
This triggered again on the 0.3 branch for Cygwin trunk part1 https://github.com/ocaml-multicore/multicoretests/actions/runs/6481561492/job/17599326796
random seed: 406303381
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s negative Lin DSL Dynlink test with Domain
[ ] 0 0 0 0 / 100 0.0s negative Lin DSL Dynlink test with Domain (generating)
Error: The operation was canceled.
I've spent some time creating a reproducer for this:
libB.ml:
let value = 34
repro.ml:
(* Load a plugin, ignoring the error raised when it is already loaded. *)
let loadfile f =
  try Dynlink.loadfile (Dynlink.adapt_filename f)
  with Dynlink.Error (Dynlink.Module_already_loaded _) -> ()

(* Spawn two domains that load the same plugin at (roughly) the same time. *)
let dont_crash () =
  let wait = Atomic.make true in
  let dom1 = Domain.spawn (fun () ->
      (* Spin until dom2 is running, so that both loads overlap. *)
      while Atomic.get wait do Domain.cpu_relax() done;
      loadfile "libB.cmxs") in
  let dom2 = Domain.spawn (fun () ->
      Atomic.set wait false;
      loadfile "libB.cmxs") in
  let _ = Domain.join dom1 in
  let _ = Domain.join dom2 in
  ()

let _ =
  for i=1 to 1000 do
    Printf.printf "%i %!" i;
    dont_crash ()
  done
Makefile:
all:
	ocamlopt -g -shared libB.ml -o libB.cmxs
	ocamlopt -g -I +dynlink dynlink.cmxa repro.ml -o repro.exe

clean:
	rm -f libB.cmi libB.cmx libB.o libB.cmxs repro.cmi repro.cmx repro.o repro.exe
On MinGW (5.1.0, 5.1.1, 5.2.0~alpha1, trunk) this causes a range of different errors
- hangs
- segfaults
- early exits
- various Dynlink.Errors (bad object, not an OCaml plugin, missing frametable for libB, ...)
On MinGW 5.0.0 the errors trigger more rarely (but can still occur). On Cygwin I've observed similar behaviour (no segfaults though). On Linux I've not been able to trigger the issue.
I've found this: https://github.com/ocaml/flexdll/issues/120 which ticks the right boxes, as I believe flexdll is involved on both MinGW and Cygwin (according to David's remark above). So there seems to be a flexdll issue remaining in addition to https://github.com/ocaml/flexdll/pull/112 @shym :grimacing:
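One way to check whether the failures depend on the two loads overlapping would be to serialize them behind a lock and rerun the reproducer. The following is only a minimal sketch of such a wrapper, assuming OCaml 5 (where Mutex and Domain are in the standard library). The names `load_lock`, `serialized` and `bump` are hypothetical, and a plain counter stands in for the Dynlink call so that the block runs without the dynlink library. Whether serializing the loads actually avoids the failures has not been verified in this thread.

```ocaml
(* Hypothetical helper, not part of the original reproducer:
   run [f] while holding a single global lock. *)
let load_lock = Mutex.create ()

let serialized f =
  Mutex.lock load_lock;
  (* Release the lock even if [f] raises. *)
  Fun.protect ~finally:(fun () -> Mutex.unlock load_lock) f

(* Stand-in for the critical section; in the reproducer this would be
   the call to [loadfile "libB.cmxs"]. *)
let counter = ref 0

let bump () = serialized (fun () -> incr counter)

let () =
  let worker () = for _i = 1 to 1000 do bump () done in
  let d1 = Domain.spawn worker in
  let d2 = Domain.spawn worker in
  Domain.join d1;
  Domain.join d2;
  Printf.printf "counter = %d\n" !counter
```

In the reproducer, the two calls would become `serialized (fun () -> loadfile "libB.cmxs")`. If the errors disappear with the lock in place, that would point towards a race in the concurrent load path rather than a problem with a single load.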
The weekly 5.1.1 run triggered a Dynlink stress test crash on MinGW:
https://github.com/ocaml-multicore/multicoretests/actions/runs/8406318839/job/23020071322
random seed: 398767628
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s Lin Dynlink stress test with Domain
File "src/dynlink/dune", line 14, characters 7-16:
14 | (name lin_tests)
^^^^^^^^^
(cd _build/default/src/dynlink && ./lin_tests.exe --verbose)
Command exited with code -1073741819.
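For reference, the exit code above is easier to read as an unsigned 32-bit value: -1073741819 corresponds to 0xC0000005, the Windows status code for an access violation (STATUS_ACCESS_VIOLATION), i.e. the Windows counterpart of a segfault. A small sketch of the conversion, assuming a 64-bit OCaml where `int` is wide enough to hold the mask; `ntstatus_of_exit_code` is a hypothetical helper name:

```ocaml
(* Reinterpret a signed 32-bit process exit code as an unsigned 32-bit
   value. Assumes a 64-bit OCaml, where [int] has 63 bits. *)
let ntstatus_of_exit_code (code : int) : int = code land 0xFFFF_FFFF

let () =
  let code = -1073741819 in
  (* Prints: exit code -1073741819 = 0xC0000005 *)
  Printf.printf "exit code %d = 0x%08X\n" code (ntstatus_of_exit_code code)
```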
FTR, while dusting off #399 for merging, I discovered that the parallel Dynlink issue also affects MSVC - because it also uses FlexDLL under the surface.
Here's an example MSVC trunk run (which I got running before bytecode): https://github.com/ocaml-multicore/multicoretests/actions/runs/8438568844/job/23111051221?pr=399
random seed: 373847262
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s Lin Dynlink stress test with Domain
File "src/dynlink/dune", line 14, characters 7-16:
14 | (name lin_tests)
^^^^^^^^^
(cd _build/default/src/dynlink && ./lin_tests.exe --verbose)
Command exited with code -1073741819.
[ ] 0 0 0 0 / 1000 0.0s Lin Dynlink stress test with Domain (generating)