[ocaml5-issue] Deadlock in Dynlink test on Cygwin+MinGW+MSVC

Open shym opened this issue 2 years ago • 8 comments

Deadlock observed in a run on trunk Cygwin: https://github.com/shym/multicoretests/actions/runs/4367430739/jobs/7638729550#step:21:764

Wed, 08 Mar 2023 22:46:19 GMT random seed: 366632243
Wed, 08 Mar 2023 22:46:19 GMT generated error fail pass / total     time test name
Wed, 08 Mar 2023 22:46:19 GMT
Wed, 08 Mar 2023 22:46:19 GMT [ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain
Wed, 08 Mar 2023 22:47:55 GMT [ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain (generating)
Wed, 08 Mar 2023 22:49:08 GMT [ ]    0    0    0    0 /  100    96.0s negative Lin DSL Dynlink test with Domain (shrinking:    1)
Thu, 09 Mar 2023 00:38:34 GMT [ ]    0    0    0    0 /  100   168.9s negative Lin DSL Dynlink test with Domain (shrinking:    3)
Thu, 09 Mar 2023 00:38:34 GMT Error: The operation was canceled.

As the code paths are completely different between Cygwin (which provides a dlopen()) and Windows (which uses flexdll), this is probably not related to #290.

shym, Mar 09 '23 09:03

Ah - actually, we do still use flexdll on Cygwin, so I expect this is related.

dra27, Mar 09 '23 12:03

Another point of interest, related but possibly involving something else: the Dynlink test on Cygwin can end abruptly (as with a segfault) while reporting no error ($? is 0). Perhaps there is another issue in the way some errors get dropped? :thinking:

shym, Apr 05 '23 15:04

We are seeing several Cygwin timeouts during Dynlink, which may well be this bug being triggered:

  • https://github.com/ocaml-multicore/multicoretests/actions/runs/6049285277/job/16416279067 5.1
  • https://github.com/ocaml-multicore/multicoretests/actions/runs/5950931125/job/16139719760 5.1
  • https://github.com/ocaml-multicore/multicoretests/actions/runs/5942185755/job/16114588868 5.1
  • https://github.com/ocaml-multicore/multicoretests/actions/runs/6042013562/job/16396215009 trunk

jmid, Sep 04 '23 13:09

Seen again on Cygwin 5.1.0~rc2 when merging #389 into main https://github.com/ocaml-multicore/multicoretests/actions/runs/6077015167/job/16485985404

random seed: 153981
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain
[ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain (generating)
Error: The operation was canceled.

jmid, Sep 05 '23 14:09

This triggered again on the 0.3 branch for Cygwin trunk part1 https://github.com/ocaml-multicore/multicoretests/actions/runs/6481561492/job/17599326796

random seed: 406303381
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain
[ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain (generating)
Error: The operation was canceled.

jmid, Oct 12 '23 07:10

I've spent some time creating a reproducer for this:

libB.ml:

let value = 34

repro.ml:

let loadfile f =
  try Dynlink.loadfile (Dynlink.adapt_filename f)
  with Dynlink.Error (Dynlink.Module_already_loaded _) -> ()

let dont_crash () =
  let wait = Atomic.make true in
  let dom1 = Domain.spawn (fun () ->
    while Atomic.get wait do Domain.cpu_relax () done;
    loadfile "libB.cmxs") in
  let dom2 = Domain.spawn (fun () ->
    Atomic.set wait false;
    loadfile "libB.cmxs") in
  let _ = Domain.join dom1 in
  let _ = Domain.join dom2 in
  ()

let _ =
  for i=1 to 1000 do
    Printf.printf "%i %!" i;
    dont_crash ()
  done

Makefile:

all:
	ocamlopt -g -shared libB.ml -o libB.cmxs
	ocamlopt -g -I +dynlink dynlink.cmxa repro.ml -o repro.exe

clean:
	rm -f libB.cmi libB.cmx libB.o libB.cmxs repro.cmi repro.cmx repro.o repro.exe

On MinGW (5.1.0, 5.1.1, 5.2.0~alpha1, trunk) this causes a range of different errors:

  • hangs
  • segfaults
  • early exits
  • various Dynlink.Errors (bad object, not an OCaml plugin, missing frametable for libB, ...)

On MinGW 5.0.0 the errors trigger more rarely (but can still occur). On Cygwin I've observed similar behaviour (no segfaults though). On Linux I've not been able to trigger the issue.

I've found this: https://github.com/ocaml/flexdll/issues/120, which ticks the right boxes, as I believe flexdll is involved on both MinGW and Cygwin (according to David's remark above). So there seems to be a flexdll issue remaining in addition to https://github.com/ocaml/flexdll/pull/112 @shym :grimacing:
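
One way to check whether concurrent loads are the trigger would be to serialize the Dynlink.loadfile calls behind a mutex and see whether the failures disappear. A minimal sketch, assuming OCaml 5 (where Mutex and Domain are in the standard library); `with_lock` and `dynlink_lock` are hypothetical names that are not part of the reproducer. The Dynlink call is left in a comment so the snippet builds without dynlink.cmxa, and a shared counter stands in for the critical section:

```ocaml
(* Hypothetical helper: run [f] while holding [m], releasing the lock
   even if [f] raises. *)
let with_lock m f =
  Mutex.lock m;
  Fun.protect ~finally:(fun () -> Mutex.unlock m) f

let dynlink_lock = Mutex.create ()

(* In repro.ml the two loads would become (assumption, untested on Windows):
     with_lock dynlink_lock (fun () -> loadfile "libB.cmxs") *)

(* Stand-in critical section: a plain ref incremented from two domains. *)
let counter = ref 0

let bump n =
  for _ = 1 to n do
    with_lock dynlink_lock (fun () -> incr counter)
  done

let () =
  let d1 = Domain.spawn (fun () -> bump 10_000) in
  let d2 = Domain.spawn (fun () -> bump 10_000) in
  Domain.join d1;
  Domain.join d2;
  Printf.printf "%d\n" !counter
```

If the errors vanish with the lock in place, that would point at a race in the loading path rather than at the plugin itself; it would not say whether the race lives in Dynlink or in flexdll.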

jmid, Mar 15 '24 11:03

The weekly 5.1.1 run triggered a Dynlink stress test crash on MinGW: https://github.com/ocaml-multicore/multicoretests/actions/runs/8406318839/job/23020071322

random seed: 398767628
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Dynlink stress test with Domain
File "src/dynlink/dune", line 14, characters 7-16:
14 |  (name lin_tests)
            ^^^^^^^^^
(cd _build/default/src/dynlink && ./lin_tests.exe --verbose)
Command exited with code -1073741819.
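
For readers unfamiliar with Windows exit codes: -1073741819 is the signed 32-bit form of 0xC0000005, the NTSTATUS value for an access violation (the Windows counterpart of a segfault). A small sketch of the conversion; `ntstatus_hex` is a hypothetical helper name, not something from the test suite:

```ocaml
(* Reinterpret a signed 32-bit exit code as an unsigned hexadecimal
   NTSTATUS value. *)
let ntstatus_hex code =
  Printf.sprintf "0x%08lX" (Int32.of_int code)

let () = print_endline (ntstatus_hex (-1073741819))
```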

jmid, Mar 26 '24 14:03

FTR, while dusting off #399 for merging, I discovered that the parallel Dynlink issue also affects MSVC - because it also uses FlexDLL under the surface.

Here's an example MSVC trunk run (which I got running before bytecode): https://github.com/ocaml-multicore/multicoretests/actions/runs/8438568844/job/23111051221?pr=399

random seed: 373847262
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Dynlink stress test with Domain
File "src/dynlink/dune", line 14, characters 7-16:
14 |  (name lin_tests)
            ^^^^^^^^^
(cd _build/default/src/dynlink && ./lin_tests.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 / 1000     0.0s Lin Dynlink stress test with Domain (generating)

jmid, Mar 26 '24 16:03