[ocaml5-issue] Windows trunk bytecode domain_spawntree crash or deadlock
Today a Windows trunk bytecode run crashed on src/domain/domain_spawntree.ml: https://github.com/ocaml-multicore/multicoretests/actions/runs/5154525696/jobs/9283085877
random seed: 502502158
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic
File "src/domain/dune", line 14, characters 7-23:
14 | (name domain_spawntree)
^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073741819.
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic (generating)
Found another occurrence of this causing a live/deadlock: https://github.com/ocaml-multicore/multicoretests/actions/runs/5242663626/jobs/9466351902
random seed: 320533040
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic (generating)Terminate batch job (Y/N)?
^CFatal error: exception User interruption
Error: The operation was canceled.
Observed another variant of this on the Mingw Windows 5.0.0 workflow https://github.com/ocaml-multicore/multicoretests/actions/runs/5565429087/job/15072781545
random seed: 221745155
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic
Fatal error: no domain lock held
File "src/domain/dune", line 14, characters 7-23:
14 | (name domain_spawntree)
^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code 3.
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic (generating)
Crash seen again on Mingw bytecode trunk: https://github.com/ocaml-multicore/multicoretests/actions/runs/6093487146/job/16533243147
random seed: 373304996
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic
File "src/domain/dune", line 14, characters 7-23:
14 | (name domain_spawntree)
^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073741819.
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic (generating)
Just saw this as a deadlock on Mingw 5.1.0~rc3 (native, not bytecode):
https://github.com/ocaml-multicore/multicoretests/actions/runs/6160240834/job/16716723779?pr=395
random seed: 238601704
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic (generating)
[ ] 13 0 0 13 / 100 91.4s domain_spawntree - with Atomic
[ ] 21 0 0 21 / 100 199.6s domain_spawntree - with AtomicTerminate batch job (Y/N)?
^CFatal error: exception User interruption
Error: The operation was canceled.
Observed on an MSVC-restoring branch (so on current trunk): https://github.com/shym/multicoretests/actions/runs/7169794449/job/19520999732#step:17:92
random seed: 529644456
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic
Fatal error: Failed to create domain
Fatal error: File "src/domain/dune", line 14, characters 7-23:
14 | (name domain_spawntree)
^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073740791.
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic (generating)
Error -1073740791 seems to happen very consistently on the MSVC port, the latest instance being:
random seed: 405994358
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic
[ ] 0 0 0 0 / 100 0.0s domain_spawntree - with Atomic (generating)
Fatal error: Failed to create domain
File "src/domain/dune", line 14, characters 7-23:
14 | (name domain_spawntree)
^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073740791.
but also with seeds 437567822, 428257872,...
According to MS documentation, -1073740791 (aka 0xc0000409) is:
STATUS_STACK_BUFFER_OVERRUN: The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application.
and -1073741819 (aka 0xc0000005) is:
STATUS_ACCESS_VIOLATION: The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
Those sound like two flavors of segfault. Could the difference in error codes shed any light on the cause or, on the contrary, suggest that they are separate issues?
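For reference, the mapping between the signed exit codes that dune prints and the hex NTSTATUS values can be checked with a small OCaml snippet. This is only a sketch: the helper names `ntstatus_hex` and `ntstatus_name` are made up for illustration, and the lookup covers just the two codes quoted above.

```ocaml
(* Render a signed 32-bit Windows exit code in the unsigned hex form
   used by the NTSTATUS documentation. Hypothetical helper. *)
let ntstatus_hex (code : int) : string =
  (* Int32.of_int keeps the low 32 bits; %lx prints an int32 as unsigned hex. *)
  Printf.sprintf "0x%08lx" (Int32.of_int code)

(* Only the two codes discussed in this thread; anything else is unnamed. *)
let ntstatus_name (code : int) : string option =
  match ntstatus_hex code with
  | "0xc0000005" -> Some "STATUS_ACCESS_VIOLATION"
  | "0xc0000409" -> Some "STATUS_STACK_BUFFER_OVERRUN"
  | _ -> None

let () =
  List.iter
    (fun code ->
      let name =
        Option.value (ntstatus_name code) ~default:"(not covered here)"
      in
      Printf.printf "%d = %s %s\n" code (ntstatus_hex code) name)
    [ -1073741819; -1073740791 ]
```

The same conversion applies to any other negative exit code seen in the CI logs; codes outside the two listed are simply printed without a name.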
Debugging this further, it seems that the 0xC0000409 errors I saw on the MSVC port were caused by the abort tracked in #428. So these would indeed be two different things.
Closing this as we haven't seen it in a year. We can reopen if it reappears.