[ocaml5-issue] Windows trunk bytecode domain_spawntree crash or deadlock

Open jmid opened this issue 2 years ago • 7 comments

Today's CI run surfaced a Windows trunk bytecode crash in src/domain/domain_spawntree.ml: https://github.com/ocaml-multicore/multicoretests/actions/runs/5154525696/jobs/9283085877

random seed: 502502158
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

jmid • Jun 02 '23 15:06

Found another occurrence of this causing a live/deadlock: https://github.com/ocaml-multicore/multicoretests/actions/runs/5242663626/jobs/9466351902

random seed: 320533040
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)Terminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

jmid • Jun 15 '23 21:06

Observed another variant of this on the Mingw Windows 5.0.0 workflow https://github.com/ocaml-multicore/multicoretests/actions/runs/5565429087/job/15072781545

random seed: 221745155
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
Fatal error: no domain lock held
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code 3.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

jmid • Aug 14 '23 15:08

Crash seen again on Mingw bytecode trunk: https://github.com/ocaml-multicore/multicoretests/actions/runs/6093487146/job/16533243147

random seed: 373304996
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

jmid • Sep 07 '23 07:09

Just saw this as a deadlock on Mingw 5.1.0~rc3 (native, not bytecode): https://github.com/ocaml-multicore/multicoretests/actions/runs/6160240834/job/16716723779?pr=395

random seed: 238601704
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)
[ ]   13    0    0   13 /  100    91.4s domain_spawntree - with Atomic
[ ]   21    0    0   21 /  100   199.6s domain_spawntree - with AtomicTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

jmid • Sep 13 '23 07:09

Observed on a MSVC-restoring branch (so on current trunk): https://github.com/shym/multicoretests/actions/runs/7169794449/job/19520999732#step:17:92

random seed: 529644456
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
Fatal error: Failed to create domain
Fatal error: File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073740791.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

shym • Dec 11 '23 17:12

Error -1073740791 seems to happen very consistently on the MSVC port, the latest instance being:

random seed: 405994358
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)
Fatal error: Failed to create domain
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073740791.

but also with seeds 437567822, 428257872,...

According to MS documentation, -1073740791 (aka 0xc0000409) is:

STATUS_STACK_BUFFER_OVERRUN: The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application.

and -1073741819 (aka 0xc0000005) is:

STATUS_ACCESS_VIOLATION: The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

Those sound like two flavours of segfault. Could the difference in error codes shed any light on the cause or, on the contrary, suggest that they are separate issues?
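
For reference, the signed exit codes printed by dune correspond to the hexadecimal NTSTATUS values quoted above once they are reinterpreted as unsigned 32-bit integers. A minimal Python sketch of that conversion follows; the helper name `to_ntstatus` and the lookup table are hypothetical and only cover the two codes discussed in this issue:

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex NTSTATUS value."""
    # Masking with 0xFFFFFFFF applies the two's-complement conversion.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# Status names taken from the Microsoft descriptions quoted in this thread.
KNOWN_STATUSES = {
    "0xC0000005": "STATUS_ACCESS_VIOLATION",
    "0xC0000409": "STATUS_STACK_BUFFER_OVERRUN",
}

for code in (-1073741819, -1073740791):
    hex_code = to_ntstatus(code)
    print(code, hex_code, KNOWN_STATUSES.get(hex_code, "unknown"))
```

This only decodes the number; it says nothing about which component raised the status.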

shym • Dec 19 '23 16:12

Debugging this further, it seems that the 0xC0000409 errors I saw on the MSVC port were caused by the abort tracked in #428. So these would indeed be two different things.

shym • Dec 21 '23 19:12

Closing this as we haven't seen it in a year. We can reopen if it reappears.

jmid • Dec 20 '24 15:12