
Add gc tests

Open jmid opened this issue 1 year ago • 10 comments

This PR adds initial tests of the Gc module - and in particular testing of Gc.compact.

Exercising Gc.counters already triggered a known issue locally, which has recently been fixed on trunk: https://github.com/ocaml/ocaml/pull/13370. I therefore expect the parallel test to fail on 5.2.0 and earlier.

This can be extended (and improved) in many ways, for example by adding (big)arrays, Weak, Ephemeron, custom blocks, finalizers, and dynamic Gc_ctrl changes.
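For readers unfamiliar with the functions involved, here is a minimal sequential sketch of the kind of calls being exercised. It is not the STM test added by this PR; the helper names `churn` and `counters_monotone` are made up for illustration, and the only property checked is that the cumulative values returned by Gc.counters do not decrease across some allocation and a Gc.compact.

```ocaml
(* Hypothetical sanity check; not the STM test added by this PR. *)

(* Allocate some short-lived lists so the counters have something to count. *)
let churn n =
  for i = 1 to n do
    ignore (Sys.opaque_identity (List.init 64 (fun j -> i + j)))
  done

(* Sample Gc.counters before and after allocating and compacting, and
   report whether each cumulative counter stayed non-decreasing. *)
let counters_monotone n =
  let (minor0, promoted0, major0) = Gc.counters () in
  churn n;
  Gc.compact ();
  let (minor1, promoted1, major1) = Gc.counters () in
  minor1 >= minor0 && promoted1 >= promoted0 && major1 >= major0

let () =
  Printf.printf "counters monotone: %b\n" (counters_monotone 1_000)
```

A parallel variant would run the same calls from several domains at once, which is where the issues discussed in this thread show up.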

jmid avatar Aug 27 '24 16:08 jmid

CI summary for ca8cf93:

  • 11 5.2 workflows failed as expected due to the Gc test
    • 32bit 5.2 failed with a segfault
    • Bytecode aborted with Fatal error: allocation failure during minor GC
    • Cygwin 5.2 aborted with Fatal error: allocation failure during minor GC
    • FP 5.2 aborted with Fatal error: allocation failure during minor GC
    • Linux 5.2 aborted with Fatal error: allocation failure during minor GC
    • Linux 5.2 debug aborted with Fatal error: allocation failure during minor GC
    • MinGW 5.2 exited with code -1073741819
    • MinGW bytecode 5.2 exited with code -1073741819
    • macOS-ARM64 5.2 failed with a segfault
    • macOS-intel 5.2 got signal BUS
    • freebsd-amd64-5.2 failed with a segfault
  • 3 Cygwin/MinGW trunk workflows failed with a version mismatch after the 5.3 branching
    • Cygwin trunk
    • MinGW bytecode trunk
    • MinGW trunk
  • Linux trunk debug failed with runtime/shared_heap.c; line 1392 ### Assertion failed: local->stats.pool_live_words == pool_stats.live #470

Out of 33 workflows, 15 failed: 12 with genuine issues and 3 with CI setup issues

jmid avatar Aug 27 '24 19:08 jmid

I removed the stress test, as the positive parallel test is already stress testing. I've also rebased on #471, which should take care of the MinGW+Cygwin version mismatch.

I still expect the 5.2 workflows to trigger ocaml/ocaml#13370 (unsure whether the cmd should be disabled under 5.2 testing :thinking: )
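One way such a command could be disabled on older compilers is to gate it on Sys.ocaml_version. The sketch below is only an illustration: the `cmd` type, `parse_version`, `enabled_cmds`, and the (5, 3) threshold are all assumptions made for the example, not code or decisions from this PR.

```ocaml
(* Hypothetical command type; the PR's real command type is not shown here. *)
type cmd = Counters | Compact

(* Extract (major, minor) from a version string such as "5.2.0". *)
let parse_version s = Scanf.sscanf s "%d.%d" (fun major minor -> (major, minor))

(* Assumption: treat Counters as safe only from 5.3 on; the exact cutoff
   would need to be checked against where ocaml/ocaml#13370 landed. *)
let enabled_cmds version =
  if version >= (5, 3) then [Counters; Compact] else [Compact]

let () =
  let version = parse_version Sys.ocaml_version in
  Printf.printf "%d command(s) enabled\n" (List.length (enabled_cmds version))
```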

jmid avatar Aug 29 '24 09:08 jmid

CI summary for 9ef90b5:

  • 7 5.2 workflows failed as expected due to the Gc test
    • 32bit 5.2 failed the parallel STM Gc test with Allocated_bytes : -7.47499188671e+196
    • Bytecode 5.2 aborted with Fatal error: allocation failure during minor GC
    • Cygwin 5.2 aborted with Fatal error: allocation failure during minor GC
    • FP 5.2 aborted with Fatal error: allocation failure during minor GC
    • MinGW 5.2 failed with Fatal error: allocation failure during minor GC
    • macOS-intel 5.2 failed with BUS error
    • linux-s390x-5.2 aborted with Fatal error: allocation failure during minor GC
  • 3 debug runtime runs failed with an assertion error:
    • Linux 5.2 debug aborted with runtime/shared_heap.c; line 1392 ### Assertion failed: local->stats.pool_live_words == pool_stats.live
    • Linux 5.3 debug failed with runtime/shared_heap.c; line 1392 ### Assertion failed: local->stats.pool_live_words == pool_stats.live
    • Linux trunk debug failed with runtime/shared_heap.c; line 1392 ### Assertion failed: local->stats.pool_live_words == pool_stats.live

Out of 45 workflows 10 failed, all with genuine issues

jmid avatar Sep 03 '24 19:09 jmid

CI summary for 6b2669b

  • 10 5.2 workflows failed as expected due to the Gc test #474
    • 32bit 5.2 failed STM Gc test parallel with a negative allocated_bytes
    • Cygwin 5.2 aborted with Fatal error: allocation failure during minor GC
    • FP 5.2 aborted with Fatal error: allocation failure during minor GC
    • Linux 5.2 debug aborted with Fatal error: allocation failure during minor GC
    • MinGW bytecode 5.2 failed with Fatal error: allocation failure during minor GC
    • macOS-ARM64 5.2 failed with signal BUS
    • macOS-intel 5.2 failed with signal BUS
    • linux-arm64-5.2 aborted with Fatal error: allocation failure during minor GC
    • linux-s390x-5.2 aborted with Fatal error: allocation failure during minor GC
    • freebsd-amd64-5.2 aborted with Fatal error: allocation failure during minor GC
  • 2 debug runtime workflows failed with an assertion error #470
    • Linux 5.3 debug failed with runtime/shared_heap.c; line 1392 ### Assertion failed: local->stats.pool_live_words == pool_stats.live
    • Linux trunk debug failed with runtime/shared_heap.c; line 1392 ### Assertion failed: local->stats.pool_live_words == pool_stats.live
  • linux-ppc64le-5.2 found an unexpected STM Sys test parallel counterexample #466

Out of 45 workflows 13 failed with 12 genuine issues and 1 false alarm

jmid avatar Sep 03 '24 20:09 jmid

CI summary for e9c9a6a

  • all 3 Linux debug runs aborted with runtime/shared_heap.c; line 1392 ### Assertion failed: local->stats.pool_live_words == pool_stats.live #470
    • Linux 5.2 debug
    • Linux 5.3 debug
    • Linux trunk debug
  • linux-s390x-5.2 failed STM Gc test parallel with a quick_stat output mismatch

Out of 45 workflows 4 failed with genuine issues

jmid avatar Sep 25 '24 14:09 jmid

CI summary for 6c1252f

  • 9 workflows failed due to minor_heap_size page size rounding
    • 32bit 5.2, 5.3, trunk
    • Cygwin 5.2, 5.3, trunk
    • macOS-ARM64 5.2, trunk
    • linux-ppc64le-5.2
  • all 3 Linux debug runs aborted with runtime/shared_heap.c; line 1392 ### Assertion failed: local->stats.pool_live_words == pool_stats.live #470
    • Linux 5.2, 5.3, trunk debug
  • linux-s390x-5.2 timed out during STM Gc stress test parallel #421
  • macOS-ARM64 5.3 segfaulted in STM Gc stress test parallel #480

Out of 45 workflows 14 failed with 13 genuine issues and 1 CI infrastructure issue. The macOS-ARM64 5.3 crash is disturbing. fd0a3fd should address the pagesize rounding... :crossed_fingers:
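The rounding effect can be observed with a small standalone program. This is a sketch, assuming the runtime rounds a requested minor_heap_size up to a whole number of pages, so the value read back via Gc.get may exceed what was passed to Gc.set and will differ between platforms with different page and word sizes. The helper name `set_and_read_minor_heap` is made up for the example.

```ocaml
(* Hypothetical helper: set the minor heap size (in words), read back what
   the runtime reports, then restore the previous GC settings. *)
let set_and_read_minor_heap requested =
  let saved = Gc.get () in
  Gc.set { saved with Gc.minor_heap_size = requested };
  let actual = (Gc.get ()).Gc.minor_heap_size in
  Gc.set saved;
  actual

let () =
  (* An odd size, unlikely to be a multiple of any page size. *)
  let requested = 100_001 in
  let actual = set_and_read_minor_heap requested in
  Printf.printf "requested %d words, runtime reports %d words\n" requested actual
```

A test that compares the reported size against the requested size for exact equality would therefore be platform dependent under this assumption.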

jmid avatar Sep 25 '24 14:09 jmid

CI summary for fd0a3fd:

  • 3 Linux debug workflows failed with runtime/shared_heap.c; line 1392 ### Assertion failed: local->stats.pool_live_words == pool_stats.live #470
    • Linux 5.2, 5.3, trunk debug
  • 4 MSVC workflows failed to compile due to a header issue fatal error C1189: #error: "No Target Architecture"
    • MSVC 5.3, trunk
    • MSVC bytecode 5.3, trunk

Out of 45 workflows, 3 failed with a genuine error and 4 due to a buggy Windows header in the test-suite. 1ebdda4 should address that, leaving only genuine errors.

Note to self: removing v=0 in debug mode triggers v=63, which adds a bit of noise to the logs for the Gc test...

jmid avatar Sep 25 '24 16:09 jmid

CI summary for 1ebdda4:

  • 3 Linux debug workflows aborted STM Gc stress test parallel with Assertion failed: local->stats.pool_live_words == pool_stats.live
    • Linux 5.2, Linux 5.3, trunk debug
  • macOS-ARM64 5.2 segfaulted in STM Gc stress test parallel #480

Out of 45 workflows 4 failed - all with genuine issues

jmid avatar Sep 27 '24 12:09 jmid

CI summary for e5b93d3

  • 3 32bit workflows all segfaulted in STM implicit Gc stress test parallel or STM implicit Gc test parallel
    • 32bit 5.2, 5.3, trunk
  • 3 Linux debug workflows aborted STM Gc stress test parallel with ### Assertion failed: local->stats.pool_live_words == pool_stats.live
    • 5.2, 5.3, trunk debug
    • In addition, Linux 5.2 debug also failed STM implicit Gc test parallel with a negative allocated_bytes result
  • macOS-ARM64 trunk segfaulted in STM Gc stress test parallel

Out of 45 workflows 7 failed all with genuine errors

jmid avatar Sep 30 '24 08:09 jmid

CI summary for 6a86f20

  • 32bit 5.2 segfaulted in STM implicit Gc test parallel
  • 32bit 5.3 segfaulted in STM implicit Gc stress test parallel
  • 32bit trunk segfaulted in STM implicit Gc stress test parallel
  • Bytecode 5.2 aborted STM implicit Gc test parallel with allocation failure during minor GC
  • Bytecode 5.3 segfaulted in STM implicit Gc stress test parallel
  • Bytecode trunk segfaulted in STM implicit Gc test parallel
  • Linux 5.2 debug
    • aborted STM Gc stress test parallel with Assertion failed: local->stats.pool_live_words == pool_stats.live #470 and
    • failed STM implicit Gc test parallel with a negative allocated_bytes result
  • Linux 5.3 debug failed STM Gc stress test parallel with Assertion failed: local->stats.pool_live_words == pool_stats.live #470
  • Linux trunk debug failed STM Gc stress test parallel with Assertion failed: local->stats.pool_live_words == pool_stats.live #470
  • linux-s390x-5.2 timed out in Lin Bytes test with Thread #421
  • linux-ppc64le-5.2 found an unexpected counterexample in STM Sys test parallel #466

Out of 45 workflows, 11 failed with 9 genuine issues, 1 CI timeout, and 1 false alarm

jmid avatar Sep 30 '24 13:09 jmid

I've just pushed to trigger a fresh run, now that #13553 has been merged to trunk and cherry-picked to 5.3.

In other news, I've investigated the previously observed FreeBSD 5.2 crash. This was easy to reproduce, but turned out to be a known one, namely db586e0b27 - Merge pull request #13370 from DemiMarie/fix-missing-gc-rule

jmid avatar Oct 23 '24 12:10 jmid

CI summary for 4a93462:

  • Linux 5.2 debug failed with an assertion error #470
  • Linux 5.3 debug failed with an assertion error #470
  • Linux trunk debug failed with an assertion error #470
  • macOS-intel 5.2 crashed in STM Gc stress test parallel #480

Out of 41 workflows, 4 failed - all with genuine issues :tada:

(For some reason, the force push didn't trigger multicoretests-ci's four 5.2.0 workflows)

jmid avatar Oct 23 '24 15:10 jmid

CI summary for f7b2fa9

  • Linux 5.3 debug failed with assertion errors #470
  • Linux trunk debug failed with assertion errors #470

Out of 31 workflows 2 failed with genuine errors

jmid avatar Jan 06 '25 14:01 jmid

CI summary for 04f66ce

  • Linux 5.3 debug failed with assertion errors #470
  • Linux trunk debug failed with assertion errors #470

Out of 31 workflows 2 failed with genuine errors

jmid avatar Jan 06 '25 14:01 jmid

CI summary for 0374c67

  • Linux 5.3 debug failed with an assertion error #470
  • Linux trunk debug failed with an assertion error #470
  • MSVC bytecode 5.3 failed to trigger an STM bigarray parallel counterexample #467
  • macOS-ARM64 5.3 crashed in stm_tests_par_stress #480

Out of 33 workflows, 4 failed with 3 genuine issues and 1 false alarm.

jmid avatar Jan 06 '25 14:01 jmid

The experiment in 0374c67 replacing Gc.compact in cleanup worked fine - the previously seen issues are still triggering.

jmid avatar Jan 08 '25 13:01 jmid

First, I've rebased this on main.

Secondly, to make progress, d7ab558 temporarily omits Gc.compact from this PR to avoid triggering the two remaining issues: #470 and #480. If CI is green, I plan to merge and handle re-enabling Gc.compact in a separate one-line PR.

jmid avatar Jan 08 '25 13:01 jmid

CI summary for d7ab558: all 33 workflows passed.

I'll merge this then :tada:

jmid avatar Jan 08 '25 16:01 jmid

CI summary for merge to main: All 34 workflows passed.

jmid avatar Jan 08 '25 21:01 jmid