openj9 icon indicating copy to clipboard operation
openj9 copied to clipboard

Recompile FSD Bodies generated under -XX:+DebugOnRestore

Open dsouzai opened this issue 1 year ago • 1 comments
trafficstars

  • Recompile FSD Bodies generated pre-checkpoint under -XX:+DebugOnRestore
  • Disable sample based recompilation pre-checkpoint under -XX:+DebugOnRestore

Parent issue: https://github.com/eclipse-openj9/openj9/issues/18866

dsouzai avatar Feb 20 '24 15:02 dsouzai

@mpirvu could you please review?

dsouzai avatar Feb 20 '24 15:02 dsouzai

@mpirvu good for review again.

dsouzai avatar Feb 26 '24 16:02 dsouzai

@mpirvu good for review again.

dsouzai avatar Mar 04 '24 15:03 dsouzai

jenkins test sanity all jdk17

mpirvu avatar Mar 04 '24 15:03 mpirvu

AIX failure due to https://github.com/eclipse-openj9/openj9/issues/8625:

[2024-03-04T16:49:04.498Z] === Output from failing command(s) repeated here ===
[2024-03-04T16:49:04.498Z] * For target buildtools_tools_jigsaw_classes__the.BUILD_JIGSAW_TOOLS_batch:
[2024-03-04T16:49:04.498Z] Sjavac server failed to initialize: Deadlock condition if locked
[2024-03-04T16:49:04.498Z] Process output:
[2024-03-04T16:49:04.498Z] <End of process output>
[2024-03-04T16:49:04.498Z] IOException caught during compilation: Server failed to initialize: Deadlock condition if locked

dsouzai avatar Mar 04 '24 18:03 dsouzai

Failed tests on aarch64,s390x are cmdLineTester_criu_jitPostRestore under -XX:+DebugOnRestore -Xjit:count=0; so I'll need to look into that.

dsouzai avatar Mar 05 '24 00:03 dsouzai

Probably related to the fact that the recompilations are triggered by the sampler thread with the comp monitor in hand, and so because the comp monitor was already in hand, the sampler thread doesn't call exit() enough times before waiting on the queue slot monitor. Will need to investigate to confirm this.

dsouzai avatar Mar 05 '24 00:03 dsouzai

The test is failing because on restore, the restoring thread calls triggerCompilationOfFailedCompilationsPreCheckpoint with the comp monitor in hand, which eventually (through a series of calls) calls compileOnSeparateThread, which also acquires the comp monitor at the start. In a sync compile, it will release the comp monitor and wait on the queue slot monitor, but because the monitor's entry count is now still 1, the restoring thread still owns it, causing the deadlock.

I spoke to Marius offline about this, and I think I'm going to need to rearchitect some of the infra, especially when considering future PRs to support Debug on Restore. Converting this PR to a draft until I get the infra work done.

dsouzai avatar Mar 05 '24 16:03 dsouzai

~@mpirvu This PR should be good for review now~

Still seeing a hang with -XX:+DebugOnRestore -Xjit:count=0 that I can't explain at all...

dsouzai avatar Apr 01 '24 13:04 dsouzai

@mpirvu good for review now; https://github.com/eclipse-openj9/openj9/pull/19260 fixed the hang issue and I've addressed the review comments.

dsouzai avatar Apr 05 '24 13:04 dsouzai

@mpirvu review reminder.

dsouzai avatar Apr 09 '24 23:04 dsouzai

jenkins test sanity all jdk17

mpirvu avatar Apr 09 '24 23:04 mpirvu

A timeout on aarch64

22:49:05  Testing: Create and Restore Criu Checkpoint Image once - TestDelayedOperations
22:49:05  Test start time: 2024/04/10 02:49:02 Coordinated Universal Time
22:49:05  Running command: bash /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/jdkbinary/j2sdk-image/bin/java " -XX:+DebugOnRestore -Xjit:count=0 " org.openj9.criu.TestDelayedOperations 1 1 false false
22:49:05  Time spent starting: 6 milliseconds
22:54:12  ***[TEST INFO 2024/04/10 02:54:02] ProcessKiller detected a timeout after 300000 milliseconds!***
22:54:12  ***[TEST INFO 2024/04/10 02:54:02] executing /usr/bin/gdb -batch -x /tmp/debugger12621056414577160408.txt bash 1004471***
22:54:12  GDB OUT No shared libraries loaded at this time.
22:54:12  INFO: Running '/usr/bin/gdb' failed with rc = 1
22:54:12  GDB ERR Could not attach to process.  If your uid matches the uid of the target
22:54:12  GDB ERR process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
22:54:12  GDB ERR again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
22:54:12  GDB ERR ptrace: Operation not permitted.
22:54:12  GDB ERR /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/output_17127138677385/cmdLineTester_criu_nonPortableRestore_10/1004471: No such file or directory.
22:54:12  GDB ERR /tmp/debugger12621056414577160408.txt:2: Error in sourced command file:
22:54:12  GDB ERR The program has no registers now.
22:54:12  
22:54:12  INFO: Sleep for 60000 millis before next capture.
22:55:08  ***[TEST INFO 2024/04/10 02:55:03] executing /usr/bin/gdb -batch -x /tmp/debugger12621056414577160408.txt bash 1004471***
22:55:08  GDB OUT No shared libraries loaded at this time.
22:55:08  INFO: Running '/usr/bin/gdb' failed with rc = 1
22:55:08  GDB ERR Could not attach to process.  If your uid matches the uid of the target
22:55:08  GDB ERR process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
22:55:08  GDB ERR again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
22:55:08  GDB ERR ptrace: Operation not permitted.
22:55:08  GDB ERR /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/output_17127138677385/cmdLineTester_criu_nonPortableRestore_10/1004471: No such file or directory.
22:55:08  GDB ERR /tmp/debugger12621056414577160408.txt:2: Error in sourced command file:
22:55:08  GDB ERR The program has no registers now.

On x86 there are many failures like:

01:47:56  Testing: Envvar test6
01:47:56  Test start time: 2024/04/10 01:47:56 Eastern Standard Time
01:47:56  Running command: bash /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/jdkbinary/j2sdk-image/bin/java " -Xjit -XX:+CRIURestoreNonPortableMode  -Xtrace:print=j9vm.735" org.openj9.criu.EnvVarFileTest EnvVarFileTest6 1 false false
01:47:56  Time spent starting: 7 milliseconds
01:48:00  Time spent executing: 3222 milliseconds
01:48:00  Test result: FAILED
01:48:00  Output from test:
01:48:00   [OUT] start running script
01:48:00   [OUT] export GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVEC,-XSAVE,-AVX2,-ERMS,-AVX,-AVX_Fast_Unaligned_Load
01:48:00   [OUT] export LD_BIND_NOT=on
01:48:00   [OUT] /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/jdkbinary/j2sdk-image/bin/java -XX:+EnableCRIUSupport  -Xjit -XX:+CRIURestoreNonPortableMode  -Xtrace:print=j9vm.735 -cp /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criu.jar org.openj9.criu.EnvVarFileTest EnvVarFileTest6 1
01:48:00   [OUT] WARN: File already exists but should not
01:48:00   [OUT] Pre-checkpoint
01:48:00   [OUT] main: Wed Apr 10 01:47:56 EDT 2024, Performing CRIUSupport.checkpointJVM(), System.currentTimeMillis(): 1712728076866, System.nanoTime(): 29509063139907681
01:48:00   [OUT] initiate restore
01:48:00   [OUT] pie: 14762: Error (criu/pie/restorer.c:1839): prctl failed @1839 with -1
01:48:00   [OUT] pie: 14762: Error (criu/pie/restorer.c:1840): prctl failed @1840 with -1
01:48:00   [OUT] pie: 14762: Error (criu/pie/restorer.c:1841): prctl failed @1841 with -1

mpirvu avatar Apr 10 '24 13:04 mpirvu

@dsouzai is the timeout on aarch64 of concern?

mpirvu avatar Apr 10 '24 13:04 mpirvu

I wouldn't have thought so, but the two hangs happen under -XX:+DebugOnRestore -Xjit:count=0, so will need to look into it I guess.

dsouzai avatar Apr 10 '24 13:04 dsouzai

Latest force push closed quite a few number of holes such as:

  • Remove methods from various memoized lists in Post Restore Options processing when methods are filtered
  • Purge memoized lists at shutdown
  • Don't remember failed compilation during shutdown
  • Don't retry compilation during shutdown

Ultimately, the aarch64 issue was due to the CRRuntime Thread never being notified at shutdown.

dsouzai avatar Apr 11 '24 15:04 dsouzai

jenkins test sanity alinux64 jdk17

dsouzai avatar Apr 11 '24 15:04 dsouzai

@mpirvu good for review again

dsouzai avatar Apr 11 '24 15:04 dsouzai

jenkins test sanity all jdk21

mpirvu avatar Apr 11 '24 19:04 mpirvu

x86 failures seem to be the prctl failed @1839 with -1 issue; should finally be good for merging 🙌🏾

dsouzai avatar Apr 12 '24 13:04 dsouzai

xlinux had a bunch of CRIU related failures with

[OUT] pie: 11648: Error (criu/pie/restorer.c:1839): prctl failed @1839 with -1

This is a known issue, so I am going to merge this PR.

mpirvu avatar Apr 12 '24 13:04 mpirvu