openj9
openj9 copied to clipboard
Recompile FSD Bodies generated under -XX:+DebugOnRestore
- Recompile FSD Bodies generated pre-checkpoint under -XX:+DebugOnRestore
- Disable sample based recompilation pre-checkpoint under -XX:+DebugOnRestore
Parent issue: https://github.com/eclipse-openj9/openj9/issues/18866
@mpirvu could you please review?
@mpirvu good for review again.
@mpirvu good for review again.
jenkins test sanity all jdk17
AIX failure due to https://github.com/eclipse-openj9/openj9/issues/8625:
[2024-03-04T16:49:04.498Z] === Output from failing command(s) repeated here ===
[2024-03-04T16:49:04.498Z] * For target buildtools_tools_jigsaw_classes__the.BUILD_JIGSAW_TOOLS_batch:
[2024-03-04T16:49:04.498Z] Sjavac server failed to initialize: Deadlock condition if locked
[2024-03-04T16:49:04.498Z] Process output:
[2024-03-04T16:49:04.498Z] <End of process output>
[2024-03-04T16:49:04.498Z] IOException caught during compilation: Server failed to initialize: Deadlock condition if locked
Failed tests on aarch64,s390x are cmdLineTester_criu_jitPostRestore under -XX:+DebugOnRestore -Xjit:count=0; so I'll need to look into that.
Probably related to the fact that the recompilations are triggered by the sampler thread with the comp monitor in hand, and so because the comp monitor was already in hand, the sampler thread doesn't call exit() enough times before waiting on the queue slot monitor. Will need to investigate to confirm this.
The test is failing because on restore, the restoring thread calls triggerCompilationOfFailedCompilationsPreCheckpoint with the comp monitor in hand, which eventually (through a series of calls) calls compileOnSeparateThread, which also acquires the comp monitor at the start. In a sync compile, it will release the comp monitor and wait on the queue slot monitor, but because the monitor's entry count is now still 1, the restoring thread still owns it, causing the deadlock.
I spoke to Marius offline about this, and I think I'm going to need to rearchitect some of the infra, especially when considering future PRs to support Debug on Restore. Converting this PR to a draft until I get the infra work done.
~@mpirvu This PR should be good for review now~
Still seeing a hang with -XX:+DebugOnRestore -Xjit:count=0 that I can't explain at all...
@mpirvu good for review now; https://github.com/eclipse-openj9/openj9/pull/19260 fixed the hang issue and I've addressed the review comments.
@mpirvu review reminder.
jenkins test sanity all jdk17
A timeout on aarch64
22:49:05 Testing: Create and Restore Criu Checkpoint Image once - TestDelayedOperations
22:49:05 Test start time: 2024/04/10 02:49:02 Coordinated Universal Time
22:49:05 Running command: bash /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/jdkbinary/j2sdk-image/bin/java " -XX:+DebugOnRestore -Xjit:count=0 " org.openj9.criu.TestDelayedOperations 1 1 false false
22:49:05 Time spent starting: 6 milliseconds
22:54:12 ***[TEST INFO 2024/04/10 02:54:02] ProcessKiller detected a timeout after 300000 milliseconds!***
22:54:12 ***[TEST INFO 2024/04/10 02:54:02] executing /usr/bin/gdb -batch -x /tmp/debugger12621056414577160408.txt bash 1004471***
22:54:12 GDB OUT No shared libraries loaded at this time.
22:54:12 INFO: Running '/usr/bin/gdb' failed with rc = 1
22:54:12 GDB ERR Could not attach to process. If your uid matches the uid of the target
22:54:12 GDB ERR process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
22:54:12 GDB ERR again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
22:54:12 GDB ERR ptrace: Operation not permitted.
22:54:12 GDB ERR /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/output_17127138677385/cmdLineTester_criu_nonPortableRestore_10/1004471: No such file or directory.
22:54:12 GDB ERR /tmp/debugger12621056414577160408.txt:2: Error in sourced command file:
22:54:12 GDB ERR The program has no registers now.
22:54:12
22:54:12 INFO: Sleep for 60000 millis before next capture.
22:55:08 ***[TEST INFO 2024/04/10 02:55:03] executing /usr/bin/gdb -batch -x /tmp/debugger12621056414577160408.txt bash 1004471***
22:55:08 GDB OUT No shared libraries loaded at this time.
22:55:08 INFO: Running '/usr/bin/gdb' failed with rc = 1
22:55:08 GDB ERR Could not attach to process. If your uid matches the uid of the target
22:55:08 GDB ERR process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
22:55:08 GDB ERR again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
22:55:08 GDB ERR ptrace: Operation not permitted.
22:55:08 GDB ERR /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/output_17127138677385/cmdLineTester_criu_nonPortableRestore_10/1004471: No such file or directory.
22:55:08 GDB ERR /tmp/debugger12621056414577160408.txt:2: Error in sourced command file:
22:55:08 GDB ERR The program has no registers now.
On x86 there are many failures like:
01:47:56 Testing: Envvar test6
01:47:56 Test start time: 2024/04/10 01:47:56 Eastern Standard Time
01:47:56 Running command: bash /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/jdkbinary/j2sdk-image/bin/java " -Xjit -XX:+CRIURestoreNonPortableMode -Xtrace:print=j9vm.735" org.openj9.criu.EnvVarFileTest EnvVarFileTest6 1 false false
01:47:56 Time spent starting: 7 milliseconds
01:48:00 Time spent executing: 3222 milliseconds
01:48:00 Test result: FAILED
01:48:00 Output from test:
01:48:00 [OUT] start running script
01:48:00 [OUT] export GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVEC,-XSAVE,-AVX2,-ERMS,-AVX,-AVX_Fast_Unaligned_Load
01:48:00 [OUT] export LD_BIND_NOT=on
01:48:00 [OUT] /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/jdkbinary/j2sdk-image/bin/java -XX:+EnableCRIUSupport -Xjit -XX:+CRIURestoreNonPortableMode -Xtrace:print=j9vm.735 -cp /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criu.jar org.openj9.criu.EnvVarFileTest EnvVarFileTest6 1
01:48:00 [OUT] WARN: File already exists but should not
01:48:00 [OUT] Pre-checkpoint
01:48:00 [OUT] main: Wed Apr 10 01:47:56 EDT 2024, Performing CRIUSupport.checkpointJVM(), System.currentTimeMillis(): 1712728076866, System.nanoTime(): 29509063139907681
01:48:00 [OUT] initiate restore
01:48:00 [OUT] pie: 14762: Error (criu/pie/restorer.c:1839): prctl failed @1839 with -1
01:48:00 [OUT] pie: 14762: Error (criu/pie/restorer.c:1840): prctl failed @1840 with -1
01:48:00 [OUT] pie: 14762: Error (criu/pie/restorer.c:1841): prctl failed @1841 with -1
@dsouzai is the timeout on aarch64 of concern?
I wouldn't have thought so, but the two hangs happen under -XX:+DebugOnRestore -Xjit:count=0, so will need to look into it I guess.
Latest force push closed quite a few number of holes such as:
- Remove methods from various memoized lists in Post Restore Options processing when methods are filtered
- Purge memoized lists at shutdown
- Don't remember failed compilation during shutdown
- Don't retry compilation during shutdown
Ultimately, the aarch64 issue was due to the CRRuntime Thread never being notified at shutdown.
jenkins test sanity alinux64 jdk17
@mpirvu good for review again
jenkins test sanity all jdk21
x86 failures seem to be the prctl failed @1839 with -1 issue; should finally be good for merging 🙌🏾
xlinux had a bunch of CRIU related failures with
[OUT] pie: 11648: Error (criu/pie/restorer.c:1839): prctl failed @1839 with -1
This is a known issue, so I am going to merge this PR.