openj9 icon indicating copy to clipboard operation
openj9 copied to clipboard

OpenJDK javax/crypto/Cipher/TestCipherMode Agent communication error

Open pshipton opened this issue 1 year ago • 5 comments

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_OMR_testList_1/123 - p8-java1-ibm10 jdk_security2_1 javax/crypto/Cipher/TestCipherMode.java

11:02:54  AES/KWP/NoPadding with UNWRAP
11:02:54  => expected ISE thrown for update()
11:02:54  => expected ISE thrown for doFinal()
11:02:54  AES/KWP/NoPadding with NONEXISTENT
11:02:54  => expected IPE thrown for init()
11:02:54  
11:02:54  TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details
11:02:54  --------------------------------------------------
11:05:46  Test results: passed: 223; error: 1
11:06:02  Report written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_OMR_testList_1/aqa-tests/TKG/output_17226057497025/jdk_security2_1/report/html/report.html
11:06:02  Results written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_OMR_testList_1/aqa-tests/TKG/output_17226057497025/jdk_security2_1/work
11:06:02  Error: Some tests failed or other problems occurred.

pshipton avatar Aug 02 '24 16:08 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_OpenJDK21_testList_1/29 - p8-java1-ibm04 jdk_security2_0 javax/crypto/Cipher/TestCipherMode.java

15:33:41  AES/KWP/NoPadding with UNWRAP
15:33:41  => expected ISE thrown for update()
15:33:41  => expected ISE thrown for doFinal()
15:33:41  AES/KWP/NoPadding with NONEXISTENT
15:33:41  => expected IPE thrown for init()
15:33:41  
15:33:41  TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details
15:33:41  --------------------------------------------------
15:37:18  Test results: passed: 239; error: 1

pshipton avatar Aug 29 '24 00:08 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/296

pshipton avatar Oct 03 '24 12:10 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/300/

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/299/ these both failed similarly javax/crypto/Cipher/TestCipherMode.java javax/crypto/Cipher/TestGetInstance.java

pshipton avatar Oct 10 '24 01:10 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/302 javax/crypto/Cipher/TestGetInstance.java

pshipton avatar Oct 12 '24 14:10 pshipton

This is 0.48 https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Release_testList_1/28/ javax/crypto/Cipher/TestCipherMode.java

pshipton avatar Oct 17 '24 12:10 pshipton

I suspect this is the same cause, although no test is named. I've seen a number of these. https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/310 jdk_security2_0

00:51:01  Running test jdk_security2_0 ...
00:51:01  ===============================================
00:51:01  jdk_security2_0 Start Time: Wed Oct 23 04:34:08 2024 Epoch Time (ms): 1729658048837
00:51:01  variation: Mode150
00:51:01  JVM_OPTIONS:  -XX:+UseCompressedOops -Xverbosegclog 
00:51:01  { \
00:51:01  echo "";	echo "TEST SETUP:"; \
00:51:01  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done"; \
00:51:01  mkdir -p "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0"; \
00:51:01  cd "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0"; \
00:51:01  echo "";	echo "TESTING:"; \
00:51:01  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xmx512m -jar "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/openjdk/jtreg/lib/jtreg.jar" \
00:51:01  -agentvm -a -ea -esa -v:fail,error,time,nopass -retain:fail,error,*.dmp,javacore.*,heapdump.*,*.trc -ignore:quiet -timeoutFactor:8 -xml:verify  -concurrency:1 -nativepath:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/openjdk-test-image/jdk/jtreg/native" -vmoptions:"-Xmx512m  -XX:+UseCompressedOops -Xverbosegclog  " \
00:51:01  -timeoutHandler:jtreg.openj9.CoreDumpTimeoutHandler -timeoutHandlerDir:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/../../testDependency/lib/openj9jtregtimeouthandler.jar" \
00:51:01  -w ""/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0"/work" \
00:51:01  -r ""/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0"/report" \
00:51:01  -jdk:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image" \
00:51:01  -exclude:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../openjdk/openjdk-jdk/test/jdk/ProblemList.txt" \
00:51:01  -exclude:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/openjdk/excludes/ProblemList_openjdk17-openj9.txt" \
00:51:01   \
00:51:01  -exclude:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../openjdk/excludes/vendors/eclipse/ProblemList_openjdk17.txt" \
00:51:01  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../openjdk/openjdk-jdk/test/jdk:jdk_security2"; \
00:51:01  if [ $? -eq 0 ]; then echo "-----------------------------------"; echo "jdk_security2_0""_PASSED"; echo "-----------------------------------"; cd /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/..;  else echo "-----------------------------------"; echo "jdk_security2_0""_FAILED"; echo "-----------------------------------"; fi; \
00:51:01  echo "";	echo "TEST TEARDOWN:"; \
00:51:01  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done"; \
00:51:01   } 2>&1 | tee -a "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/TestTargetResult";
00:51:01  
00:51:01  TEST SETUP:
00:51:01  JVMSHRC005I No shared class caches available
00:51:01  JVMSHRC005I No shared class caches available
00:51:01  cache cleanup done
00:51:01  
00:51:01  TESTING:
00:51:03  Directory "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0/work" not found: creating
00:51:03  Directory "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0/report" not found: creating
00:51:11  XML output with verification to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/output_17296532351357/jdk_security2_0/work
00:57:54  Cannot contact p8-java1-ibm08: java.lang.InterruptedException
01:04:08  gmake[4]: *** [autoGen.mk:271: jdk_security2_0] Killed

pshipton avatar Oct 23 '24 21:10 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/313/ - p8-java1-ibm06 00:02:49 Cannot contact p8-java1-ibm06: java.lang.InterruptedException

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/312 - p8-java1-ibm12 21:13:32 Cannot contact p8-java1-ibm12: java.lang.InterruptedException

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_OMR_testList_2/138 - p8-java1-ibm10 javax/crypto/Cipher/TestCipherMode.java

08:33:25  AES/KWP/NoPadding with NONEXISTENT
08:33:25  => expected IPE thrown for init()
08:33:25  
08:33:25  TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

pshipton avatar Oct 29 '24 03:10 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/316 jdk_security2_0 javax/crypto/CryptoPermissions/TestExemption.java TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

pshipton avatar Oct 31 '24 15:10 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/317 - p8-java1-ibm10 jdk_security2_0 javax/crypto/CryptoPermissions/CryptoPolicyFallback.java ACTION: main -- Failed. Unexpected exit from test [exit code: 137]

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/316 jdk_security2_1 Cannot contact p8-java1-ibm11: java.lang.InterruptedException

pshipton avatar Nov 01 '24 14:11 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/319 - p8-java1-ibm05 jdk_security2_0 javax/crypto/Cipher/TestCipherMode.java

04:41:21  AES/KWP/NoPadding with NONEXISTENT
04:41:21  => expected IPE thrown for init()
04:41:21  
04:41:21  TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

pshipton avatar Nov 05 '24 18:11 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_OMR_testList_2/145 - p8-java1-ibm10 jdk_security2_1 javax/crypto/Cipher/ByteBuffers.java ACTION: main -- Error. Agent communication error: java.io.EOFException; check console log for any additional details

pshipton avatar Nov 08 '24 18:11 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_OMR_testList_1/147 - p8-java1-ibm03 jdk_security2_0 javax/crypto/Cipher/TestCipherMode.java TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

pshipton avatar Nov 11 '24 15:11 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/323 jdk_security2_0 Cannot contact p8-java1-ibm09: java.lang.InterruptedException

pshipton avatar Nov 12 '24 16:11 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/324 jdk_security2_0 javax/crypto/Cipher/TestCipherMode.java Cannot contact p8-java1-ibm04: java.lang.InterruptedException

pshipton avatar Nov 13 '24 20:11 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/325 Cannot contact p8-java1-ibm02: java.lang.InterruptedException

pshipton avatar Nov 14 '24 16:11 pshipton

@jasonkatonica I've set this as a blocker.

pshipton avatar Nov 14 '24 16:11 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/325 javax/crypto/CipherSpi/TestGCMWithByteBuffer.java.TestGCMWithByteBuffer javax/crypto/CryptoPermission/AllPermCheck.java.AllPermCheck Cannot contact p8-java1-ibm02: java.lang.InterruptedException

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_OpenJDK17_testList_0/39 javax/crypto/Cipher/TestCipherMode.java Cannot contact p8-java1-ibm04: java.lang.InterruptedException

pshipton avatar Nov 14 '24 16:11 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_2/319/ https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/326

pshipton avatar Nov 15 '24 14:11 pshipton

Probably related to https://github.ibm.com/runtimes/infrastructure/issues/10081 It seems likely there is a huge consumption of memory, which causes swapping, perhaps running out of swap space, which may cause the machine to become unusable. Either a memory leak, or some big memory consumption.

pshipton avatar Nov 15 '24 20:11 pshipton

Checking the failures, they occur in different tests. Investigating these tests to see if there are any similarities.

taoliult avatar Nov 18 '24 14:11 taoliult

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/329

pshipton avatar Nov 20 '24 14:11 pshipton

I borrowed an AIX 7.2 (ppc64) machine, rtj-aix72-hotspot-01.rtp.raleigh.ibm.com, to test JDK 21.

Because most test cases failed while running TestCipherMode. So, I first ran all tests under the javax/crypto package, which includes TestCipherMode. Using real-time memory usage monitoring, I checked the system after running the tests 10, 20, and 30 times. No Java process threads were found occupying significant paging space. Most threads used less than 50 MB.

I also run the test groups jdk_security1_0, jdk_security2_0, and jdk_security4_0 for 5 and 10 iterations each.

During the test runs, the Jenkins test agent process continuously increased its paging space usage until the tests completed. Running the three test groups took about 20 minutes per run, during which the Jenkins agent's paging space usage grew to approximately 180 MB. This usage would be increased further with more tests running. However, the paging space was released after the tests completed.

Except of the tests, a process related to the Jenkins user on the machine keep increasing in paging space usage over time. For instance, the usage grew from 73.3 MB last Thursday to 75.5 MB by this Monday.


Apart from the above findings, no Java process threads were found to occupy significant paging space without releasing it. Next, I plan to run the full suite of test cases, same as the failed Jenkins job, try to investigate further.

More details can be found in issue https://github.ibm.com/runtimes/jit-crypto/issues/594

taoliult avatar Nov 25 '24 22:11 taoliult

I suggest "borrowing" and trying on one of the machines where it fails, rather than some other machine.

pshipton avatar Nov 26 '24 16:11 pshipton

@pshipton

Yes, I initially attempted to book those machines where it fails, such as p8-java1-ibm08, and received the error "No nodes with that label". Not sure if it is a release machine or not.

Started by user [[email protected]](https://hyc-runtimes-jenkins.swg-devops.com/user/[email protected])
Rebuilds build #[11918](https://hyc-runtimes-jenkins.swg-devops.com/job/Machine-Reservation/11918)
[Pipeline] Start of Pipeline
[Pipeline] timestamps
[Pipeline] {
[Pipeline] timeout
11:51:24  Timeout set to expire in 12 hr
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Queue)
[Pipeline] echo
11:51:24  NODE value passed is a set of label(s)
[Pipeline] node
11:51:39  Still waiting to schedule task
11:51:39  There are no nodes with the label ‘[p8-java1-ibm08&&!ci.role.build.release](https://hyc-runtimes-jenkins.swg-devops.com/label/p8-java1-ibm08&&!ci.role.build.release/)’

taoliult avatar Nov 27 '24 16:11 taoliult

The machines are at the openj9 jenkins instance. You can't automatically reserve them, but open an issue at https://github.ibm.com/runtimes/infrastructure/issues with the machine name, provide your public key and how long you need the access. https://openj9-jenkins.osuosl.org/label/sw.os.aix/

pshipton avatar Nov 27 '24 18:11 pshipton

paging space allocation is lazy on AIX, i.e. it is not allocated as the back-up of memory allocation, but rather at the point you need to page out because physical memory is low. if you run out of paging space in this situation, AIX kicks start to pick a relevant (to the degree it can) process to kill (in order to release the associated resources).

zl-wang avatar Dec 03 '24 14:12 zl-wang

As suggested, I borrowed the p8-java1-ibm08 to test JDK 21.

Since most Jenkins failures were reported for the TestCipherMode test, so I start from here and ran all tests under the javax/crypto package, including TestCipherMode. I run these tests over 50, 200, and 1000 iterations using the nightly build of JDK 21. During the runs, no Java process threads were observed consuming significant paging/momery space. The results showed that all tests passed, and the paging space remained unchanged before and after the tests.

I did not identify any issues with TestCipherMode or any other tests under the javax/crypto package. A difference from the Jenkins environment is that I logged into the machine directly and used JTReg to execute the tests. I am not sure if this difference may cause the failures. I did not found any issues with the code or the tests during my runs so far.

taoliult avatar Dec 03 '24 14:12 taoliult

Jinhang @WilburZjh mentioned an OpenJDK bug for the similar issue https://bugs.openjdk.org/browse/CODETOOLS-7903082. Same error message failed on different test and different machine (aarch64, os_x).

taoliult avatar Dec 03 '24 14:12 taoliult

paging space allocation is lazy on AIX, i.e. it is not allocated as the back-up of memory allocation, but rather at the point you need to page out because physical memory is low. if you run out of paging space in this situation, AIX kicks start to pick a relevant (to the degree it can) process to kill (in order to release the associated resources).

@zl-wang Thank you for your response. However, one thing we still haven't figured out is why, in almost all cases, the TestCipherMode test process is picked and killed, while no other test is affected.

taoliult avatar Dec 03 '24 14:12 taoliult

in almost all cases, the TestCipherMode test process is picked and killed, while no other test is affected.

no, it didn't sound like that. the machine was even rendered inaccessible, wasn't it? that meant sshd or some other critical process(es) were killed as well.

zl-wang avatar Dec 03 '24 14:12 zl-wang