test-azure-win2012r2-x64-2 / test-azure-win2016-x64-1: openj9 SharedClasses.xxx tests fail (Memory issue?)

Open lumpfish opened this issue 4 years ago • 22 comments

The following openj9 shared classes test targets may fail when they land on test-azure-win2012r2-x64-2 or test-azure-win2016-x64-1:

SharedClassesAPI
SharedClasses.SCM01.MultiCL
SharedClasses.SCM01.MultiThread
SharedClasses.SCM01.MultiThreadMultiCL
SharedClasses.SCM23.MultiCL
SharedClasses.SCM23.MultiThread
SharedClasses.SCM23.MultiThreadMultiCL

The symptoms are various out-of-memory errors, e.g.:

11:52:21  MT4 stderr JVMDUMP032I JVM requested Snap dump using 'C:\Users\jenkins\workspace\Grinder\openjdk-tests\TKG\output_1613993216963\SharedClasses.SCM23.MultiThread_1\20210222-114232-SharedClasses\results\Snap.20210222.114552.7872.0004.trc' in response to an event
11:52:21  MT4 stderr JVMDUMP010I Snap dump written to C:\Users\jenkins\workspace\Grinder\openjdk-tests\TKG\output_1613993216963\SharedClasses.SCM23.MultiThread_1\20210222-114232-SharedClasses\results\Snap.20210222.114552.7872.0004.trc
11:52:21  MT4 stderr JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
11:52:21  MT4 stderr Exception in thread "main" java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830, errno 22
11:52:21  MT4 stderr 	at java.lang.Thread.startImpl(Native Method)
11:52:21  MT4 stderr 	at java.lang.Thread.start(Thread.java:993)
11:52:21  MT4 stderr 	at net.openj9.test.sc.LoaderSlaveMultiThread.run(LoaderSlaveMultiThread.java:130)
11:52:21  MT4 stderr 	at net.openj9.test.sc.LoaderSlaveMultiThread.main(LoaderSlaveMultiThread.java:59)
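
The `retVal` in the trace above can be unpacked with a little arithmetic. The sketch below is not taken from this issue: it assumes, based on my recollection of the OMR/OpenJ9 thread library headers, that the value is the negation of an error code OR-ed with a 0x40000000 "OS errno set" flag, and that code 6 means thread creation failed. The constant names are hypothetical labels for those assumptions.

```python
# Assumed constants (recalled from the OMR thread library, not stated in this issue).
ERRNO_SET_FLAG = 0x40000000   # assumed: flag meaning "an OS errno accompanies this code"
THREAD_CREATE_FAILED = 6      # assumed: code meaning "thread creation failed"


def decode_ret_val(ret_val):
    """Split a negative retVal into (error code, errno-set flag)."""
    magnitude = -ret_val
    errno_set = bool(magnitude & ERRNO_SET_FLAG)
    code = magnitude & ~ERRNO_SET_FLAG
    return code, errno_set


code, errno_set = decode_ret_val(-1073741830)
print(code == THREAD_CREATE_FAILED, errno_set)
```

If those assumptions hold, the value points at a failed native thread creation (a native resource limit) rather than at Java heap exhaustion, which would fit the suspicion that the machine itself is short of memory.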

Their Jenkins links show the machines have 4GB RAM:
https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-2/ - Failed
https://ci.adoptopenjdk.net/computer/test-azure-win2016-x64-1/ - Failed

The links for two other machines also show them as having 4GB memory, but the tests pass on those machines:
https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-1/ - Passed
https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-3/ - Passed

lumpfish avatar Feb 22 '21 12:02 lumpfish

Seems likely related to the memory on those machines. Next steps should probably be to verify the swap file settings, whether they can be increased with any effect, and if not we should look to increase the RAM on those systems to 6GB first, then 8GB if that doesn't work.

sxa avatar Feb 22 '21 14:02 sxa

This link will run all the above targets: https://ci.adoptopenjdk.net/job/Grinder/parambuild/?JDK_VERSION=11&JDK_IMPL=openj9&JDK_VENDOR=adoptopenjdk&BUILD_LIST=system&PLATFORM=x86-64_windows_xl&TARGET=testList%20TESTLIST=SharedClassesAPI,SharedClasses.SCM01.MultiCL,SharedClasses.SCM01.MultiThread,SharedClasses.SCM01.MultiThreadMultiCL,SharedClasses.SCM23.MultiCL,SharedClasses.SCM23.MultiThread,SharedClasses.SCM23.MultiThreadMultiCL
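
For anyone adapting that link, the query string can be taken apart to see which parameters it sets and which targets it requests. This is a minimal sketch using only the Python standard library; it assumes nothing about Jenkins beyond the URL shown above, and the variable names are my own.

```python
from urllib.parse import parse_qsl, urlsplit

GRINDER_URL = (
    "https://ci.adoptopenjdk.net/job/Grinder/parambuild/"
    "?JDK_VERSION=11&JDK_IMPL=openj9&JDK_VENDOR=adoptopenjdk"
    "&BUILD_LIST=system&PLATFORM=x86-64_windows_xl"
    "&TARGET=testList%20TESTLIST=SharedClassesAPI,SharedClasses.SCM01.MultiCL,"
    "SharedClasses.SCM01.MultiThread,SharedClasses.SCM01.MultiThreadMultiCL,"
    "SharedClasses.SCM23.MultiCL,SharedClasses.SCM23.MultiThread,"
    "SharedClasses.SCM23.MultiThreadMultiCL"
)

# Each "name=value" pair is split on the first "=", so TARGET keeps its inner "=".
params = dict(parse_qsl(urlsplit(GRINDER_URL).query))

# TARGET decodes to "testList TESTLIST=<comma-separated targets>".
_, test_list = params["TARGET"].split("TESTLIST=", 1)
targets = test_list.split(",")

for name in ("JDK_VERSION", "JDK_IMPL", "PLATFORM"):
    print(name, "=", params[name])
print(len(targets), "targets")
```

Editing the `targets` list and re-joining it with commas is one way to re-run a subset of the failing tests.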

lumpfish avatar Feb 22 '21 15:02 lumpfish

> Seems likely related to the memory on those machines. Next steps should probably be to verify the swap file settings, whether they can be increased with any effect, and if not we should look to increase the RAM on those systems to 6GB first, then 8GB if that doesn't work.

Could also be filehandles.

karianna avatar Feb 23 '21 10:02 karianna

> Could also be filehandles.

What determines available file handles on a per-machine basis? Is that in any way a default set on RAM size or something else?

sxa avatar Feb 23 '21 11:02 sxa

(I've disabled the win2016 system by removing ci.role.test until this can be debugged/diagnosed)

sxa avatar Feb 23 '21 12:02 sxa

> > Could also be filehandles.
>
> What determines available file handles on a per-machine basis? Is that in any way a default set on RAM size or something else?

On Windows? I've actually got no idea.

karianna avatar Feb 23 '21 12:02 karianna

Testing here with swap space increased on test-azure-win2016-x64-1 (assuming it goes live without a reboot). If that doesn't work, I'll increase the RAM to 6GB.

sxa avatar Feb 24 '21 16:02 sxa

Hmmm 2012r2-2 has 16GB of RAM. Running a Grinder on there too to verify

sxa avatar Feb 24 '21 16:02 sxa

So the Grinder on the win2016 box failed, but not with an obvious memory issue - @lumpfish can you check the log of that one to see if it's the same issue you've seen?

The win2012r2 did give an OutOfMemoryError - I have made sure there is up to 12GB of swap and am re-running in this Grinder.

sxa avatar Feb 24 '21 22:02 sxa

The Win2012 machine showed an OutOfMemory error during one of the tests (a different one in each run) in 7231 and 7237. I'm going to restart it, run the same test again while trying to watch the usage live on the machine, and then see how easy it is to increase to 6GB. [EDIT: No I won't, as Azure doesn't have 6GB options, so it'll have to be 8GB, which is almost twice the cost unfortunately. Maybe I'll just shut down the 2012 one and bump the 2016 up to the 8GB B2ms spec.]

sxa avatar Feb 25 '21 09:02 sxa

> So the Grinder on the win2016 box failed, but not with an obvious memory issue - @lumpfish can you check the log of that one to see if it's the same issue you've seen?

That test is similar in that it runs multiple jvms in parallel which share a shared class cache.

The stderr from the failing process (found by downloading the system_test_output.tar.gz file from the failing job (https://ci.adoptopenjdk.net/job/Grinder/7230/) ) contains:

JVMSHRC162E The wait for the creation mutex while opening shared memory has timed out
JVMSHRC662I Error recovery: destroyed semaphore set associated with shared class cache.
JVMSHRC840E Failed to start up the shared cache.
JVMJ9VM015W Initialization error for library j9shr29(11): JVMJ9VM009E J9VMDllMain failed
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

I've not seen (or noticed) that before.

lumpfish avatar Feb 25 '21 10:02 lumpfish

Hmmm https://ci.adoptopenjdk.net/job/Grinder/7260/ ran through without any failure on azure-win2012r2-2 after an earlier reboot.

Although on trying again, this has popped up:

image

Upgrade time then! (FYI @smlambert: looks like Windows tests can't complete on a 4GB Windows system)

sxa avatar Feb 25 '21 12:02 sxa

I've shut the Windows 2012 machine down (it's also more expensive than the new ones I've set up, so shutting it down isn't a bad idea). I'm re-running a Grinder on the 2016 machine (7268) since the previous one passed, and I'll look at bumping it up to 8GB if it fails (it will still be cheaper than the Win2012 one). [EDIT: 7268 passed - running again on the 4GB Win2016 box at 7277 and 7278.]

Side note: I'm also running a grinder on one of the larger 2012 boxes at 7269 - mostly because I'm curious as to whether there are any performance differences on that one (But I suspect on the system test suites it won't make much difference)

sxa avatar Feb 25 '21 12:02 sxa

7277 failed a test but did not throw a visible OutOfMemory error, so it is inconclusive.

sxa avatar Feb 25 '21 16:02 sxa

7277 failed with the same mutex wait error:

JVMSHRC162E The wait for the creation mutex while opening shared memory has timed out
JVMSHRC662I Error recovery: destroyed semaphore set associated with shared class cache.
JVMSHRC840E Failed to start up the shared cache.
JVMJ9VM015W Initialization error for library j9shr29(11): JVMJ9VM009E J9VMDllMain failed
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
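
Since the runs in this thread fail in two different ways (a thread-creation OutOfMemoryError and the shared-cache startup failure above), it can help to sort stderr dumps automatically. The sketch below is a hypothetical helper, not part of any test framework: it only looks for the message IDs and exception name that appear in the logs quoted in this issue.

```python
import re

# Message IDs copied from the stderr output quoted in this thread.
SHARED_CACHE_IDS = {"JVMSHRC162E", "JVMSHRC840E"}

# JVM message IDs here look like "JVM" + 4-character component + 3 digits + severity.
MESSAGE_ID = re.compile(r"\bJVM[A-Z0-9]{4}\d{3}[EWI]\b")


def classify(stderr_text):
    """Return 'shared-cache', 'out-of-memory', or 'unknown' for a stderr dump."""
    ids = set(MESSAGE_ID.findall(stderr_text))
    if ids & SHARED_CACHE_IDS:
        return "shared-cache"
    if ("java.lang.OutOfMemoryError" in stderr_text
            or "java/lang/OutOfMemoryError" in stderr_text):
        return "out-of-memory"
    return "unknown"


sample = (
    "JVMSHRC162E The wait for the creation mutex while opening shared memory has timed out\n"
    "JVMSHRC840E Failed to start up the shared cache.\n"
)
print(classify(sample))
```

The shared-cache check runs first as a design choice, so a log containing both kinds of message is reported as a cache failure; swapping the order is equally defensible.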

lumpfish avatar Feb 25 '21 17:02 lumpfish

Despite the above tests being inconclusive due to the failure on shared class setup, I'm going to go ahead with

Converted test-azure-win2016-x64-1 from B2s (left) to B2ms (right). Back online with ci.role.test label and queued up two Grinders 7288 and 7299 - hopefully that will resolve the OutOfMemoryErrors if not the class cache issue.

image

sxa avatar Feb 25 '21 19:02 sxa

I'm going to deprovision https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-2/ (test-2012r2-2 on the azure portal) - we can recreate it if required in the future but it's unfit for purpose in its current state and cannot easily be converted to a cost effective larger system.

sxa avatar Mar 01 '21 12:03 sxa

7288 failed but https://ci.adoptopenjdk.net/job/Grinder/7301/ succeeded - @lumpfish can you take a look at 7288 and let me know if you're concerned about the failure (in terms of whether it could still be a machine specific one-off)

sxa avatar Mar 01 '21 15:03 sxa

7288 (https://ci.adoptopenjdk.net/job/Grinder/7288/console) looks like it failed with a Jenkins connect issue?

lumpfish avatar Mar 02 '21 13:03 lumpfish

Updated links to re-run:

sxa avatar Jun 30 '21 17:06 sxa

Re-runs:

sxa avatar Feb 06 '23 13:02 sxa

We don't run impl=openj9 tests in Adoptium, so can win2016 be enabled?

sophia-guo avatar Apr 15 '24 14:04 sophia-guo

Closing as this is OpenJ9 specific and was failing on two machines that have been decommissioned

sxa avatar Nov 05 '24 15:11 sxa