
Identify which tests seem unstable in docker containers

Open sxa opened this issue 4 years ago • 23 comments

This is partially for my own notes, but it needs to be looked at, and may also be covered elsewhere. Looks like the DDR stuff (not too surprising) will need some work.

Others (on initial look - not too deep!) seem ok

Memo to self - how to check for RAM/CPU limits in a container:

  • CPU: wc -l /sys/fs/cgroup/cpu,cpuacct/cgroup.procs (Not accurate)
  • RAM: expr $(cat /sys/fs/cgroup/memory/memory.limit_in_bytes) / 1024 / 1024 / 1024 (Or divide by 1073741824)
  • Show stats: while true; do clear && uptime && docker stats --no-stream; sleep 60; done
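
A minimal sketch of the checks above that also covers cgroup v2 hosts, where the limits live in memory.max and cpu.max rather than the v1 files. The function names and the optional path arguments are hypothetical, added only so the helpers can be pointed at any file. Note that on cgroup v1 an unlimited container reports a very large number instead of a marker word.

```shell
#!/bin/sh
# Sketch only: helper names and optional path arguments are illustrative.

# Print a byte count in whole GiB (integer division, like the expr form above).
bytes_to_gib() {
    echo $(( $1 / 1073741824 ))
}

# Print the container memory limit in bytes, or "unlimited".
# $1 = cgroup v1 file, $2 = cgroup v2 file (defaults are the usual mount points).
mem_limit_bytes() {
    v1=${1:-/sys/fs/cgroup/memory/memory.limit_in_bytes}
    v2=${2:-/sys/fs/cgroup/memory.max}
    if [ -r "$v1" ]; then
        cat "$v1"
    elif [ -r "$v2" ]; then
        # cgroup v2 writes the literal word "max" when no limit is set
        limit=$(cat "$v2")
        if [ "$limit" = "max" ]; then echo unlimited; else echo "$limit"; fi
    else
        echo unlimited
    fi
}

# Print the CPU cap in cores from a quota and a period in microseconds:
# the two fields of cpu.max on cgroup v2, or cpu.cfs_quota_us and
# cpu.cfs_period_us on cgroup v1 (where a quota of -1 means no cap).
cpu_limit_cores() {
    quota=$1
    period=$2
    if [ "$quota" = "max" ] || [ "$quota" = "-1" ]; then
        echo unlimited
    else
        awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }'
    fi
}
```

On a cgroup v2 system the CPU helper can be fed directly from the file, for example cpu_limit_cores $(cat /sys/fs/cgroup/cpu.max).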

sxa avatar Dec 29 '20 18:12 sxa

NOTE - runs on the Fedora docker image testing after patching and rebooting the server:

sxa avatar Jan 04 '21 18:01 sxa

Also trying on a couple of X64 docker images (Fedora 33 and Ubuntu 20.04)

sxa avatar Jan 04 '21 20:01 sxa

NUMA interrogation is failing in Docker

[EDIT: Issue shows up with just numactl -s in the container. A resolution is to use --cap-add=SYS_NICE, which gives the container access to the CPU scheduling options - see docker docs for details]
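
One way to confirm from inside a container whether the scheduling capability was actually granted is to decode the effective capability mask in /proc/self/status. This is only a sketch: has_cap_bit is a hypothetical helper, and it relies on CAP_SYS_NICE being capability number 23 on Linux.

```shell
#!/bin/sh
# Sketch only: has_cap_bit is an illustrative helper, not part of any tool.

# Succeed if bit $2 is set in the hexadecimal capability mask $1
# (the format of the CapEff line in /proc/<pid>/status).
has_cap_bit() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_SYS_NICE=23   # capability number from linux/capability.h

# Read this process's effective capability mask, if /proc is available.
if [ -r /proc/self/status ]; then
    mask=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
    if [ -n "$mask" ] && has_cap_bit "$mask" "$CAP_SYS_NICE"; then
        echo "CAP_SYS_NICE is present"
    else
        echo "CAP_SYS_NICE is missing; numactl -s may fail as described above"
    fi
fi
```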

sxa avatar Jan 06 '21 12:01 sxa

core dump generation is also failing (I've tried starting the container with various options that might help but to no avail ... so far) ... potentially same as described in https://github.com/AdoptOpenJDK/run-aqa/issues/59

[EDIT: The (host) systems on which core files were not being produced had |/usr/share/apport/apport %p %s %c %d %P %E in /proc/sys/kernel/core_pattern - changing it to core resolves it (but we'll need to make that persistent) - raised https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1817]
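
A sketch of how the host setting described above could be checked before a run. The helper name is hypothetical. A leading | in core_pattern means the kernel pipes each dump to a program (apport in this case) instead of writing a core file.

```shell
#!/bin/sh
# Sketch only: core_pattern_is_piped is an illustrative helper.

# Succeed if the given core_pattern value hands dumps to a program.
core_pattern_is_piped() {
    case $1 in
        '|'*) return 0 ;;
        *)    return 1 ;;
    esac
}

pattern_file=/proc/sys/kernel/core_pattern
if [ -r "$pattern_file" ]; then
    pattern=$(cat "$pattern_file")
    if core_pattern_is_piped "$pattern"; then
        echo "core dumps are piped to a handler: $pattern"
        # To write plain core files instead (as root, on the host):
        #   sysctl -w kernel.core_pattern=core
        # A line "kernel.core_pattern=core" in a file under /etc/sysctl.d/
        # is one common way to keep the setting across reboots, assuming
        # nothing else (such as the apport service) rewrites it at boot.
    else
        echo "core dumps are written using the pattern: $pattern"
    fi
fi
```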

sxa avatar Jan 06 '21 16:01 sxa

Also not specific to docker, but we have seen instances of this when LANG is not set to en_US.UTF-8. It occurs only on OpenJ9 sanity.openjdk on JDK11 and above (not seen on 8 so far)

21:41:41  ACTION: main -- Failed. Execution failed: `main' threw exception: java.util.IllformedLocaleException: Ill-formed language: c.u [at index 0]
21:41:41  REASON: User specified action: run main/othervm -Duser.language.display=ja -Duser.language.format=zh LocaleCategory 
21:41:41  TIME:   8.802 seconds
21:41:41  messages:

This will be progressed via https://github.com/AdoptOpenJDK/run-aqa/issues/59
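
Until that is resolved, a pre-flight check along these lines could flag machines where LANG differs from the value the test expects. This is a sketch with a hypothetical helper name, and it only inspects LANG, as in the report above.

```shell
#!/bin/sh
# Sketch only: lang_is_expected is an illustrative helper.

EXPECTED_LANG=en_US.UTF-8

# Succeed if the given LANG value matches the expected locale.
lang_is_expected() {
    [ "$1" = "$EXPECTED_LANG" ]
}

if lang_is_expected "${LANG:-}"; then
    echo "LANG is $EXPECTED_LANG"
else
    echo "LANG is '${LANG:-unset}', not $EXPECTED_LANG"
    # One option is to set it for the test process only:
    #   export LANG=en_US.UTF-8
    # This assumes the en_US.UTF-8 locale is installed in the image;
    # `locale -a` lists the locales that are available.
fi
```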

sxa avatar Jan 09 '21 12:01 sxa

Ran a Grinder on testc-packet-fedora33-amd-2 and got:

ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress -- https://github.com/AdoptOpenJDK/openjdk-tests.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/AdoptOpenJDK/openjdk-tests.git/': OpenSSL SSL_connect: Connection reset by peer in connection to github.com:443 

https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox/203/console

Suppose testc-packet-fedora33-amd-2 is one docker container?

sophia-guo avatar Jan 14 '21 21:01 sophia-guo

Suppose testc-packet-fedora33-amd-2 is one docker container?

Yes - it's a docker container.

Hmmm that's a bit odd ... It's also nothing to do with the test if it's failing that early in the process. I've re-run it as 205 and it completed without any fatal failures so hopefully that won't recur, but if you see any further instances let me know so we can see if it happens regularly.

sxa avatar Jan 15 '21 16:01 sxa

From https://adoptopenjdk.slack.com/archives/C5219G28G/p1612761729068300, we should check whether the timeouthandler added to openj9 openjdk test runs is able to write a System dump in a dockerized environment.

smlambert avatar Feb 08 '21 13:02 smlambert

I wonder if https://github.com/eclipse/openj9/issues/12038 is another example of failure in docker environments or not. "AssertionError: Free Physical Memory size cannot be greater than total Physical Memory Size."

knn-k avatar Feb 25 '21 02:02 knn-k

I wonder if https://github.com/eclipse/openj9/issues/12038 is another example of failure in docker environments or not. "AssertionError: Free Physical Memory size cannot be greater than total Physical Memory Size."

Hmmm interesting thought. Certainly possible, but this is the first I've heard of it. Some of the containers we have are capped in terms of CPU and RAM, which could explain why you wouldn't necessarily be able to replicate locally without doing the same.

sxa avatar Feb 25 '21 08:02 sxa

sanity.openjdk on JDK 8 (Hotspot) seems to randomly fail for these tests:

java/util/Arrays/TimSortStackSize2.java.TimSortStackSize2
java.lang.OutOfMemoryError: Java heap space
	at TimSortStackSize2.createArray(TimSortStackSize2.java:164)
	at TimSortStackSize2.doTest(TimSortStackSize2.java:59)
	at TimSortStackSize2.main(TimSortStackSize2.java:43)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.lang.Thread.run(Thread.java:748)

java/util/ResourceBundle/Bug4168625Test.java.Bug4168625Test 
14:10:19  ACTION: main -- Error. Agent communication error: java.io.EOFException; check console log for any additional details

java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java.LFSingleThreadCachingTest 
Unexpected exit from test [exit code: 137]

See: https://ci.adoptopenjdk.net/view/Test_upstream/job/Test_openjdk8_hs_sanity.openjdk_x86-64_linux_upstream/75/

Especially LFSingleThreadCachingTest.java looks like an OOM kill. Would be nice to overlay that failure with the kernel OOM kill logs.
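
For reference, exit code 137 is 128 + 9, i.e. the process died from SIGKILL, which is the signal the kernel OOM killer sends. A small sketch of that decoding follows; signal_from_exit is a hypothetical helper, and the log commands in the comments are the usual places to look on the docker host rather than something taken from the failing job.

```shell
#!/bin/sh
# Sketch only: signal_from_exit is an illustrative helper.

# Print the signal number encoded in a shell exit status, or nothing if
# the status does not indicate death by signal (statuses above 128).
signal_from_exit() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    fi
}

status=137
sig=$(signal_from_exit "$status")
if [ "$sig" = "9" ]; then
    echo "exit status $status means SIGKILL (signal $sig)"
    # On the docker host, kernel OOM kills can be looked up with e.g.
    #   dmesg -T | grep -i 'killed process'
    #   journalctl -k | grep -i 'out of memory'
fi
```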

jerboaa avatar Mar 02 '21 10:03 jerboaa

Above error was on test-docker-fedora33-x64-2 hosted on test-packet-ubuntu2004-amd-1. Those systems were all started with 4 cores and 6GB allocated to them. Re-testing at ~https://ci.adoptopenjdk.net/job/Grinder/7350 (Failed but I'm not sure if it's the same failure)~ Correct test from upstream at https://ci.adoptopenjdk.net/job/Grinder/7351

@smlambert In the log Severin referenced above it gives the Grinder re-run link for the individual test as https://ci.adoptopenjdk.net/job/Grinder/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=oracle&BUILD_LIST=openjdk&PLATFORM=x86-64_linux_xl&TARGET=jdk_lang_1 which is clearly wrong as it doesn't reference upstream and the PLATFORM has _xl in it - is that a bug?

EDIT: https://ci.adoptopenjdk.net/job/Grinder/7353/console passed on a real machine (IBMCLOUD RHEL8) but https://ci.adoptopenjdk.net/job/Grinder/7350/console failed on the machine mentioned above (Both jdk_lang_1 target)

sxa avatar Mar 02 '21 14:03 sxa

Potential resource starvation reported by @lumpfish on build-docker-fedora33-armv8-3 in https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/2002 - I see a "docker day" in my near future ... (Will diagnose using jdk_time_1):

06:58:21 TEST RESULT: Error. Program `/home/jenkins/workspace/Test_openjdk16_hs_extended.openjdk_aarch64_linux/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 1006476ms).

sxa avatar Mar 04 '21 11:03 sxa

At the moment at least some docker images hosted on build-packet-ubuntu1804-armv8-1 (U1804b_2223 in particular, this job currently running) and docker-packet-ubuntu2004-amd-1 (U2004_2224 in particular, this job currently running) are using a lot of CPU, so they potentially need to be properly capped. The failures being seen above may well only be occurring on those systems.

When the systems are quiesced tomorrow (since we're running the weekend pipelines for JDK16 again due to https://github.com/AdoptOpenJDK/ci-jenkins-pipelines/pull/87) I can look at adjusting the capping of the tests

Related to @lumpfish's jdk_time_1 failure, I have one pass at https://ci.adoptopenjdk.net/job/Grinder/7515/ on build-docker-ubuntu1804-armv8-2 but all other attempts on the machine failed

sxa avatar Mar 08 '21 16:03 sxa

OK, I've brought the following offline for now while investigations occur, as some of these have shown problems with jdk_time_1 (build-docker-*-armv8-* nodes hosted on build-packet-ubuntu1804-armv8-1 and docker-packet-ubuntu2004-intel-1):

  • fedora33-2 fedora33-3 fedora33-4 fedora33-5 ubuntu1804-2 ubuntu1804-3 ubuntu1804-4 ubuntu1804-5 ubuntu1804-6 ubuntu1804-armv8l-1 (Hosted on build-packet-ubuntu1804-armv8-1)
  • And test-docker-fedora33-x64-3 which has been showing issues too

jdk_time_1 has passed on the alibaba arm node and also on test-docker-fedora-x64-1 (it failed at 7531 though, but at least it's not a recurring problem on all Fedora systems, as it passed at 7506!)

sxa avatar Mar 09 '21 10:03 sxa

sanity.openjdk on JDK 8 (Hotspot) seems to randomly fail for these tests:

java/util/Arrays/TimSortStackSize2.java.TimSortStackSize2
java.lang.OutOfMemoryError: Java heap space
	at TimSortStackSize2.createArray(TimSortStackSize2.java:164)
	at TimSortStackSize2.doTest(TimSortStackSize2.java:59)
	at TimSortStackSize2.main(TimSortStackSize2.java:43)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.lang.Thread.run(Thread.java:748)

java/util/ResourceBundle/Bug4168625Test.java.Bug4168625Test 
14:10:19  ACTION: main -- Error. Agent communication error: java.io.EOFException; check console log for any additional details

java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java.LFSingleThreadCachingTest 
Unexpected exit from test [exit code: 137]

See: https://ci.adoptopenjdk.net/view/Test_upstream/job/Test_openjdk8_hs_sanity.openjdk_x86-64_linux_upstream/75/

Especially LFSingleThreadCachingTest.java looks like an OOM kill. Would be nice to overlay that failure with the kernel OOM kill logs.

This looks to be the same issue that's covered in https://github.com/AdoptOpenJDK/openjdk-tests/issues/2310 and not specific to docker

sxa avatar Mar 09 '21 13:03 sxa

With the merging of https://github.com/AdoptOpenJDK/openjdk-tests/pull/2345 I've brought most systems back online - I've left build-docker-fedora33-armv8-5, build-docker-ubuntu1804-5 and build-docker-ubuntu1804-6 offline

[EDIT: Load on the machine during the nightly testing is sitting at under 16 and there are 64 cores so I have re-enabled these three remaining executors]
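
The load-versus-cores comparison above can be sketched as a ratio, where a value well below 1 suggests the host has headroom. load_per_core is a hypothetical helper. Note that nproc reflects CPU affinity, so inside a container that is capped by a CPU quota it may still report the host's core count.

```shell
#!/bin/sh
# Sketch only: load_per_core is an illustrative helper.

# Print load average divided by core count, to two decimal places.
load_per_core() {
    awk -v l="$1" -v c="$2" 'BEGIN { printf "%.2f\n", l / c }'
}

# Figures from the comment above: load under 16 on a 64-core host.
load_per_core 16 64

# Live values on a Linux host, if the usual sources are available.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min _rest < /proc/loadavg
    load_per_core "$one_min" "$(nproc)"
fi
```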

sxa avatar Mar 10 '21 16:03 sxa

Another one https://github.com/adoptium/adoptium/issues/63#issuecomment-894501202

sophia-guo avatar Aug 09 '21 14:08 sophia-guo

@sophia-guo That looks like the tests have a dependency on the fakeroot tool, which I wasn't aware we required. Can you supply a Grinder re-run link for that problem, as I'm not sure it'll be specific to docker - we do not have fakeroot available on all of our systems at present.

sxa avatar Aug 10 '21 12:08 sxa

Example run in Grinder: https://ci.adoptopenjdk.net/job/Grinder/1203

Rerun in Grinder on same machine link

smlambert avatar Aug 10 '21 13:08 smlambert

@sxa if I log in to the test machine I can run fakeroot, which probably means it is installed by default on Linux. aarch64 has the same issue though, so I will open an issue in infra: https://github.com/adoptium/infrastructure/issues/2291

sophia-guo avatar Aug 10 '21 18:08 sophia-guo

on arm jdk11:

java/beans/PropertyChangeSupport/Test4682386.java.Test4682386
java/beans/XMLEncoder/Test4631471.java.Test4631471
java/beans/XMLEncoder/Test4903007.java.Test4903007
java/beans/XMLEncoder/javax_swing_DefaultCellEditor.java.javax_swing_DefaultCellEditor
java/beans/XMLEncoder/javax_swing_JTree.java.javax_swing_JTree
javax/imageio/plugins/shared/ImageWriterCompressionTest.java.ImageWriterCompressionTest

passed on non-docker and failed on docker ones consistently. https://github.com/adoptium/aqa-tests/issues/2989#issuecomment-947114275

https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_extended.openjdk_arm_linux_testList_2/9/

sophia-guo avatar Oct 20 '21 20:10 sophia-guo

java/beans/PropertyEditor/TestFontClassJava.java.TestFontClassJava
java/beans/PropertyEditor/TestFontClassValue.java.TestFontClassValue
java/beans/XMLEncoder/Test4631471.java.Test4631471
java/beans/XMLEncoder/Test4903007.java.Test4903007
java/beans/XMLEncoder/javax_swing_DefaultCellEditor.java.javax_swing_DefaultCellEditor
java/beans/XMLEncoder/javax_swing_JTree.java.javax_swing_JTree
javax/imageio/plugins/shared/ImageWriterCompressionTest.java.ImageWriterCompressionTest

error message:

Stacktrace
Execution failed: `main' threw exception: java.lang.NullPointerException: Cannot load from short array because "sun.awt.FontConfiguration.head" is null    
Standard Output
Property class: class java.awt.Font
PropertyEditor class: class com.sun.beans.editors.FontEditor
    
Standard Error
java.lang.NullPointerException: Cannot load from short array because "sun.awt.FontConfiguration.head" is null
	at java.desktop/sun.awt.FontConfiguration.getVersion(FontConfiguration.java:1262)
	at java.desktop/sun.awt.FontConfiguration.readFontConfigFile(FontConfiguration.java:224)

https://ci.adoptopenjdk.net/job/Test_openjdk18_hs_extended.openjdk_x86-64_linux_testList_2/26/

#3640

sophia-guo avatar May 19 '22 15:05 sophia-guo