2 new AIX73 build machines
Two P10 LPARs have been created by the IBM team:
p10-aix-adopt03.osuosl.org 140.211.9.21
p10-aix-adopt04.osuosl.org 140.211.9.66
I have set up the machines using the Ansible playbooks; however, there are still some bits left to do.
https://github.com/adoptium/infrastructure/blob/4a5620117cd586b8194f0c050e754a500fc7c98c/ansible/playbooks/AdoptOpenJDK_AIX_Playbook/roles/dnf/tasks/main.yml#L126
- name: Install cmake 3.14.3 (See https://github.com/AdoptOpenJDK/openjdk-build/issues/2492)
  dnf:
    name: cmake-3.14.3
    state: present
    update_cache: yes
    disable_excludes: all
  tags:
    - rpm_install
    - cmake
I wasn't able to install CMake using the above task:
root@p10-aix-adopt03:[/root]dnf install cmake-3.14.3
Last metadata expiration check: 4:21:41 ago on Fri Apr 4 12:34:39 2025.
No match for argument: cmake-3.14.3
Error: Unable to find a match: cmake-3.14.3
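As a diagnostic, it might be worth checking which cmake versions the configured AIX Toolbox repositories actually provide before pinning a version in the playbook. A hedged sketch using standard dnf options (not something that was run on these machines):

```sh
# List every cmake version the enabled repositories offer, including older ones.
dnf --showduplicates list cmake
# Or query the full name-version-release strings of the available packages.
dnf repoquery --queryformat '%{name}-%{version}-%{release}' cmake
```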
Both the v13 and v16 XL compilers installed fine, but they give an error which suggests they are not supported on AIX 7.3:
root@p10-aix-adopt03:[/root]/opt/IBM/xlC/13.1.3/bin/xlc -qversion
/opt/IBM/xlC/13.1.3/bin/.orig/xlc: 1501-287 (S) This compiler does not support AIX 7.3. Please check with IBM (http://www-01.ibm.com/support/docview.wss?rs=43&uid=swg21326972) to see if there is a PTF for this compiler that supports this AIX level.
root@p10-aix-adopt03:[/root]/opt/IBM/xlC/16.1.0/bin/xlc -qversion
/opt/IBM/xlC/16.1.0/bin/.orig/xlc: 1501-287 (S) This compiler does not support AIX 7.3. Please check with IBM (http://www-01.ibm.com/support/docview.wss?rs=43&uid=swg21326972) to see if there is a PTF for this compiler that supports this AIX level.
Had a bit of an error with the rbac role
- name: Create auth ojdk.rtclk
  when: rtclk_exists.rc == 2
  shell:
    mkrole authorizations='ojdk.rtclk,ojdk.proccore' dfltmsg='Adoptium Role for testing' ojdk.rtclk
  register: _rtclk
  failed_when: _rtclk.rc != 0 and _rtclk.rc != 17
  tags: rbac

- name: Create auth ojdk.proccore
  when: rtcore_exists.rc == 2
  shell:
    mkauth dfltmsg="PV_PROC_CORE for to allow process core dumps" ojdk.proccore
  register: _rtcore
  failed_when: _rtcore.rc != 0 and _rtcore.rc != 17
  tags: rbac
These tasks did not work as written. The command from the second task, `mkauth dfltmsg="PV_PROC_CORE for to allow process core dumps" ojdk.proccore`, needed to be run before the first, and in the first task's command, `mkrole authorizations='ojdk.rtclk,ojdk.proccore' dfltmsg='Adoptium Role for testing' ojdk.rtclk`, I needed to remove `ojdk.rtclk` from `authorizations` for the command to work.
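A minimal sketch of the ordering that worked, based purely on the description above (same auth/role names as the playbook; setkst, which refreshes the kernel security tables, is taken from the working sequence later in this issue):

```sh
# Create the authorization first, then the role that references it; the role's own
# name has been removed from the authorizations list, which is what let mkrole succeed.
mkauth dfltmsg="PV_PROC_CORE for to allow process core dumps" ojdk.proccore
mkrole authorizations='ojdk.proccore' dfltmsg='Adoptium Role for testing' ojdk.rtclk
setkst  # reload the kernel security tables so the new entries take effect
```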
We're seeing this memory error when the machines try to connect to Jenkins:
Expanded the channel window size to 4MB
[04/04/25 19:00:26] [SSH] Starting agent process: cd "/home/jenkins" && /usr/java17_64/bin/java -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 1048576 bytes. Error detail: AllocateHeap
# An error report file with more information is saved as:
# /home/jenkins/hs_err_pid35127552.log
Agent JVM has terminated. Exit code=1
There's certainly memory available:
│ Physical PageSpace | pages/sec In Out | FileSystemCache │
│% Used 27.9% 1.3% | to Paging Space 0.0 0.0 | (numperm) 14.0% │
│% Free 72.1% 98.7% | to File System 0.0 0.0 | Process 5.4% │
│MB Used 11444.2MB 27.3MB | Page Scans 0.0 | System 8.6% │
│MB Free 29515.8MB 2020.7MB | Page Cycles 0.0 | Free 72.1% │
│Total(MB) 40960.0MB 2048.0MB | Page Steals 0.0 | ------
@Haroon-Khel Here are the details of some more POWER10 AIX boxes that we've been allocated:
sxa:.ssh$ host p10-aix-adopt03.osuosl.org
p10-aix-adopt03.osuosl.org has address 140.211.9.21
p10-aix-adopt03.osuosl.org has IPv6 address 2605:bc80:3010:104::8cd3:915
sxa:.ssh$ host p10-aix-adopt04.osuosl.org
p10-aix-adopt04.osuosl.org has address 140.211.9.66
p10-aix-adopt04.osuosl.org has IPv6 address 2605:bc80:3010:104::8cd3:942
~~I'll need to look at what credentials have been put on them since I can't seem to log directly into them at the moment.~~
EDIT: HK/SF keys have now been added to those two
Both are the same spec and with AIX 7.3:
sxa:.ssh$ ssh [email protected] "oslevel -s; lparstat -i |egrep 'Online Memory|Virtual CPU'"
7300-00-04-2320
Online Virtual CPUs : 24
Maximum Virtual CPUs : 24
Minimum Virtual CPUs : 12
Online Memory : 40960 MB
Desired Virtual CPUs : 24
sxa:.ssh$
Managed to get both machines up and running in Jenkins. The trick was to modify the ulimit settings of the jenkins user to this:
jenkins:
        fsize = -1
        core = -1
        cpu = -1
        data = 1048576
        rss = 524288
        stack = 8388608
        nofiles = -1
Copied it from a working AIX 7.2 build machine.
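For reference, a hedged sketch of applying the same stanza without hand-editing /etc/security/limits (chuser writes the same attributes; the size values are in 512-byte blocks as in the limits file, and -1 means unlimited):

```sh
# Set the jenkins user's limits in one shot, using the values shown above.
chuser fsize=-1 core=-1 cpu=-1 data=1048576 rss=524288 stack=8388608 nofiles=-1 jenkins
# New limits only apply to fresh logins, so the agent needs a disconnect/reconnect to pick them up.
```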
Update: due to changing requirements the 2 build machines are now test machines:
- https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-2 (formerly build-osuosl-aix73-ppc64-1)
- https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-3 (formerly build-osuosl-aix73-ppc64-2)
Was seeing git issues and memory issues when running grinders, such as
12:22:38 > git config remote.origin.url https://github.com/adoptium/aqa-tests.git # timeout=10
12:22:38 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
12:22:52 ERROR: Checkout failed
12:22:52 java.io.StreamCorruptedException: invalid stream header: 38372E33
12:22:52 at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:958)
12:22:52 at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:392)
Oct 09, 2025 11:28:46 AM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel channel.
java.util.concurrent.TimeoutException: Ping started at 1760009086035 hasn't completed by 1760009326035
at hudson.remoting.PingThread.ping(PingThread.java:135)
at hudson.remoting.PingThread.run(PingThread.java:87)
ERROR: Connection terminated
java.io.StreamCorruptedException: invalid stream header: 38312E30
at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:958)
at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:392)
at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:50)
at hudson.remoting.Command.readFrom(Command.java:141)
at hudson.remoting.Command.readFrom(Command.java:127)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:62)
Agent JVM has not reported exit code. Is it still running?
[10/09/25 13:29:55] [SSH] Connection closed.
But I've fixed them by adding `-Xmx1048m` to the JVM options and `export LDR_CNTRL=MAXDATA=0x80000000 &&` to the agent start command prefix, both in the Jenkins node configuration of the two nodes.
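For illustration, a hedged sketch of what the agent launch effectively becomes with that prefix in place (paths and jar name taken from the SSH launch log above; the exact command line is assembled by the Jenkins SSH launcher, so this is approximate):

```sh
# LDR_CNTRL=MAXDATA raises the maximum data segment the AIX loader gives the JVM,
# and -Xmx caps the Java heap within it.
export LDR_CNTRL=MAXDATA=0x80000000 && \
  cd "/home/jenkins" && \
  /usr/java17_64/bin/java -Xmx1048m -jar remoting.jar -workDir /home/jenkins
```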
AQA test pipelines running on the three AIX 7.3 nodes, JDK 21:
- https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-1 - https://ci.adoptium.net/job/AQA_Test_Pipeline/513/console
- https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-2 - https://ci.adoptium.net/job/AQA_Test_Pipeline/514/console
- https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-3 - https://ci.adoptium.net/job/AQA_Test_Pipeline/515/console
test-osuosl-aix73-ppc64-1
sanity perf
renaissance-naive-bayes_0
Rerunning with ea image https://ci.adoptium.net/job/Grinder/15029/console Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there https://ci.adoptium.net/job/Grinder/15033/console ✅ UPDATE: no perf tests on 73 machines (temp)
sanity openjdk: quite a few failures; going to rerun with an ea image https://ci.adoptium.net/job/Grinder/15027/console
Still some failures. I've added the hostname of the node to /etc/hosts, and sun/security/krb5/auto/NoAddresses now passes https://ci.adoptium.net/job/Grinder/15041/testReport/
UPDATE: 50 failures down to 34 https://ci.adoptium.net/job/Grinder/15049/testReport/ - possibly due to the hostname change
Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there https://ci.adoptium.net/job/Grinder/15035/console
extended perf
renaissance-als_0
renaissance-chi-square_0
renaissance-dec-tree_0
renaissance-gauss-mix_0
renaissance-log-regression_0
renaissance-movie-lens_0
Rerunning with ea image https://ci.adoptium.net/job/Grinder/15030/console Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there https://ci.adoptium.net/job/Grinder/15034/console UPDATE: no perf tests on 73 machines (temp)
sanity system
MauveSingleThrdLoad_HS_5m_0
MauveSingleThrdLoad_HS_5m_1
MauveSingleInvocLoad_HS_5m_0
MauveSingleInvocLoad_HS_5m_1
MauveMultiThrdLoad_5m_0
MauveMultiThrdLoad_5m_1
Rerunning with ea image https://ci.adoptium.net/job/Grinder/15031/console Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there https://ci.adoptium.net/job/Grinder/15036/console
Update: passed on 73-1 using the v1.0.8-release branch https://ci.adoptium.net/job/Grinder/15070/ ✅
extended openjdk: a lot of failures; will rerun with an ea build https://ci.adoptium.net/job/Grinder/15032/console Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there https://ci.adoptium.net/job/Grinder/15037/console
Seeing memory issues on -2
08:06:20 Uncaught error from thread [[3.814s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (11=EAGAIN) for attributes: stacksize: 2112k, guardsize: 0k, detached.
08:06:20 UCT-akka.actor.default-dispatcher-9[3.814s][warning][os,thread] Number of threads approx. running in the VM: 617
08:06:20 [3.814s][warning][os,thread] Checking JVM parameter MaxExpectedDataSegmentSize (currently 8388608k) might be helpful
08:06:20 ]: unable to create native thread: possibly out of memory or process/resource limits reached, shutting down ActorSystem[[3.815s][warning][os,thread] Failed to start the native thread for java.lang.Thread "UCT-akka.actor.default-dispatcher-20"
08:06:20 UCT]
08:06:20 [3.815s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (11=EAGAIN) for attributes: stacksize: 2112k, guardsize: 0k, detached.
08:06:20 [3.815s][warning][os,thread] Number of threads approx. running in the VM: 617
08:06:20 [3.815s][warning][os,thread] Checking JVM parameter MaxExpectedDataSegmentSize (currently 8388608k) might be helpful
08:06:20 [3.815s][warning][os,thread] Failed to start the native thread for java.lang.Thread "UCT-akka.actor.default-dispatcher-21"
08:06:20 java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
08:06:20 at java.base/java.lang.Thread.start0(Native Method)
08:06:20 at java.base/java.lang.Thread.start(Thread.java:1553)
08:06:20 at java.base/java.lang.System$2.start(System.java:2577)
08:06:20 at java.base/jdk.internal.vm.SharedThreadContainer.start(SharedThreadContainer.java:152)
08:06:20 at java.base/java.util.concurrent.ForkJoinPool.createWorker(ForkJoinPool.java:1575)
Possibly a limit on the number of processes is causing this to fail.
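A hedged way to check the limits that gate native thread creation on AIX (standard commands, nothing specific to these machines):

```sh
# System-wide per-user process limit.
lsattr -El sys0 -a maxuproc
# Effective resource limits for the jenkins user; the data and stack limits also
# constrain how many native threads the JVM can create before pthread_create fails.
su - jenkins -c "ulimit -a"
```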
Same thing on -3, extended perf
20:27:30
20:27:30 java.lang.OutOfMemoryError: Unable to allocate 1048576 bytes
20:27:30 at java.base/jdk.internal.misc.Unsafe.allocateMemory(Unsafe.java:632)
20:27:30 at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:115)
20:27:30 at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:360)
I've tried a few things with the memory/process management on -2 and -3 but have got nowhere. Among the things I've tried: adding `nproc = 24` to /etc/security/limits, copying the /etc/security/limits contents of a working machine onto -2, and various other ulimit tweaks.
In the meantime I am simply not going to run perf tests on -2 and -3.
AQA pipelines without perf tests:
- https://ci.adoptium.net/job/AQA_Test_Pipeline/516/console (test-osuosl-aix73-ppc64-2)
- https://ci.adoptium.net/job/AQA_Test_Pipeline/517/console (test-osuosl-aix73-ppc64-3)
The renaissance-naive-bayes_0 Grinder failure is because it can't resolve the output of `hostname`, by the look of it - it will likely need an entry in /etc/hosts:
17:47:10 Caused by: java.net.UnknownHostException: adopt01: Hostname and service name not provided or found
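A quick hedged check (generic commands) to confirm whether the node's own short hostname resolves before and after adding the /etc/hosts entry:

```sh
# If this fails, the short name reported by `hostname` needs a mapping in /etc/hosts
# (or DNS) so lookups of the machine's own name succeed.
host "$(hostname)"
```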
Log from failing sanity system test on 73-1
18:00:45 LT 17:00:44.510 - First failure detected by thread: load-15. Not creating dumps as no dump generation is requested for this load test
I'm sure core dumps are enabled for the jenkins user:
jenkins@adopt01:[/home/jenkins]ulimit -a
...
coredump(blocks) unlimited
and
jenkins@adopt01:[/home/jenkins]lsattr -l sys0 -a fullcore -E
fullcore true Enable full CORE dump True
EDIT: this might be related to the fact that the rbac role didn't run properly, see top comment https://github.com/adoptium/infrastructure/issues/3920#issue-2972960320
> I'm sure core dumps are enabled for the jenkins user
Suggest logging in and running something like this to test if it's able to create the dumps without java getting in the way:
sleep 10 &
kill -SEGV %1
This should cause the sleep process, which is kicked off in the background, to think it's had a segmentation fault and therefore dump core:
[1]+ Segmentation fault (core dumped) sleep 10
jenkins@adopt01:[/home/jenkins]ls -la
Core file exists
-rw-r--r-- 1 jenkins staff 1009320 Oct 16 11:47 core
I think I understand the rbac role a little more:
We want to create an ojdk role which has the authorisations ojdk.rtclk and ojdk.proccore assigned to it. And then, with

setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore \
  innateprivs=PV_PROC_RTCLK,PV_PROC_CORE \
  inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE \
  secflags=FSF_EPS \
  "{{ rbac_cmd }}"

we add these auths to the rbac_cmd commands listed in https://github.com/adoptium/infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_AIX_Playbook/roles/rbac/defaults/main.yml:
/usr/bin/ksh
/opt/freeware/bin/bash_32
/opt/freeware/bin/bash_64
However, although the ojdk role is created, it is not itself referenced anywhere else in the role; only the authorisations ojdk.rtclk and ojdk.proccore are referenced, and even then I am suspicious of whether these authorisations actually do anything:
root@adopt01:[/root]lsauth ALL | grep ojdk
ojdk.proccore id=10022 dfltmsg=PV_PROC_CORE for to allow process core dumps
ojdk.rtclk id=10023 dfltmsg=Adoptium Role for testing
The authorisations themselves only have a message attribute, dfltmsg, associated with them and nothing else. These auths are referenced again in /etc/security/privcmds, due to the setsecattr command above:
/usr/bin/ksh:
accessauths = ojdk.rtclk,ojdk.proccore
innateprivs = PV_PROC_RTCLK,PV_PROC_CORE
inheritprivs = PV_PROC_RTCLK,PV_PROC_CORE
secflags = FSF_EPS
/opt/freeware/bin/bash_32:
accessauths = ojdk.rtclk,ojdk.proccore
innateprivs = PV_PROC_RTCLK,PV_PROC_CORE
inheritprivs = PV_PROC_RTCLK,PV_PROC_CORE
secflags = FSF_EPS
/opt/freeware/bin/bash_64:
accessauths = ojdk.rtclk,ojdk.proccore
innateprivs = PV_PROC_RTCLK,PV_PROC_CORE
inheritprivs = PV_PROC_RTCLK,PV_PROC_CORE
secflags = FSF_EPS
tl;dr: I think the only thing we need in the rbac role is
setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore \
  innateprivs=PV_PROC_RTCLK,PV_PROC_CORE \
  inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE \
  secflags=FSF_EPS \
  "{{ rbac_cmd }}"
but perhaps without the accessauths=ojdk.rtclk,ojdk.proccore part
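For whoever picks this up, a hedged verification sketch (standard AIX RBAC commands, not run here) to confirm what actually ends up attached to the wrapped shells after setkst:

```sh
# Show the security attributes recorded for one of the wrapped shells in /etc/security/privcmds.
lssecattr -c /opt/freeware/bin/bash_64
# Confirm the command entry has been loaded into the kernel security tables.
lskst -t cmd /opt/freeware/bin/bash_64
```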
Ping @aixtools, in case you're able to give any info on the comment above.
Seeing more memory problems on the three AIX 7.3 machines. Tried cloning the jdk21u repo to run some tests locally:
jenkins@adopt01:[/home/jenkins/sanity_system]git clone https://github.com/adoptium/jdk21u.git
Cloning into 'jdk21u'...
remote: Enumerating objects: 1387850, done.
remote: Counting objects: 100% (4741/4741), done.
remote: Compressing objects: 100% (1571/1571), done.
remote: Total 1387850 (delta 3437), reused 3481 (delta 3117), pack-reused 1383109 (from 4)
Receiving objects: 100% (1387850/1387850), 1.11 GiB | 30.85 MiB/s, done.
Resolving deltas: 24% (246260/1026083)
fatal: inflateInit: out of memory (no message)
fatal: fetch-pack: invalid index-pack output
The ulimit settings are the same as those of a working machine.
re sanity system tests on 73-1 https://github.com/adoptium/infrastructure/issues/3920#issuecomment-3410337532
I managed to get the sanity system tests to pass on 73-1 using modified commands from the rbac role.
I first created an ojdk role:
mkrole dfltmsg="Top-Level authorization for AdoptOpenJava Project" ojdk
Then created the two authorisations and updated the kernel security tables:
mkauth dfltmsg="PV_PROC_CORE for to allow process core dumps" ojdk.proccore
mkauth dfltmsg="Adoptium Role for testing" ojdk.rtclk
setkst
Then I ran the commands which give authority to the three shells:
root@p10-aix-adopt03:[/root]setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore innateprivs=PV_PROC_RTCLK,PV_PROC_CORE inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE secflags=FSF_EPS /usr/bin/ksh
root@p10-aix-adopt03:[/root]setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore innateprivs=PV_PROC_RTCLK,PV_PROC_CORE inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE secflags=FSF_EPS /opt/freeware/bin/bash_32
root@p10-aix-adopt03:[/root]setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore innateprivs=PV_PROC_RTCLK,PV_PROC_CORE inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE secflags=FSF_EPS /opt/freeware/bin/bash_64
root@p10-aix-adopt03:[/root]setkst
I think this allowed the sanity tests to pass on 73-1 because the failure there was caused by limited authority to create core dumps. The same solution is not working on 73-2, I think because the failure there is memory related.
The rbac role in the playbooks should be updated based on these changes, but I would like to get some more clarification on what exactly the commands do before making the update.
I'm probably going to take a break from this issue now :)
@Haroon-Khel ulimit -d was set to 262144 by default on the two failing systems. adopt04 (140.211.9.66) had it set to 524288 and that's what makes the git clone work currently.
adopt01 (test-1) passes extended.perf with the LDR_CNTRL variable set. adopt04 (test-3) passes all of extended.perf except renaissance-finagle-http_0, which gives [error occurred during error reporting (), id 0xe0000001, Out of Memory Error (src/hotspot/share/memory/arena.cpp:168)] even with ulimit -d at the 524288 value.
It appears to run ok when run as the root user instead of jenkins though, which would indicate that the failure is due to the users/role settings somewhere. Noting that the machine has 40GiB of RAM available so it should not be a true memory limitation.
[EDIT: If the root account is restricted to ulimit -d 524288 instead of unlimited it fails there too, which is surprising since it passes with that value for jenkins on the test-1 machine]
renaissance-finagle-http_0 seems to require LDR_CNTRL=MAXDATA=0xB0000000 (note the B in there, rather than the 8, which isn't adequate). The need to increase that value has been seen elsewhere.
The AIX 7.2 test machines all have it set to 0xA0000000 in the jenkins agent config (Build machines have 0x80000000) so that may be adequate.
The 7.3 ones are configured with an agent start prefix of `export LDR_CNTRL=MAXDATA=0x80000000 &&` (as noted earlier), which I will now update to 0xA0000000.
Noting that test-osuosl-aix73-ppc64-1 seems to work ok despite the fact it was set to 0x80000000
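A hedged way (generic AIX ps usage) to confirm which LDR_CNTRL value a running agent actually inherited after the node config change and a reconnect:

```sh
# Find the remoting.jar agent process and print its environment; the BSD-style
# "eww" options make AIX ps show the environment for the given PID.
pid=$(ps -ef | grep '[r]emoting.jar' | awk '{print $2}' | head -1)
ps eww "$pid" | tr ' ' '\n' | grep '^LDR_CNTRL='
```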
Re-runs after a disconnect/reconnect cycle on the agent (except -1 as it was running ok):
| Machine | extended.perf | sanity.system |
|---|---|---|
| test-aix73-1 | Grinder 15146 | Grinder 15149 |
| test-aix73-2 | Grinder 15147 | Grinder 15153 |
| test-aix73-3 | Grinder 15148 | Grinder 15151 |
extended.perf all passed. sanity.system failed the same set of mauve tests on each run:
19:53:59 FAILED test targets:
19:53:59 MauveSingleThrdLoad_HS_5m_0
19:53:59 MauveSingleThrdLoad_HS_5m_1
19:53:59 MauveSingleInvocLoad_HS_5m_0
19:53:59 MauveSingleInvocLoad_HS_5m_1
19:53:59 MauveMultiThrdLoad_5m_0
19:53:59 MauveMultiThrdLoad_5m_1
Re-grinding those targets on aix72-5 and aix72-4 - both passed
Failing log:
19:27:30 LT 18:27:29.283 - 4127 Mauve[gnu.testlet.javax.swing.AbstractAction.clone] Weighting=1
19:27:30 LT 18:27:29.283 - 4603 Mauve[gnu.testlet.javax.xml.xpath.XPath] Weighting=1
19:27:30 LT 18:27:29.306 - Starting thread. Suite=0 thread=0
19:27:30 LT 18:27:29.735 - First failure detected by thread: load-0. Not creating dumps as no dump generation is requested for this load test
19:27:30 LT 18:27:29.740 - Test failed
19:27:30 LT Failure num. = 1
19:27:30 LT Test number = 1040
19:27:30 LT Test details = 'Mauve[gnu.testlet.java.lang.Boolean.classInfo.getAnnotation]'
19:27:30 LT Suite number = 0
19:27:30 LT Thread number = 0
19:27:30 LT >>> Captured test output >>>
19:27:30 LT Test failed:
19:27:30 LT java.lang.ClassNotFoundException: gnu.testlet.java.lang.Boolean.classInfo.getAnnotation
19:27:30 LT at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
19:27:30 LT at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
19:27:30 LT at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
19:27:30 LT at java.base/java.lang.Class.forName0(Native Method)
19:27:30 LT at java.base/java.lang.Class.forName(Class.java:421)
19:27:30 LT at java.base/java.lang.Class.forName(Class.java:412)
19:27:30 LT at net.adoptopenjdk.loadTest.adaptors.MauveAdaptor.executeTest(MauveAdaptor.java:51)
19:27:30 LT at net.adoptopenjdk.loadTest.LoadTestRunner$2.run(LoadTestRunner.java:182)
19:27:30 LT at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
19:27:30 LT at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
19:27:30 LT at java.base/java.lang.Thread.run(Thread.java:1583)
19:27:30 LT <<<
The above is showing two messages. One seems to be related to the known missing classes in mauve.jar although it's unclear why that's only throwing an error on the new machines...
> 19:27:30 LT 18:27:29.735 - First failure detected by thread: load-0. Not creating dumps as no dump generation is requested for this load test
Maybe the old one has been cached across invocations on the other machines (edit: yes, it's cached in /home/jenkins/externalDependency/system_lib/mauve/mauve.jar; the version in /home/jenkins/testDependency/system_lib/mauve/mauve.jar doesn't seem to affect things). New grinders on -3 are passing the tests.
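If the stale copy needs clearing on the other machines, a hedged sketch (path taken from the comment above; this assumes the test job re-downloads the external dependency when it is missing):

```sh
# Remove the cached mauve.jar so the next run fetches a current copy.
rm -f /home/jenkins/externalDependency/system_lib/mauve/mauve.jar
```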
@Haroon-Khel With extended.perf and sanity.system now passing does that cover everything that was outstanding on these machines or were there still other failures or configuration changes that need to be made on the machines?
Re-running AQA_Test_Pipeline on the latest JDK25 ea build which is showing as relatively clean on the October release build:
- ~~aix73-1 (some re-runs)~~ aix73-1 full re-run - jtreg version issue (31/Oct - should be fixed by https://github.com/adoptium/TKG/pull/746 and https://github.com/adoptium/TKG/pull/757) - re-running with the v1.0.10-release branch - Grinder s.o / Grinder e.o. Label issue means these were re-run: sanity.perf ✔️ extended.perf ✔️ <<< currently stuck behind a sanity.openjdk Grinder; 4 failures (jdk_custom re-run link / hotspot_custom re-run link)
- aix73-2 (some reruns): sanity.perf OOM, extended.system failed `DBBLoadTest_5m` (`AllocateHeap` error). extended.openjdk (and sanity) used jtreg 7.5.1 instead of 8+2 or above so failed. (Grinder e.o - aborted, re-run, Grinder s.o with v1.0.10-release branches) - similar failures to -1 but with the addition of hotspot_custom's compiler/escapeAnalysis/TestFindInstMemRecursion.java.TestFindInstMemRecursion
- aix73-3: all failed with `Dependent module /usr/jdk-21.0.6+7/../lib/libc++.a(shr2_64.o) could not be loaded.` Cannot run jdk25
- aix73-3 JDK21: extended.system `DBBLoadTest_5m` failed; one openjdk failure - regrinding
- aix73-3 JDK8: failed `DBBLoadTest_5m` in extended.system, failed 166 in extended.openjdk (mostly GUI/XVFB related - `1362-029 The X11.vfb fileset is not installed.`)
I just tried the short list of openjdk failures on the AIX 7.2 box with nightly, releases, aqa master and branches, and they all failed trying to load a library from the system jdk17, despite 25 being the one under test. See Grinders 15336 through 15339. Not sure what's going on there...
DBBLoadTest_5m_0 seems to be consistently failing on the AIX 7.3 machines -2 and -3 but not -1. On AIX 7.2 it passes on both -2 and -5. Errors include `11:13:52 DBLT java.lang.OutOfMemoryError: null`. Running `make _DBBLoadTest_5m_0` manually:
- `java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached` - add `export LDR_CNTRL=MAXDATA=0xB0000000` to resolve (see the sketch after this list)
- `java.lang.OutOfMemoryError: Unable to allocate 10000000 bytes` - try running as `root`
- `java.net.ConnectException: Connection timed out`
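A hedged sketch of the manual run described above with the workaround applied (the make target name comes from the comment; running it from the aqa-tests/TKG checkout is an assumption here):

```sh
# Give the JVM a larger data segment before launching the test target manually.
export LDR_CNTRL=MAXDATA=0xB0000000
make _DBBLoadTest_5m_0
```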
I then tried running as root with the hostname mapped to 127.0.0.1 in /etc/hosts and got the same failure. Reverted it and it started passing. So we potentially have an intermittent connection issue showing up in this test, but we still need to understand the memory issue so we can run it as the jenkins user.
@Haroon-Khel Can you take a look at the remaining issues please since I think we're down to setup related differences in most cases now:
- What might be different between -1 and the other two machines which is causing the failure of `DBBLoadTest_5m_0`? I note that on -1 `ulimit -m unlimited` works (but isn't the default for `jenkins`) but not on -3 where I'm getting the failures, although it does pass when run as root, which has the same value of `memory(kbytes) 262144`
- Ensure we have the X11.vfb package on aix73-3 as that seems to be missing and causing failures (see the sketch after this list)
- We need to understand the small number of failures we're seeing in the openjdk suites ... Re-runs of jdk and hotspot on aix73-1. AIX 7.2 versions will be at jdk / hotspot
- aix73-3 cannot run jdk25. Possible missing openxl17 compiler runtime?
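A hedged sketch for the X11.vfb item above (standard AIX fileset tooling; the install source path is a placeholder, since X11.vfb normally comes from the AIX base media rather than the dnf repositories):

```sh
# Check whether the virtual frame buffer fileset is present, and install it if not.
lslpp -l X11.vfb || installp -agXY -d /path/to/aix/base/media X11.vfb
```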
Also tagging @suchismith1993 as an FYI as this is related to the new AIX 7.3 machines that we have.
Not sure why but this job on aix73-3 failed in a similar way to when you try to run a jdk25u test on a "too old" AIX 7.2 system:
15:02:02 =JAVA VERSION OUTPUT BEGIN=
15:02:02 Error: dl failure on line 532
15:02:02 Error: failed /home/jenkins/workspace/Grinder_Simple/jdkbinary/j2sdk-image/lib/server/libjvm.so, because Could not load module /home/jenkins/workspace/Grinder_Simple/jdkbinary/j2sdk-image/lib/server/libjvm.so.
15:02:02 Dependent module /usr/jdk-21.0.6+7/../lib/libc++.a(shr2_64.o) could not be loaded.
15:02:02 Member shr2_64.o is not found in archive
The next job with it set to aix73-1 went through ok
> Maybe the old one has been cached across invocations on the other machines (edit: yes, it's cached in /home/jenkins/externalDependency/system_lib/mauve/mauve.jar; the version in /home/jenkins/testDependency/system_lib/mauve/mauve.jar doesn't seem to affect things). New grinders on -3 are passing the tests.
This is interesting. I believe we were seeing this in the release triage on Alpine x64, where certain perf or system tests were passing on test-docker-alpine320-x64-4 and none of the others. I'll raise an infra issue so we can discuss having a system setup which clears the cache on machines.
EDIT: Or, instead of deleting the caches, there could be a check on the test side which verifies that the cached directories are up to date.
Raised https://github.com/adoptium/infrastructure/issues/4123