
2 new AIX73 build machines

Haroon-Khel opened this issue 9 months ago • 31 comments

2 P10 lpars have been created by the IBM team

p10-aix-adopt03.osuosl.org 140.211.9.21
p10-aix-adopt04.osuosl.org 140.211.9.66

I have set up the machines using the Ansible playbooks. However, there are still some bits left to do.

https://github.com/adoptium/infrastructure/blob/4a5620117cd586b8194f0c050e754a500fc7c98c/ansible/playbooks/AdoptOpenJDK_AIX_Playbook/roles/dnf/tasks/main.yml#L126

    - name: Install cmake 3.14.3 (See https://github.com/AdoptOpenJDK/openjdk-build/issues/2492)
      dnf:
        name: cmake-3.14.3
        state: present
        update_cache: yes
        disable_excludes: all
      tags:
        - rpm_install
        - cmake

I wasn't able to install CMake using the above task:

root@p10-aix-adopt03:[/root]dnf install cmake-3.14.3
Last metadata expiration check: 4:21:41 ago on Fri Apr  4 12:34:39 2025.
No match for argument: cmake-3.14.3
Error: Unable to find a match: cmake-3.14.3
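
To see what the AIX Toolbox repo actually provides, something like the following could be used (a sketch; assumes the Toolbox dnf supports repoquery):

    # list every cmake version the configured repos offer
    dnf list --showduplicates cmake
    # or print the exact name-version-release strings to pass to dnf install
    dnf repoquery --qf '%{name}-%{version}-%{release}' cmake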

Both the v13 and v16 XL compilers installed fine, but give an error which suggests they are not supported on AIX 7.3:

root@p10-aix-adopt03:[/root]/opt/IBM/xlC/13.1.3/bin/xlc -qversion
/opt/IBM/xlC/13.1.3/bin/.orig/xlc: 1501-287 (S) This compiler does not support AIX 7.3. Please check with IBM (http://www-01.ibm.com/support/docview.wss?rs=43&uid=swg21326972) to see if there is a PTF for this compiler that supports this AIX level.

root@p10-aix-adopt03:[/root]/opt/IBM/xlC/16.1.0/bin/xlc -qversion
/opt/IBM/xlC/16.1.0/bin/.orig/xlc: 1501-287 (S) This compiler does not support AIX 7.3. Please check with IBM (http://www-01.ibm.com/support/docview.wss?rs=43&uid=swg21326972) to see if there is a PTF for this compiler that supports this AIX level.

Had a bit of an error with the rbac role:

- name: Create auth ojdk.rtclk
  when: rtclk_exists.rc == 2
  shell:
    mkrole authorizations='ojdk.rtclk,ojdk.proccore' dfltmsg='Adoptium Role for testing' ojdk.rtclk
  register: _rtclk
  failed_when: _rtclk.rc != 0 and _rtclk.rc != 17
  tags: rbac

- name: Create auth ojdk.proccore
  when: rtcore_exists.rc == 2
  shell:
    mkauth dfltmsg="PV_PROC_CORE for to allow process core dumps" ojdk.proccore
  register: _rtcore
  failed_when: _rtcore.rc != 0 and _rtcore.rc != 17
  tags: rbac

These tasks did not work. In fact, the command from the second task, mkauth dfltmsg="PV_PROC_CORE for to allow process core dumps" ojdk.proccore, needed to be run before the first. And in the command from the first task, mkrole authorizations='ojdk.rtclk,ojdk.proccore' dfltmsg='Adoptium Role for testing' ojdk.rtclk, I needed to remove ojdk.rtclk from authorizations for the command to work.
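
Spelled out, the ordering that worked looks roughly like this (a sketch of the manual fix described above, not the final role change):

    # create the authorisation first, then the role that references it
    mkauth dfltmsg="PV_PROC_CORE for to allow process core dumps" ojdk.proccore
    # ojdk.rtclk removed from authorizations= since that auth does not exist at this point
    mkrole authorizations='ojdk.proccore' dfltmsg='Adoptium Role for testing' ojdk.rtclk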

Haroon-Khel avatar Apr 04 '25 17:04 Haroon-Khel

We're seeing this memory error when the machines try to connect to Jenkins:

Expanded the channel window size to 4MB
[04/04/25 19:00:26] [SSH] Starting agent process: cd "/home/jenkins" && /usr/java17_64/bin/java  -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 1048576 bytes. Error detail: AllocateHeap
# An error report file with more information is saved as:
# /home/jenkins/hs_err_pid35127552.log
Agent JVM has terminated. Exit code=1

Haroon-Khel avatar Apr 07 '25 16:04 Haroon-Khel

There's certainly memory available:

│          Physical  PageSpace |        pages/sec  In     Out | FileSystemCache                                                                                                                                    │
│% Used       27.9%      1.3%  | to Paging Space   0.0    0.0 | (numperm) 14.0%                                                                                                                                    │
│% Free       72.1%     98.7%  | to File System    0.0    0.0 | Process    5.4%                                                                                                                                    │
│MB Used   11444.2MB    27.3MB | Page Scans        0.0        | System     8.6%                                                                                                                                    │
│MB Free   29515.8MB  2020.7MB | Page Cycles       0.0        | Free      72.1%                                                                                                                                    │
│Total(MB) 40960.0MB  2048.0MB | Page Steals       0.0        |           ------   

Haroon-Khel avatar Apr 07 '25 16:04 Haroon-Khel

@Haroon-Khel Here are the details of some more POWER10 AIX boxes that we've been allocated:

sxa:.ssh$ host p10-aix-adopt03.osuosl.org
p10-aix-adopt03.osuosl.org has address 140.211.9.21
p10-aix-adopt03.osuosl.org has IPv6 address 2605:bc80:3010:104::8cd3:915
sxa:.ssh$ host p10-aix-adopt04.osuosl.org
p10-aix-adopt04.osuosl.org has address 140.211.9.66
p10-aix-adopt04.osuosl.org has IPv6 address 2605:bc80:3010:104::8cd3:942

~~I'll need to look at what credentials have been put on them since I can't seem to log directly into them at the moment.~~

EDIT: HK/SF keys have now been added to those two

sxa avatar Sep 11 '25 09:09 sxa

Both have the same spec and run AIX 7.3:

sxa:.ssh$ ssh [email protected] "oslevel -s; lparstat -i |egrep 'Online Memory|Virtual CPU'"
7300-00-04-2320
Online Virtual CPUs                        : 24
Maximum Virtual CPUs                       : 24
Minimum Virtual CPUs                       : 12
Online Memory                              : 40960 MB
Desired Virtual CPUs                       : 24
sxa:.ssh$ 

sxa avatar Sep 11 '25 15:09 sxa

Managed to get both machines up and running in Jenkins. The trick was to modify the ulimit settings of the jenkins user to this:

jenkins:
        fsize = -1
        core = -1
        cpu = -1
        data = 1048576 
        rss = 524288
        stack = 8388608 
        nofiles = -1

Copied it from a working AIX 7.2 build machine.
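
For reference, the same stanza can be applied with chuser instead of editing /etc/security/limits directly (a sketch; values match the stanza above and only take effect for new logins):

    chuser fsize=-1 core=-1 cpu=-1 data=1048576 rss=524288 stack=8388608 nofiles=-1 jenkins
    lsuser -a fsize core cpu data rss stack nofiles jenkins   # confirm the new limits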

Haroon-Khel avatar Oct 08 '25 12:10 Haroon-Khel

Update: due to changing requirements, the 2 build machines are now test machines:

https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-2 (formerly build-osuosl-aix73-ppc64-1)
https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-3 (formerly build-osuosl-aix73-ppc64-2)

Was seeing git issues and memory issues when running Grinders, such as:

12:22:38   > git config remote.origin.url https://github.com/adoptium/aqa-tests.git # timeout=10
12:22:38   > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
12:22:52  ERROR: Checkout failed
12:22:52  java.io.StreamCorruptedException: invalid stream header: 38372E33
12:22:52  	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:958)
12:22:52  	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:392)
Oct 09, 2025 11:28:46 AM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel channel.
java.util.concurrent.TimeoutException: Ping started at 1760009086035 hasn't completed by 1760009326035
	at hudson.remoting.PingThread.ping(PingThread.java:135)
	at hudson.remoting.PingThread.run(PingThread.java:87)

ERROR: Connection terminated
java.io.StreamCorruptedException: invalid stream header: 38312E30
	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:958)
	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:392)
	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:50)
	at hudson.remoting.Command.readFrom(Command.java:141)
	at hudson.remoting.Command.readFrom(Command.java:127)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:62)
Agent JVM has not reported exit code. Is it still running?
[10/09/25 13:29:55] [SSH] Connection closed.

But I've fixed them by adding -Xmx1048m to the JVM options and export LDR_CNTRL=MAXDATA=0x80000000 && to the agent start command prefix, both in the Jenkins node config of both nodes.
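
Combined with the launch command from the earlier log, the effect is roughly this (illustrative only; the real command is assembled by the Jenkins SSH launcher from the node config):

    export LDR_CNTRL=MAXDATA=0x80000000 && /usr/java17_64/bin/java -Xmx1048m -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache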

Haroon-Khel avatar Oct 09 '25 12:10 Haroon-Khel

AQA test pipelines running on the three AIX 7.3 nodes with JDK21:

https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-1 - https://ci.adoptium.net/job/AQA_Test_Pipeline/513/console
https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-2 - https://ci.adoptium.net/job/AQA_Test_Pipeline/514/console
https://ci.adoptium.net/computer/test-osuosl-aix73-ppc64-3 - https://ci.adoptium.net/job/AQA_Test_Pipeline/515/console

Haroon-Khel avatar Oct 09 '25 12:10 Haroon-Khel

test-osuosl-aix73-ppc64-1

sanity perf

renaissance-naive-bayes_0

Rerunning with an EA image: https://ci.adoptium.net/job/Grinder/15029/console. Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there: https://ci.adoptium.net/job/Grinder/15033/console ✅ UPDATE: no perf tests on the 7.3 machines (temporarily)

sanity openjdk: quite a few failures, going to rerun with an EA image https://ci.adoptium.net/job/Grinder/15027/console

Still some failures. I've added the hostname of the node into /etc/hosts, and sun/security/krb5/auto/NoAddresses now passes: https://ci.adoptium.net/job/Grinder/15041/testReport/

UPDATE: 50 failures down to 34 (https://ci.adoptium.net/job/Grinder/15049/testReport/), possibly after the hostname change

Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there https://ci.adoptium.net/job/Grinder/15035/console

extended perf

renaissance-als_0
renaissance-chi-square_0
renaissance-dec-tree_0
renaissance-gauss-mix_0
renaissance-log-regression_0
renaissance-movie-lens_0

Rerunning with an EA image: https://ci.adoptium.net/job/Grinder/15030/console. Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there: https://ci.adoptium.net/job/Grinder/15034/console. UPDATE: no perf tests on the 7.3 machines (temporarily)

sanity system

MauveSingleThrdLoad_HS_5m_0
MauveSingleThrdLoad_HS_5m_1
MauveSingleInvocLoad_HS_5m_0
MauveSingleInvocLoad_HS_5m_1
MauveMultiThrdLoad_5m_0
MauveMultiThrdLoad_5m_1

Rerunning with ea image https://ci.adoptium.net/job/Grinder/15031/console Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there https://ci.adoptium.net/job/Grinder/15036/console

UPDATE: passed on 73-1 using the v1.0.8-release branch https://ci.adoptium.net/job/Grinder/15070/ ✅

extended openjdk: a lot of failures, will rerun with an EA build https://ci.adoptium.net/job/Grinder/15032/console. Also rerunning on test-osuosl-aix72-ppc64-2 to see if it fails there: https://ci.adoptium.net/job/Grinder/15037/console

Haroon-Khel avatar Oct 14 '25 12:10 Haroon-Khel

Seeing memory issues on -2

08:06:20  Uncaught error from thread [[3.814s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (11=EAGAIN) for attributes: stacksize: 2112k, guardsize: 0k, detached.
08:06:20  UCT-akka.actor.default-dispatcher-9[3.814s][warning][os,thread] Number of threads approx. running in the VM: 617
08:06:20  [3.814s][warning][os,thread] Checking JVM parameter MaxExpectedDataSegmentSize (currently 8388608k)  might be helpful
08:06:20  ]: unable to create native thread: possibly out of memory or process/resource limits reached, shutting down ActorSystem[[3.815s][warning][os,thread] Failed to start the native thread for java.lang.Thread "UCT-akka.actor.default-dispatcher-20"
08:06:20  UCT]
08:06:20  [3.815s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (11=EAGAIN) for attributes: stacksize: 2112k, guardsize: 0k, detached.
08:06:20  [3.815s][warning][os,thread] Number of threads approx. running in the VM: 617
08:06:20  [3.815s][warning][os,thread] Checking JVM parameter MaxExpectedDataSegmentSize (currently 8388608k)  might be helpful
08:06:20  [3.815s][warning][os,thread] Failed to start the native thread for java.lang.Thread "UCT-akka.actor.default-dispatcher-21"
08:06:20  java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
08:06:20  	at java.base/java.lang.Thread.start0(Native Method)
08:06:20  	at java.base/java.lang.Thread.start(Thread.java:1553)
08:06:20  	at java.base/java.lang.System$2.start(System.java:2577)
08:06:20  	at java.base/jdk.internal.vm.SharedThreadContainer.start(SharedThreadContainer.java:152)
08:06:20  	at java.base/java.util.concurrent.ForkJoinPool.createWorker(ForkJoinPool.java:1575)

Possible limit on the number of processes causing this to fail
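
One thing worth checking for pthread_create EAGAIN on AIX is the per-user process limit (a sketch; the value in the chdev line is illustrative):

    lsattr -El sys0 -a maxuproc          # current per-user process limit
    chdev -l sys0 -a maxuproc=16384      # raise it if it looks low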

Haroon-Khel avatar Oct 14 '25 12:10 Haroon-Khel

Same thing on -3, extended perf

20:27:30  
20:27:30  java.lang.OutOfMemoryError: Unable to allocate 1048576 bytes
20:27:30  	at java.base/jdk.internal.misc.Unsafe.allocateMemory(Unsafe.java:632)
20:27:30  	at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:115)
20:27:30  	at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:360)

Haroon-Khel avatar Oct 14 '25 12:10 Haroon-Khel

I've tried a few things with the memory/process management on -2 and -3 but have gotten nowhere. Some of the things I've tried: adding nproc = 24 to /etc/security/limits, copying the /etc/security/limits file of a working machine into that of -2, and various other ulimit tweaks.

In the meantime I am simply not going to run perf tests on -2 and -3

AQA pipelines without perf tests:

https://ci.adoptium.net/job/AQA_Test_Pipeline/516/console (test-osuosl-aix73-ppc64-2)
https://ci.adoptium.net/job/AQA_Test_Pipeline/517/console (test-osuosl-aix73-ppc64-3)

Haroon-Khel avatar Oct 15 '25 14:10 Haroon-Khel

The renaissance-naive-bayes_0 Grinder failure is because it can't resolve the output of hostname by the look of it - it will likely need an entry in /etc/hosts:

17:47:10  Caused by: java.net.UnknownHostException: adopt01: Hostname and service name not provided or found
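
A quick fix would be to add the short hostname to /etc/hosts (a sketch; the IP below is a placeholder for the machine's own address):

    echo "<machine-ip>   adopt01" >> /etc/hosts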

sxa avatar Oct 15 '25 15:10 sxa

Log from failing sanity system test on 73-1

18:00:45  LT  17:00:44.510 - First failure detected by thread: load-15. Not creating dumps as no dump generation is requested for this load test

I'm sure core dumps are enabled for the jenkins user:

jenkins@adopt01:[/home/jenkins]ulimit -a
...
coredump(blocks)     unlimited

and

jenkins@adopt01:[/home/jenkins]lsattr -l sys0 -a fullcore -E
fullcore true Enable full CORE dump True

EDIT: this might be related to the fact that the rbac role didn't run properly, see the top comment https://github.com/adoptium/infrastructure/issues/3920#issue-2972960320

Haroon-Khel avatar Oct 16 '25 11:10 Haroon-Khel

I'm sure core dumps are enabled for the jenkins user

Suggest logging in and running something like this to test if it's able to create the dumps without java getting in the way:

sleep 10 &
kill -SEGV %1

This should cause the sleep process, which is kicked off in the background, to think it has had a segmentation fault and therefore dump core.

sxa avatar Oct 16 '25 11:10 sxa

[1]+  Segmentation fault      (core dumped) sleep 10
jenkins@adopt01:[/home/jenkins]ls -la 

Core file exists

-rw-r--r--    1 jenkins  staff       1009320 Oct 16 11:47 core

Haroon-Khel avatar Oct 16 '25 11:10 Haroon-Khel

I think I understand the rbac role a little more:

We want to create an ojdk role which has the authorisations ojdk.rtclk and ojdk.proccore assigned to it. And then the following setsecattr call

            setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore
              innateprivs=PV_PROC_RTCLK,PV_PROC_CORE
              inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE
              secflags=FSF_EPS
              "{{ rbac_cmd }}"

adds these auths to the rbac_cmd commands listed in https://github.com/adoptium/infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_AIX_Playbook/roles/rbac/defaults/main.yml:

/usr/bin/ksh
/opt/freeware/bin/bash_32
/opt/freeware/bin/bash_64

However, though the ojdk role is created, it is not itself referenced in the rest of the role; only the authorisations ojdk.rtclk and ojdk.proccore are referenced, and even then I am suspicious of whether these authorisations do anything:

root@adopt01:[/root]lsauth ALL  | grep ojdk
ojdk.proccore id=10022 dfltmsg=PV_PROC_CORE for to allow process core dumps
ojdk.rtclk id=10023 dfltmsg=Adoptium Role for testing

The authorisations themselves only have a message attribute, dfltmsg, associated with them and nothing else. These auths are also referenced in /etc/security/privcmds as a result of the setsecattr command above:

/usr/bin/ksh:
        accessauths = ojdk.rtclk,ojdk.proccore
        innateprivs = PV_PROC_RTCLK,PV_PROC_CORE
        inheritprivs = PV_PROC_RTCLK,PV_PROC_CORE
        secflags = FSF_EPS

/opt/freeware/bin/bash_32:
        accessauths = ojdk.rtclk,ojdk.proccore
        innateprivs = PV_PROC_RTCLK,PV_PROC_CORE
        inheritprivs = PV_PROC_RTCLK,PV_PROC_CORE
        secflags = FSF_EPS

/opt/freeware/bin/bash_64:
        accessauths = ojdk.rtclk,ojdk.proccore
        innateprivs = PV_PROC_RTCLK,PV_PROC_CORE
        inheritprivs = PV_PROC_RTCLK,PV_PROC_CORE
        secflags = FSF_EPS

TL;DR: I think the only thing we need in the rbac role is

            setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore
              innateprivs=PV_PROC_RTCLK,PV_PROC_CORE
              inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE
              secflags=FSF_EPS
              "{{ rbac_cmd }}"

but perhaps without the accessauths=ojdk.rtclk,ojdk.proccore part
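
If it helps, the state that setsecattr/setkst leave behind can be inspected like this (a sketch, assuming the usual AIX RBAC query commands are present at this level):

    lssecattr -c ALL | egrep 'ksh|bash'    # privileged command database entries
    lskst -t cmd /usr/bin/ksh              # what is actually loaded in the kernel security tables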

Haroon-Khel avatar Oct 16 '25 15:10 Haroon-Khel

Ping @aixtools, in case you're able to give any info on the comment above.

Haroon-Khel avatar Oct 16 '25 15:10 Haroon-Khel

Seeing more memory problems on the three AIX 7.3 machines. Tried cloning the jdk21u repo to run some tests locally:

jenkins@adopt01:[/home/jenkins/sanity_system]git clone https://github.com/adoptium/jdk21u.git
Cloning into 'jdk21u'...
remote: Enumerating objects: 1387850, done.
remote: Counting objects: 100% (4741/4741), done.
remote: Compressing objects: 100% (1571/1571), done.
remote: Total 1387850 (delta 3437), reused 3481 (delta 3117), pack-reused 1383109 (from 4)
Receiving objects: 100% (1387850/1387850), 1.11 GiB | 30.85 MiB/s, done.
Resolving deltas:  24% (246260/1026083)
fatal: inflateInit: out of memory (no message)
fatal: fetch-pack: invalid index-pack output

The ulimit settings are the same as those of a working machine.

Haroon-Khel avatar Oct 17 '25 10:10 Haroon-Khel

Re: sanity system tests on 73-1, https://github.com/adoptium/infrastructure/issues/3920#issuecomment-3410337532

I managed to get the sanity system tests to pass on 73-1 using modified commands from the rbac role

I first created an ojdk role:

mkrole dfltmsg="Top-Level authorization for AdoptOpenJava Project" ojdk

Then created the two authorisations and updated the kernel tables:

mkauth dfltmsg="PV_PROC_CORE for to allow process core dumps" ojdk.proccore
mkauth dfltmsg="Adoptium Role for testing" ojdk.rtclk
setkst

Then I ran through the commands which give authority to the three shells:

root@p10-aix-adopt03:[/root]setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore innateprivs=PV_PROC_RTCLK,PV_PROC_CORE inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE secflags=FSF_EPS /usr/bin/ksh
root@p10-aix-adopt03:[/root]setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore innateprivs=PV_PROC_RTCLK,PV_PROC_CORE inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE secflags=FSF_EPS /opt/freeware/bin/bash_32
root@p10-aix-adopt03:[/root]setsecattr -c accessauths=ojdk.rtclk,ojdk.proccore innateprivs=PV_PROC_RTCLK,PV_PROC_CORE inheritprivs=PV_PROC_RTCLK,PV_PROC_CORE secflags=FSF_EPS /opt/freeware/bin/bash_64
root@p10-aix-adopt03:[/root]setkst

I think this allowed the sanity tests to pass on 73-1 because the failure there was due to limited authority to create core dumps. The same solution is not working on 73-2, I think because the failure there is memory related.

The rbac role in the playbooks should be updated based on these changes, but I would like to get some more clarification on what exactly the commands do before making the update

Haroon-Khel avatar Oct 17 '25 15:10 Haroon-Khel

I'm probably going to take a break from this issue now :)

Haroon-Khel avatar Oct 17 '25 15:10 Haroon-Khel

@Haroon-Khel ulimit -d was set to 262144 by default on the two failing systems. adopt04 (140.211.9.66) had it set to 524288 and that's what makes the git clone work currently.
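
That corresponds to the data attribute in /etc/security/limits, so a minimal fix on the failing systems would be something like this (a sketch; takes effect on the next login/agent session):

    chuser data=524288 jenkins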

sxa avatar Oct 21 '25 14:10 sxa

adopt01 (test-1) passes extended.perf with the LDR_CNTRL variable set. adopt04 (test-3) passes all of extended.perf except renaissance-finagle-http_0, which gives [error occurred during error reporting (), id 0xe0000001, Out of Memory Error (src/hotspot/share/memory/arena.cpp:168)] even with ulimit -d at the 524288 value.

It appears to run OK when run as the root user instead of jenkins though, which would indicate that the failure is due to the user/role settings somewhere. Noting that the machine has 40GiB of RAM available, so it should not be a true memory limitation.

[EDIT: If the root account is restricted to ulimit -d 524288 instead of unlimited it fails there too, which is surprising since it passes with that value for jenkins on the test-1 machine]

sxa avatar Oct 22 '25 11:10 sxa

renaissance-finagle-http_0 seems to require LDR_CNTRL=MAXDATA=0xB0000000 (note the B in there, instead of an 8, which isn't adequate). The need to increase that value has been seen elsewhere.

The AIX 7.2 test machines all have it set to 0xA0000000 in the jenkins agent config (Build machines have 0x80000000) so that may be adequate.

[Image: Jenkins agent configuration on an AIX 7.2 test machine]

The 7.3 ones are configured with an agent start prefix:

[Image: agent start prefix in the Jenkins node configuration for the AIX 7.3 machines]

which I will now update to be 0xA0000000

Noting that test-osuosl-aix73-ppc64-1 seems to work OK despite the fact it was set to 0x80000000.
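
To confirm which MAXDATA value a running agent actually inherited, the agent JVM's environment can be inspected (a sketch using Berkeley-style ps flags; the PID lookup is illustrative):

    ps -ef | grep '[r]emoting.jar'                     # find the agent JVM PID
    ps eww <agent-pid> | tr ' ' '\n' | grep LDR_CNTRL  # show the LDR_CNTRL it was started with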

sxa avatar Oct 23 '25 13:10 sxa

Re-runs after a disconnect/reconnect cycle on the agent (except -1, as it was running ok):

| Machine | extended.perf | sanity.system |
| --- | --- | --- |
| test-aix73-1 | Grinder 15146 | Grinder 15149 |
| test-aix73-2 | Grinder 15147 | Grinder 15153 |
| test-aix73-3 | Grinder 15148 | Grinder 15151 |

extended.perf all passed. sanity.system failed the same set of mauve tests on each run:

19:53:59  FAILED test targets:
19:53:59  	MauveSingleThrdLoad_HS_5m_0
19:53:59  	MauveSingleThrdLoad_HS_5m_1
19:53:59  	MauveSingleInvocLoad_HS_5m_0
19:53:59  	MauveSingleInvocLoad_HS_5m_1
19:53:59  	MauveMultiThrdLoad_5m_0
19:53:59  	MauveMultiThrdLoad_5m_1

Re-grinding those targets on aix72-5 and aix72-4 - both passed

Failing log:
19:27:30  LT  18:27:29.283 -   4127 Mauve[gnu.testlet.javax.swing.AbstractAction.clone]  Weighting=1 
19:27:30  LT  18:27:29.283 -   4603 Mauve[gnu.testlet.javax.xml.xpath.XPath]  Weighting=1 
19:27:30  LT  18:27:29.306 - Starting thread. Suite=0 thread=0
19:27:30  LT  18:27:29.735 - First failure detected by thread: load-0. Not creating dumps as no dump generation is requested for this load test
19:27:30  LT  18:27:29.740 - Test failed
19:27:30  LT    Failure num.  = 1
19:27:30  LT    Test number   = 1040
19:27:30  LT    Test details  = 'Mauve[gnu.testlet.java.lang.Boolean.classInfo.getAnnotation]'
19:27:30  LT    Suite number  = 0
19:27:30  LT    Thread number = 0
19:27:30  LT  >>> Captured test output >>>
19:27:30  LT  Test failed:
19:27:30  LT  java.lang.ClassNotFoundException: gnu.testlet.java.lang.Boolean.classInfo.getAnnotation
19:27:30  LT  	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
19:27:30  LT  	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
19:27:30  LT  	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
19:27:30  LT  	at java.base/java.lang.Class.forName0(Native Method)
19:27:30  LT  	at java.base/java.lang.Class.forName(Class.java:421)
19:27:30  LT  	at java.base/java.lang.Class.forName(Class.java:412)
19:27:30  LT  	at net.adoptopenjdk.loadTest.adaptors.MauveAdaptor.executeTest(MauveAdaptor.java:51)
19:27:30  LT  	at net.adoptopenjdk.loadTest.LoadTestRunner$2.run(LoadTestRunner.java:182)
19:27:30  LT  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
19:27:30  LT  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
19:27:30  LT  	at java.base/java.lang.Thread.run(Thread.java:1583)
19:27:30  LT  <<<

The above is showing two messages. One seems to be related to the known missing classes in mauve.jar although it's unclear why that's only throwing an error on the new machines...

19:27:30  LT  18:27:29.735 - First failure detected by thread: load-0. Not creating dumps as no dump generation is requested for this load test

Maybe the old one has been cached across invocations on the other machines. (Edit: yes, it's cached in /home/jenkins/externalDependency/system_lib/mauve/mauve.jar; the version in /home/jenkins/testDependency/system_lib/mauve/mauve.jar doesn't seem to affect things.) New grinders on -3 are passing the tests.
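
If that's the case, moving the stale cached jar aside should force the next run to pick up a current copy (a sketch; assumes the test setup re-populates the external dependency cache when the file is missing):

    mv /home/jenkins/externalDependency/system_lib/mauve/mauve.jar /home/jenkins/externalDependency/system_lib/mauve/mauve.jar.stale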

sxa avatar Oct 23 '25 14:10 sxa

@Haroon-Khel With extended.perf and sanity.system now passing, does that cover everything that was outstanding on these machines, or are there still other failures or configuration changes that need to be made?

sxa avatar Oct 24 '25 10:10 sxa

Re-running AQA_Test_Pipeline on the latest JDK25 EA build, which is showing as relatively clean on the October release build.

sxa avatar Oct 26 '25 11:10 sxa

I just tried to re-run the short list of openjdk failures on the AIX 7.2 box with nightly, releases, aqa master and branches, and they all failed trying to load a library from the system jdk17, despite 25 being the one under test. See Grinders 15336 through 15339. Not sure what's going on there...

sxa avatar Oct 31 '25 09:10 sxa

DBBLoadTest_5m_0 seems to be consistently failing on the AIX 7.3 machines -2 and -3, but not -1; AIX 7.2 passes on both -2 and -5. Errors include 11:13:52 DBLT java.lang.OutOfMemoryError: null. Running make _DBBLoadTest_5m_0 manually:

  • java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached - adding export LDR_CNTRL=MAXDATA=0xB0000000 resolves this, which then gives:
  • java.lang.OutOfMemoryError: Unable to allocate 10000000 bytes - trying to run as root then gives:
  • java.net.ConnectException: Connection timed out

I then tried running as root with the hostname mapped to 127.0.0.1 in /etc/hosts and got the same failure. Reverted it and it started passing. So we potentially have an intermittent connection issue showing up in this test, but we still need to understand the memory issue so we can run it as the jenkins user.

@Haroon-Khel Can you take a look at the remaining issues please, since I think we're down to setup-related differences in most cases now:

  1. What might be different between -1 and the other two machines which is causing the failure of DBBLoadTest_5m_0? I note that on -1 ulimit -m unlimited works (but isn't the default for jenkins) but not on -3 where I'm getting the failures, although it does pass when run as root which has the same value of memory(kbytes) 262144
  2. Ensure we have the X11.vfb package on aix73-3 as that seems to be missing and causing failures (see the check sketch after this list)
  3. We need to understand the small number of failures we're seeing in the openjdk suites ... Re-runs of jdk and hotspot on aix73-1. AIX 7.2 versions will be at jdk / hotspot
  4. aix73-3 cannot run jdk25. Possible missing openxl17 compiler runtime?
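
For item 2, checking and installing the virtual framebuffer fileset could look like this (a sketch; the install source is a placeholder for whatever AIX media/NIM resource is available):

    lslpp -l "X11.vfb*"                         # is the fileset installed?
    installp -agXY -d <install-source> X11.vfb  # install it with requisites if not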

Also tagging @suchismith1993 as an FYI as this is related to the new AIX 7.3 machines that we have.

sxa avatar Oct 31 '25 12:10 sxa

Not sure why but this job on aix73-3 failed in a similar way to when you try to run a jdk25u test on a "too old" AIX 7.2 system:

15:02:02  =JAVA VERSION OUTPUT BEGIN=
15:02:02  Error: dl failure on line 532
15:02:02  Error: failed /home/jenkins/workspace/Grinder_Simple/jdkbinary/j2sdk-image/lib/server/libjvm.so, because Could not load module /home/jenkins/workspace/Grinder_Simple/jdkbinary/j2sdk-image/lib/server/libjvm.so.
15:02:02  	Dependent module /usr/jdk-21.0.6+7/../lib/libc++.a(shr2_64.o) could not be loaded.
15:02:02  	Member shr2_64.o is not found in archive 

The next job with it set to aix73-1 went through ok

sxa avatar Oct 31 '25 15:10 sxa

Maybe the old one has been cached across invocations on the other machines. (Edit: yes, it's cached in /home/jenkins/externalDependency/system_lib/mauve/mauve.jar; the version in /home/jenkins/testDependency/system_lib/mauve/mauve.jar doesn't seem to affect things.) New grinders on -3 are passing the tests.

This is interesting. I believe we were seeing this in the release triage on Alpine x64, where certain perf or system tests were passing on test-docker-alpine320-x64-4 and none of the others. I'll raise an infra issue so we can discuss having a system setup which clears the cache on machines.

EDIT: or instead of deleting the caches, there could be a check on the test side which verifies whether the cached directories are up to date.

Raised https://github.com/adoptium/infrastructure/issues/4123

Haroon-Khel avatar Nov 04 '25 12:11 Haroon-Khel