High load on `test-osuosl-ubuntu1604-ppc64le-2`
We are getting warning messages from Nagios that the machine is sitting with a load of 17.00:
```
HOST: test-osuosl-ubuntu1604-ppc64le-2 SERVICE: Current Load STATE: WARNING MESSAGE: WARNING - load average: 17.00, 17.00, 17.00
```
[See Nagios](https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=test-osuosl-ubuntu1604-ppc64le-2)
Note that while it's currently stuck up there, at 09:04 this morning Nagios declared it good again with load averages of 0.04, 0.05, 1.02, but then the load went back up.
This machine is running Ubuntu 16.04.7 and has been up for nearly a year:
```
10:48:32 up 337 days, 16:52, 1 user, load average: 17.14, 17.07, 17.02
```
There are no obvious processes using lots of CPU, although there has been a recent kernel exception:
```
[29108237.051171] kernel BUG at /build/linux-6rygVt/linux-4.4.0/mm/memory.c:3214!
[29108237.051730] Oops: Exception in kernel mode, sig: 5 [#1]
[29108237.051799] SMP NR_CPUS=2048 NUMA pSeries
[29108237.051889] Modules linked in: ufs msdos xfs ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay aufs input_leds joydev vmx_crypto gf128mul ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core binfmt_misc ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid
[29108237.053506] CPU: 2 PID: 3090 Comm: java Not tainted 4.4.0-210-generic #242-Ubuntu
[29108237.053611] task: c0000001fa94ff00 ti: c000000019d28000 task.ti: c000000019d28000
[29108237.053714] NIP: c0000000002803f4 LR: c00000000027fb64 CTR: 0000000000000000
[29108237.053816] REGS: c000000019d2b440 TRAP: 0700 Not tainted (4.4.0-210-generic)
[29108237.053918] MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 44822882 XER: 00000000
[29108237.054179] CFAR: c00000000027ffa8 SOFTE: 1
GPR00: c00000000027fb30 c000000019d2b6c0 c000000001654800 0000000000000001
GPR04: c000000003bf7ff8 00003fff72a30000 c00000018d8ae000 0000000000000000
GPR08: 0000000000000000 0000000000000001 bfffffffffffffff 0000000000000060
GPR12: 0000000024822882 c00000000fb01400 0000000000000000 c00000000156b7b3
GPR16: fffffffffffff000 00000000000000fd 0000000080000000 c00000018d8ae518
GPR20: c00000017c801b90 0000000000000000 c00000018d8ae000 0000000000000518
GPR24: 000050fa40000181 0000000000000001 0000000000001b90 c0000001fa9e2100
GPR28: 0000000000000000 c00000017c950578 00003fff72a30000 c00000017c800000
[29108237.055609] NIP [c0000000002803f4] handle_mm_fault+0x974/0x1940
[29108237.055697] LR [c00000000027fb64] handle_mm_fault+0xe4/0x1940
[29108237.055782] Call Trace:
[29108237.055819] [c000000019d2b6c0] [c00000000027fb30] handle_mm_fault+0xb0/0x1940 (unreliable)
[29108237.055980] [c000000019d2b790] [c000000000278b50] __get_user_pages+0x1a0/0x550
[29108237.056102] [c000000019d2b840] [c0000000002798fc] get_dump_page+0x4c/0x80
[29108237.056204] [c000000019d2b880] [c000000000375dd0] elf_core_dump+0x800/0x8e0
[29108237.056327] [c000000019d2ba60] [c00000000037e90c] do_coredump+0xddc/0x1250
[29108237.056435] [c000000019d2bc20] [c0000000000d4a70] get_signal+0x1b0/0x9e0
[29108237.056539] [c000000019d2bd10] [c00000000001a988] do_signal+0x68/0x2c0
[29108237.056668] [c000000019d2be00] [c00000000001addc] do_notify_resume+0xbc/0xd0
[29108237.056789] [c000000019d2be30] [c00000000000bf38] ret_from_except_lite+0x64/0x68
[29108237.056909] Instruction dump:
[29108237.056979] 912a0004 7ea3ab78 4bfce77d 60000000 4bfff88c 60000000 60420000 e93d0050
[29108237.057158] 571c05ac 79290760 7d290074 7929d182 <0b090000> 7c7af82a 4bdd930d 60000000
[29108240.947198] ---[ end trace 82a11c1081695ed3 ]---
```
```
root@test-osuosl-ubuntu1604-ppc64le-2:~# cat /proc/uptime
29177874.98 116091099.58
root@test-osuosl-ubuntu1604-ppc64le-2:~# uptime
10:54:42 up 337 days, 16:59, 1 user, load average: 18.38, 17.93, 17.40
root@test-osuosl-ubuntu1604-ppc64le-2:~#
```
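The call trace above runs `do_coredump` -> `elf_core_dump` -> `get_dump_page` -> `handle_mm_fault`, i.e. the kernel hit the BUG while writing a core dump for a java process (PID 3090 in the trace). Tasks that get wedged in the kernel like this typically sit in uninterruptible (D) state, and every D-state task adds one to the load average even though it consumes no CPU, which would explain a high, flat load with no visibly busy processes. A minimal sketch for spotting such tasks (a generic check, not a command from the session above):

```sh
# List tasks in uninterruptible (D) sleep: they inflate the load
# average without using any CPU. The wchan column shows where in
# the kernel each one is blocked.
ps -eo state,pid,user,wchan:32,cmd | awk '$1 ~ /^D/'
```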
I've kicked off https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_hs_sanity.openjdk_ppc64le_linux/742/ to see if the machine is actually acting slow due to the high load, but I expect a reboot will be in order.
Time to execute the above job was not unduly affected by the machine load.
There are a few leftover jenkins processes from yesterday running as the jenkins user, although they are not using significant amounts of CPU time:
```
root@test-osuosl-ubuntu1604-ppc64le-2:/var/log# ps augwwx | grep jenkins
jenkins 1126 0.0 0.0 10560 7872 ? Ss Sep12 0:21 /lib/systemd/systemd --user
jenkins 1130 0.0 0.0 160448 5760 ? S Sep12 0:00 (sd-pam)
jenkins 3086 0.0 0.0 3072 1344 ? S Oct05 0:00 sh -c ulimit -c unlimited && /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java -ea -esa -Xmx512m --enable-preview -Xint -XX:+CreateCoredumpOnCrash -Djava.library.path=/home/jenkins/workspace/Grinder/openjdkbinary/openjdk-test-image/hotspot/jtreg/native -cp /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/serviceability/sa/ClhsdbFindPC_no-xcomp-core.d:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/test/lib jdk.test.lib.apps.LingeredApp f51a16ce-8db7-4dd1-9bae-47f397683477.lck forceCrash
jenkins 3088 0.0 0.5 3023424 49600 ? Dl Oct05 0:03 /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java -ea -esa -Xmx512m --enable-preview -Xint -XX:+CreateCoredumpOnCrash -Djava.library.path=/home/jenkins/workspace/Grinder/openjdkbinary/openjdk-test-image/hotspot/jtreg/native -cp /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/serviceability/sa/ClhsdbFindPC_no-xcomp-core.d:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/test/lib jdk.test.lib.apps.LingeredApp f51a16ce-8db7-4dd1-9bae-47f397683477.lck forceCrash
jenkins 3106 0.0 0.3 38656 30272 ? S Oct05 0:00 /usr/bin/python3 /usr/share/apport/apport 3088 6 18446744073709551615 1 3088 !home!jenkins!workspace!Grinder!openjdkbinary!j2sdk-image!bin!java
root 6087 0.0 0.0 9984 1920 pts/0 S+ 13:38 0:00 grep --color=auto jenkins
root 19468 0.0 0.1 18176 13312 ? Ss Sep13 0:00 sshd: jenkins [priv]
jenkins 19521 0.0 0.1 18816 10304 ? S Sep13 1:57 sshd: jenkins@notty
jenkins 19562 0.0 0.0 10688 2112 ? Ss Sep13 0:00 bash -c cd "/home/jenkins" && java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
jenkins 19563 0.5 4.5 4590656 381440 ? Sl Sep13 175:24 java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
root@test-osuosl-ubuntu1604-ppc64le-2:/var/log#
```
Interestingly, those hung Grinder processes are likely the ones I was using for replicating/testing https://github.com/adoptium/aqa-tests/issues/4006#issuecomment-1268330876 - the java process in the above output (PID 3088) is not responding to `kill -KILL`. I might leave it for a while to see if it disappears before triggering a reboot, given that it doesn't seem to be disrupting any other execution at the moment.
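For the record, a rough way to confirm why a PID like 3088 ignores `kill -KILL` (assuming `/proc/<pid>/stack` is enabled on this kernel, and reusing the PID from the output above):

```sh
# A task in uninterruptible (D) sleep ignores all signals, including
# SIGKILL, until the kernel operation it is blocked on completes.
grep -E '^(Name|State)' /proc/3088/status
cat /proc/3088/stack   # kernel call chain it is blocked in (root only)
```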
Rebooted.
Generated some more alerts overnight, so reopening. The -1 machine is running the same kernel (4.4.0-210-generic) and is not having the same problem; at the time of writing it has a load of zero.