High load on `test-osuosl-ubuntu1604-ppc64le-2`
We are getting warning messages from Nagios that the machine is sitting with a load of 17.00:
```
HOST: test-osuosl-ubuntu1604-ppc64le-2 SERVICE: Current Load STATE: WARNING MESSAGE: WARNING - load average: 17.00, 17.00, 17.00
```
[See Nagios](https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=test-osuosl-ubuntu1604-ppc64le-2)
Note that while it's currently stuck up there, at 09:04 this morning Nagios declared it good again with load averages of 0.04, 0.05, 1.02, but then the load went back up.
This machine is running Ubuntu 16.04.7 and has been up for nearly a year:
```
10:48:32 up 337 days, 16:52, 1 user, load average: 17.14, 17.07, 17.02
```
There are no obvious processes using lots of CPU, although there has been a recent kernel exception:
```
[29108237.051171] kernel BUG at /build/linux-6rygVt/linux-4.4.0/mm/memory.c:3214!
[29108237.051730] Oops: Exception in kernel mode, sig: 5 [#1]
[29108237.051799] SMP NR_CPUS=2048 NUMA pSeries
[29108237.051889] Modules linked in: ufs msdos xfs ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay aufs input_leds joydev vmx_crypto gf128mul ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core binfmt_misc ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid
[29108237.053506] CPU: 2 PID: 3090 Comm: java Not tainted 4.4.0-210-generic #242-Ubuntu
[29108237.053611] task: c0000001fa94ff00 ti: c000000019d28000 task.ti: c000000019d28000
[29108237.053714] NIP: c0000000002803f4 LR: c00000000027fb64 CTR: 0000000000000000
[29108237.053816] REGS: c000000019d2b440 TRAP: 0700 Not tainted (4.4.0-210-generic)
[29108237.053918] MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 44822882 XER: 00000000
[29108237.054179] CFAR: c00000000027ffa8 SOFTE: 1
GPR00: c00000000027fb30 c000000019d2b6c0 c000000001654800 0000000000000001
GPR04: c000000003bf7ff8 00003fff72a30000 c00000018d8ae000 0000000000000000
GPR08: 0000000000000000 0000000000000001 bfffffffffffffff 0000000000000060
GPR12: 0000000024822882 c00000000fb01400 0000000000000000 c00000000156b7b3
GPR16: fffffffffffff000 00000000000000fd 0000000080000000 c00000018d8ae518
GPR20: c00000017c801b90 0000000000000000 c00000018d8ae000 0000000000000518
GPR24: 000050fa40000181 0000000000000001 0000000000001b90 c0000001fa9e2100
GPR28: 0000000000000000 c00000017c950578 00003fff72a30000 c00000017c800000
[29108237.055609] NIP [c0000000002803f4] handle_mm_fault+0x974/0x1940
[29108237.055697] LR [c00000000027fb64] handle_mm_fault+0xe4/0x1940
[29108237.055782] Call Trace:
[29108237.055819] [c000000019d2b6c0] [c00000000027fb30] handle_mm_fault+0xb0/0x1940 (unreliable)
[29108237.055980] [c000000019d2b790] [c000000000278b50] __get_user_pages+0x1a0/0x550
[29108237.056102] [c000000019d2b840] [c0000000002798fc] get_dump_page+0x4c/0x80
[29108237.056204] [c000000019d2b880] [c000000000375dd0] elf_core_dump+0x800/0x8e0
[29108237.056327] [c000000019d2ba60] [c00000000037e90c] do_coredump+0xddc/0x1250
[29108237.056435] [c000000019d2bc20] [c0000000000d4a70] get_signal+0x1b0/0x9e0
[29108237.056539] [c000000019d2bd10] [c00000000001a988] do_signal+0x68/0x2c0
[29108237.056668] [c000000019d2be00] [c00000000001addc] do_notify_resume+0xbc/0xd0
[29108237.056789] [c000000019d2be30] [c00000000000bf38] ret_from_except_lite+0x64/0x68
[29108237.056909] Instruction dump:
[29108237.056979] 912a0004 7ea3ab78 4bfce77d 60000000 4bfff88c 60000000 60420000 e93d0050
[29108237.057158] 571c05ac 79290760 7d290074 7929d182 <0b090000> 7c7af82a 4bdd930d 60000000
[29108240.947198] ---[ end trace 82a11c1081695ed3 ]---
```
```
root@test-osuosl-ubuntu1604-ppc64le-2:~# cat /proc/uptime
29177874.98 116091099.58
root@test-osuosl-ubuntu1604-ppc64le-2:~# uptime
10:54:42 up 337 days, 16:59, 1 user, load average: 18.38, 17.93, 17.40
root@test-osuosl-ubuntu1604-ppc64le-2:~#
```
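The call trace above runs `do_coredump` -> `elf_core_dump` -> `get_dump_page` -> `handle_mm_fault`, i.e. the kernel hit the BUG while writing a core dump for a java process (PID 3090 in the trace). Tasks that get wedged in the kernel like this typically sit in uninterruptible (D) state, and every D-state task adds one to the load average even though it consumes no CPU, which would explain a high, flat load with no visibly busy processes. A minimal sketch for spotting such tasks (a generic check, not a command from the session above):

```sh
# List tasks in uninterruptible (D) sleep: they inflate the load
# average without using any CPU. The wchan column shows where in
# the kernel each one is blocked.
ps -eo state,pid,user,wchan:32,cmd | awk '$1 ~ /^D/'
```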
I've kicked off https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_hs_sanity.openjdk_ppc64le_linux/742/ to see if the machine is actually acting slow due to the high load, but I expect a reboot will be in order.
Time to execute the above job was not unduly affected by the machine load.
There are a few leftover jenkins processes from yesterday running as the jenkins user, although they are not using significant amounts of CPU time:
```
root@test-osuosl-ubuntu1604-ppc64le-2:/var/log# ps augwwx | grep jenkins
jenkins 1126 0.0 0.0 10560 7872 ? Ss Sep12 0:21 /lib/systemd/systemd --user
jenkins 1130 0.0 0.0 160448 5760 ? S Sep12 0:00 (sd-pam)
jenkins 3086 0.0 0.0 3072 1344 ? S Oct05 0:00 sh -c ulimit -c unlimited && /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java -ea -esa -Xmx512m --enable-preview -Xint -XX:+CreateCoredumpOnCrash -Djava.library.path=/home/jenkins/workspace/Grinder/openjdkbinary/openjdk-test-image/hotspot/jtreg/native -cp /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/serviceability/sa/ClhsdbFindPC_no-xcomp-core.d:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/test/lib jdk.test.lib.apps.LingeredApp f51a16ce-8db7-4dd1-9bae-47f397683477.lck forceCrash
jenkins 3088 0.0 0.5 3023424 49600 ? Dl Oct05 0:03 /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java -ea -esa -Xmx512m --enable-preview -Xint -XX:+CreateCoredumpOnCrash -Djava.library.path=/home/jenkins/workspace/Grinder/openjdkbinary/openjdk-test-image/hotspot/jtreg/native -cp /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/serviceability/sa/ClhsdbFindPC_no-xcomp-core.d:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/test/lib jdk.test.lib.apps.LingeredApp f51a16ce-8db7-4dd1-9bae-47f397683477.lck forceCrash
jenkins 3106 0.0 0.3 38656 30272 ? S Oct05 0:00 /usr/bin/python3 /usr/share/apport/apport 3088 6 18446744073709551615 1 3088 !home!jenkins!workspace!Grinder!openjdkbinary!j2sdk-image!bin!java
root 6087 0.0 0.0 9984 1920 pts/0 S+ 13:38 0:00 grep --color=auto jenkins
root 19468 0.0 0.1 18176 13312 ? Ss Sep13 0:00 sshd: jenkins [priv]
jenkins 19521 0.0 0.1 18816 10304 ? S Sep13 1:57 sshd: jenkins@notty
jenkins 19562 0.0 0.0 10688 2112 ? Ss Sep13 0:00 bash -c cd "/home/jenkins" && java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
jenkins 19563 0.5 4.5 4590656 381440 ? Sl Sep13 175:24 java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
root@test-osuosl-ubuntu1604-ppc64le-2:/var/log#
```
Interestingly, those hung Grinder processes are likely the ones I was using for replicating/testing https://github.com/adoptium/aqa-tests/issues/4006#issuecomment-1268330876 - the java process in the above output (PID 3088) is not responding to `kill -KILL`. I might leave it for a while to see if it disappears before triggering a reboot, given that it doesn't seem to be disrupting any other execution at the moment.
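For the record, a rough way to confirm why a PID like 3088 ignores `kill -KILL` (assuming `/proc/<pid>/stack` is enabled on this kernel, and reusing the PID from the output above):

```sh
# A task in uninterruptible (D) sleep ignores all signals, including
# SIGKILL, until the kernel operation it is blocked on completes.
grep -E '^(Name|State)' /proc/3088/status
cat /proc/3088/stack   # kernel call chain it is blocked in (root only)
```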
Rebooted.
Generated some more alerts overnight, so reopening. The -1 machine is running the same kernel (4.4.0-210-generic) and is not having the same problem; at the time of writing it has a load of zero.