investigate offline RHEL 8 PPC64LE instances
Only one of our test RHEL 8 PPC64LE instances is online.
I can't ssh into any of the three offline instances (connection times out).
According to build history the last jobs that ran on the three offline hosts are:
| host | job | time since |
|---|---|---|
| test-osuosl-rhel8-ppc64_le-1 | https://ci.nodejs.org/job/citgm-smoker-nobuild/nodes=rhel8-ppc64le/1194/ | 7 days 13 hours |
| test-osuosl-rhel8-ppc64_le-2 | https://ci.nodejs.org/job/citgm-smoker/nodes=rhel8-ppc64le/2913/ | 15 days |
| test-osuosl-rhel8-ppc64_le-3 | https://ci.nodejs.org/job/citgm-smoker/nodes=rhel8-ppc64le/2914/ | 15 days |
All of the jobs exited with the Jenkins agent disconnecting. The "time since" roughly corresponds to the "last check-in" time for these instances in the Red Hat Customer Portal.
FWIW test-osuosl-rhel8-ppc64_le-4, the one online instance, has run CITGM jobs over the last week and remains online.
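For spotting offline agents without clicking through the Jenkins UI, something like the following might help. It is only a sketch: it assumes the standard Jenkins remote API payload from `/computer/api/json` (a `computer` list whose entries carry `displayName` and `offline`), and the URL constant, function names, and default name filter are made up for illustration. Access may also need credentials, which this does not handle.

```python
import json
import urllib.request

# Assumed endpoint: the standard Jenkins remote API path on our CI host.
JENKINS_COMPUTER_API = "https://ci.nodejs.org/computer/api/json"


def offline_agents(payload, name_filter="rhel8-ppc64_le"):
    """Return sorted names of agents matching name_filter that Jenkins reports as offline.

    payload is the decoded JSON from /computer/api/json.
    """
    return sorted(
        computer["displayName"]
        for computer in payload.get("computer", [])
        if name_filter in computer.get("displayName", "") and computer.get("offline")
    )


def fetch_computers(url=JENKINS_COMPUTER_API, timeout=30):
    """Download and decode the agent list (hypothetical helper, no auth handling)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `offline_agents(fetch_computers())` would then list the affected hosts, assuming the endpoint is reachable anonymously.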
Trying a soft reboot in the OpenStack UI for test-osuosl-rhel8-ppc64_le-3.
Soft rebooting appears to have brought test-osuosl-rhel8-ppc64_le-3 back online in Jenkins and I can ssh into it 🎉. Going to soft reboot the other two and then rerun our Ansible scripts on them.
All four test instances are now online and updated.
Noticed that test-osuosl-rhel8-ppc64_le-3 and test-osuosl-rhel8-ppc64_le-4 are offline again and I'm unable to ssh into them.
Again the build history (test-osuosl-rhel8-ppc64_le-3 and test-osuosl-rhel8-ppc64_le-4) indicates that the last job to run on these was citgm-smoker and the Red Hat Customer Portal indicates that they last checked in around the same time frame (6-7 days ago).
Worth noting that test-osuosl-rhel8-ppc64_le-4 wasn't one of the ones that went offline the first time we noticed this and has citgm-smoker runs in the build history that didn't cause it to remain offline 😕.
I've soft rebooted the two offline instances and they're back. Still not sure what's causing them to get into whatever state they end up in 😞.
test-osuosl-rhel8-ppc64_le-4 is offline again. Same scenario where the last job that ran on the machine was a CITGM job and we can no longer ssh into the machine.
(note there was a CITGM run the day before which didn't put the machine into this state.)
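When ssh stops working, it can be useful to tell a timeout (host hung or unreachable) apart from a refused connection (host up, sshd down). A rough sketch using only the standard library is below; the function name and return labels are invented for illustration, and it only probes the TCP port rather than performing an ssh login.

```python
import socket


def ssh_port_state(host, port=22, timeout=5.0):
    """Classify a TCP connection attempt as 'open', 'timeout', 'refused' or 'error'."""
    try:
        # create_connection resolves the host and applies the timeout to connect().
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "refused"
    except OSError:
        # Anything else, e.g. DNS failure or no route to host.
        return "error"
```

For example, `ssh_port_state("test-osuosl-rhel8-ppc64_le-4")` would need the real hostname or IP address for the machine, which is not given here.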
I'm seeing this in the OpenStack console for this machine, which I assume is not good:
Soft rebooted test-osuosl-rhel8-ppc64_le-4.
Happened again today. All 4 test machines were offline. Soft rebooting now.
After soft reboot they have come back and are building.
I've started a CITGM run on one of the RHEL 8 PPC64LE machines with CITGM's verbose loglevel set to silly to see if we get any clues as to how far into CITGM we got (assuming the problem recreates):
https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/2949/nodes=rhel8-ppc64le/
Problem didn't recreate (job completed, machine still online). Trying again: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/2950/nodes=rhel8-ppc64le/
Still haven't been able to recreate the machine going offline with that run nor these: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/2951/nodes=rhel8-ppc64le/ https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/2952/nodes=rhel8-ppc64le/
The release machine went offline yesterday and had to be soft rebooted https://github.com/nodejs/build/issues/2989. Might be the same issue as we've had with the test machines, although we do not run CITGM on the release machine.