investigate offline RHEL 8 PPC64LE instances

Open richardlau opened this issue 2 years ago • 15 comments

Only one of our test RHEL 8 PPC64LE instances is online: image

richardlau avatar Apr 27 '22 11:04 richardlau

I can't ssh into any of the three offline instances (connection times out).
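A connection timeout like this can be checked for several hosts at once with a small probe. The following is only a minimal sketch: it tests whether TCP port 22 accepts a connection, which is weaker than a real ssh login. The entries in `HOSTS` are the Jenkins node names used as placeholders; the actual addresses of the machines are not given in this thread and would have to be substituted.

```python
import socket


def is_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


def check_hosts(hosts, port: int = 22, timeout: float = 5.0) -> dict:
    """Map each host to whether its SSH port accepted a TCP connection."""
    return {host: is_port_open(host, port, timeout) for host in hosts}


# Assumption: placeholders only. These are Jenkins node names, not
# necessarily resolvable hostnames; replace them with real addresses.
HOSTS = [f"test-osuosl-rhel8-ppc64_le-{n}" for n in range(1, 5)]
```

Calling `check_hosts(HOSTS)` returns a dictionary of host to reachability, so a host that times out shows up as `False` rather than raising.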

According to build history the last jobs that ran on the three offline hosts are:

| host | job | time since |
| --- | --- | --- |
| test-osuosl-rhel8-ppc64_le-1 | https://ci.nodejs.org/job/citgm-smoker-nobuild/nodes=rhel8-ppc64le/1194/ | 7 days 13 hours |
| test-osuosl-rhel8-ppc64_le-2 | https://ci.nodejs.org/job/citgm-smoker/nodes=rhel8-ppc64le/2913/ | 15 days |
| test-osuosl-rhel8-ppc64_le-3 | https://ci.nodejs.org/job/citgm-smoker/nodes=rhel8-ppc64le/2914/ | 15 days |

All of the jobs exited with the Jenkins agent disconnecting. The "time since" roughly corresponds to the "last check-in" time for these instances in the Red Hat Customer Portal: image

FWIW test-osuosl-rhel8-ppc64_le-4, the one online instance, has run CITGM jobs over the last week and remains online.

richardlau avatar Apr 27 '22 12:04 richardlau

Trying a soft reboot in the OpenStack UI for test-osuosl-rhel8-ppc64_le-3.

richardlau avatar Apr 27 '22 12:04 richardlau

Soft rebooting appears to have brought test-osuosl-rhel8-ppc64_le-3 back online in Jenkins and I can ssh into it 🎉. Going to soft reboot the other two and then rerun our Ansible scripts on them.
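The reboots in this thread were done through the OpenStack UI. As a rough alternative sketch, the same soft reboot could be scripted around the `openstack server reboot --soft` CLI command. This assumes the OpenStack client is installed and credentials for the OSUOSL project are already configured, and that the server names in OpenStack match the Jenkins node names, none of which is confirmed here. The helper names are hypothetical.

```python
import subprocess


def soft_reboot_command(server: str, wait: bool = True) -> list:
    """Build the OpenStack CLI argument list for a soft reboot of one server."""
    cmd = ["openstack", "server", "reboot", "--soft"]
    if wait:
        # --wait blocks until the reboot has completed.
        cmd.append("--wait")
    cmd.append(server)
    return cmd


def soft_reboot(server: str, wait: bool = True) -> None:
    """Run the soft reboot; raises CalledProcessError if the CLI fails."""
    subprocess.run(soft_reboot_command(server, wait), check=True)
```

Keeping command construction separate from execution lets the argument list be inspected or logged before anything is actually rebooted.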

richardlau avatar Apr 27 '22 12:04 richardlau

All four test instances are now online and updated.

richardlau avatar Apr 27 '22 14:04 richardlau

Noticed that test-osuosl-rhel8-ppc64_le-3 and test-osuosl-rhel8-ppc64_le-4 are offline again and I'm unable to ssh into them.

image

Again the build history (test-osuosl-rhel8-ppc64_le-3 and test-osuosl-rhel8-ppc64_le-4) indicates that the last job to run on these was citgm-smoker and the Red Hat Customer Portal indicates that they last checked in around the same time frame (6-7 days ago). image

richardlau avatar May 09 '22 11:05 richardlau

Worth noting that test-osuosl-rhel8-ppc64_le-4 wasn't one of the ones that went offline the first time we noticed this and has citgm-smoker runs in the build history that didn't cause it to remain offline 😕.

richardlau avatar May 09 '22 11:05 richardlau

I've soft rebooted the two offline instances and they're back. Still not sure what's causing them to get into whatever state they end up in 😞.

richardlau avatar May 09 '22 12:05 richardlau

test-osuosl-rhel8-ppc64_le-4 is offline again. Same scenario where the last job that ran on the machine was a CITGM job and we can no longer ssh into the machine.

image

(Note there was a CITGM run the day before which didn't put the machine into this state.)

I'm seeing this in the OpenStack console for this machine, which I assume is not good: image

richardlau avatar May 13 '22 11:05 richardlau

Soft rebooted test-osuosl-rhel8-ppc64_le-4.

richardlau avatar May 13 '22 11:05 richardlau

Happened again today. All 4 test machines were offline. Soft rebooting now.

mhdawson avatar Jun 03 '22 13:06 mhdawson

After soft reboot they have come back and are building.

mhdawson avatar Jun 03 '22 13:06 mhdawson

I've started a CITGM run on one of the RHEL 8 PPC64LE machines with CITGM's verbose loglevel set to silly to see if we get any clues as to how far into CITGM the run gets (assuming the problem recreates): https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/2949/nodes=rhel8-ppc64le/

richardlau avatar Jun 06 '22 16:06 richardlau

Problem didn't recreate (job completed, machine still online). Trying again: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/2950/nodes=rhel8-ppc64le/

richardlau avatar Jun 06 '22 18:06 richardlau

> Problem didn't recreate (job completed, machine still online). Trying again: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/2950/nodes=rhel8-ppc64le/

Still haven't been able to recreate the machine going offline with that run nor these: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/2951/nodes=rhel8-ppc64le/ https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/2952/nodes=rhel8-ppc64le/

richardlau avatar Jun 07 '22 16:06 richardlau

The release machine went offline yesterday and had to be soft rebooted https://github.com/nodejs/build/issues/2989. Might be the same issue as we've had with the test machines, although we do not run CITGM on the release machine.

richardlau avatar Jul 06 '22 14:07 richardlau

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

github-actions[bot] avatar May 03 '23 00:05 github-actions[bot]