Identify why Ampere Altras are restarting and not booting properly

Open sxa opened this issue 3 years ago • 25 comments

This has happened multiple times recently. For some reason it's restarting itself and not coming back. We need to identify why it's rebooting (Error condition, patching, or something else) and then see why it's not coming back (Separate test - perhaps try rebooting in an idle time and see if it comes back)

The current recovery process is to connect to the out-of-band console (details in the Equinix UI) and exit from the Shell> prompt.
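The "reboot in an idle time and see if it comes back" test could be scripted along these lines. This is only a sketch: `ssh_port_open` and `wait_until_up` are hypothetical helper names, and treating an open SSH port as "the machine booted" is an assumption (a host stuck at the UEFI Shell> prompt would never answer on it).

```python
import socket
import time


def ssh_port_open(host, port=22, timeout_s=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_until_up(probe, deadline_s=900.0, interval_s=15.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True or deadline_s has elapsed.

    Returns the number of seconds waited, or None if the host never answered.
    clock and sleep are injectable so the loop can be exercised without waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)


# Example use after issuing a reboot (the host name is a placeholder):
#   waited = wait_until_up(lambda: ssh_port_open("altra.example"))
#   print("still down, check the OOB console" if waited is None
#         else f"back after {waited:.0f}s")
```

A `None` result is the cue to open the out-of-band console and check whether the machine is sitting at the Shell> prompt.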

sxa avatar Mar 14 '22 12:03 sxa

I thought the problematic one was ubuntu2004_docker-arm64-1? Refs: https://github.com/nodejs/build/issues/2820#issuecomment-986960037 Refs: https://github.com/nodejs/build/issues/2835#issuecomment-1058633378

richardlau avatar Mar 14 '22 12:03 richardlau

Changed the title

sxa avatar Mar 14 '22 12:03 sxa

And today it looks like test-equinix-ubuntu2004_docker-arm64-2 is down 😞. Logged into the out-of-band console and it was on the UEFI CLI. Typed exit at the prompt and then selected GNU/Linux at the GRUB menu and the machine booted.

richardlau avatar Apr 14 '22 14:04 richardlau

Looks like test-equinix-ubuntu2004_docker-arm64-2 is down again. It was stuck on the UEFI CLI again -- I've exited it and it's booting.

richardlau avatar Apr 27 '22 14:04 richardlau

And again test-equinix-ubuntu2004_docker-arm64-2 had restarted and was stuck on the UEFI CLI.

richardlau avatar Apr 29 '22 15:04 richardlau

test-equinix-ubuntu2004_docker-arm64-2 had restarted again and was stuck on the UEFI CLI. Logged into the OOB console and exited the CLI.

richardlau avatar May 09 '22 11:05 richardlau

Noticed the containers on test-equinix-ubuntu2004_docker-arm64-2 are all down again. Logged into the OOB console and exited the UEFI CLI again.

richardlau avatar May 13 '22 11:05 richardlau

Containers on test-equinix-ubuntu2004_docker-arm64-2 are all offline again.

richardlau avatar May 16 '22 11:05 richardlau

(Is it too optimistic to hope the planned maintenance makes a difference? 🙂)

richardlau avatar May 16 '22 11:05 richardlau

(Is it too optimistic to hope the planned maintenance (https://github.com/nodejs/build/issues/2948) makes a difference? 🙂)

I suspect so ;-)

I brought it back online earlier today and will contact WorksOnArm regarding the failures.

It seems to be throwing a few of these before it dies, although it manages to recover from quite a lot of them too:

May 13 17:56:46 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23448.790563] "node" (999554) uses deprecated CP15 Barrier instruction at 0x11a4a9c
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526304] {73}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526311] {73}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526314] {73}[Hardware Error]: event severity: corrected
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526317] {73}[Hardware Error]:  Error 0, type: corrected
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526324] {73}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526326] {73}[Hardware Error]:   section length: 0x30
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526332] {73}[Hardware Error]:   00000000: 40000003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526336] {73}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526338] {73}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666503] {74}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666509] {74}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666512] {74}[Hardware Error]: event severity: corrected
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666515] {74}[Hardware Error]:  Error 0, type: corrected
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666522] {74}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666524] {74}[Hardware Error]:   section length: 0x30
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666531] {74}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666534] {74}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666537] {74}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879202] {75}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879208] {75}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879211] {75}[Hardware Error]: event severity: corrected
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879214] {75}[Hardware Error]:  Error 0, type: corrected
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879221] {75}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879222] {75}[Hardware Error]:   section length: 0x30
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879229] {75}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879232] {75}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879235] {75}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326137] {76}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326145] {76}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326147] {76}[Hardware Error]: event severity: corrected
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326150] {76}[Hardware Error]:  Error 0, type: corrected
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326157] {76}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326158] {76}[Hardware Error]:   section length: 0x30
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326166] {76}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326169] {76}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326172] {76}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754400] {77}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754406] {77}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754408] {77}[Hardware Error]: event severity: corrected
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754411] {77}[Hardware Error]:  Error 0, type: corrected
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754418] {77}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754419] {77}[Hardware Error]:   section length: 0x30
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754427] {77}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754430] {77}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754433] {77}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069449] {78}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069456] {78}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069458] {78}[Hardware Error]: event severity: corrected
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069461] {78}[Hardware Error]:  Error 0, type: corrected
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069470] {78}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069471] {78}[Hardware Error]:   section length: 0x30
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069478] {78}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069481] {78}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069484] {78}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552450] {79}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552457] {79}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552460] {79}[Hardware Error]: event severity: corrected
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552463] {79}[Hardware Error]:  Error 0, type: corrected
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552471] {79}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552473] {79}[Hardware Error]:   section length: 0x30
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552480] {79}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552483] {79}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552486] {79}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123337] {80}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123344] {80}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123346] {80}[Hardware Error]: event severity: corrected
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123349] {80}[Hardware Error]:  Error 0, type: corrected
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123356] {80}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123357] {80}[Hardware Error]:   section length: 0x30
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123364] {80}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123367] {80}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123370] {80}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802232] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802239] {81}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802242] {81}[Hardware Error]: event severity: corrected
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802245] {81}[Hardware Error]:  Error 0, type: corrected
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802253] {81}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802254] {81}[Hardware Error]:   section length: 0x30
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802262] {81}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802265] {81}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802267] {81}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949286] {82}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949293] {82}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949295] {82}[Hardware Error]: event severity: corrected
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949298] {82}[Hardware Error]:  Error 0, type: corrected
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949306] {82}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949307] {82}[Hardware Error]:   section length: 0x30
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949315] {82}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949318] {82}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949321] {82}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 16 11:54:43 test-equinix-ubuntu2004-docker-arm64-2 kernel: [    0.000000] Booting Linux on physical CPU 0x0000120000 [0x413fd0c1]
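
To see how the corrected errors are spaced and how long the machine was silent before the next boot, the log above can be summarised with something like the sketch below. The function name `summarise` is made up for illustration, the year is an assumption (syslog lines carry none, so 2022 is taken from the dates in this thread), and the sample holds only the last three report headers plus the boot line.

```python
import re
from datetime import datetime

STAMP = r"^(?P<stamp>\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: "
# First line of each APEI report: uptime, {sequence number}, error source id.
HEADER = re.compile(
    STAMP + r"\[\s*(?P<uptime>\d+\.\d+)\] \{(?P<seq>\d+)\}\[Hardware Error\]: "
    r"Hardware error from APEI Generic Hardware Error Source: (?P<source>\d+)"
)
# First kernel message of a fresh boot.
BOOT = re.compile(STAMP + r"\[\s*0\.000000\] Booting Linux")


def _when(stamp, year):
    # Assumption: the year is supplied by the caller, since syslog omits it.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def summarise(lines, year=2022):
    """Return (events, boots): events are (seq, time, uptime_s) tuples."""
    events, boots = [], []
    for line in lines:
        m = HEADER.search(line)
        if m:
            events.append((int(m["seq"]), _when(m["stamp"], year), float(m["uptime"])))
            continue
        m = BOOT.search(line)
        if m:
            boots.append(_when(m["stamp"], year))
    return events, boots


SAMPLE = """\
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123337] {80}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802232] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949286] {82}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 16 11:54:43 test-equinix-ubuntu2004-docker-arm64-2 kernel: [    0.000000] Booting Linux on physical CPU 0x0000120000 [0x413fd0c1]
"""

events, boots = summarise(SAMPLE.splitlines())
# Time between consecutive corrected-error reports.
gaps = [later[1] - earlier[1] for earlier, later in zip(events, events[1:])]
# Time between the last logged error and the next boot message.
silence = boots[0] - events[-1][1]
print("gaps between reports:", [str(g) for g in gaps])
print("last report to next boot:", silence)
```

The gap between the last report and the boot line is only an upper bound on the downtime, because the kernel may have kept running for a while without logging anything further.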

sxa avatar May 17 '22 19:05 sxa

Both machines were offline over the weekend, stuck on the UEFI CLI https://github.com/nodejs/build/issues/2959. I've logged into the OOB console on both and exited the CLI.

richardlau avatar Jun 13 '22 12:06 richardlau

It looks like one of them may not have been started after the previous maintenance window. For the other one (which has been unreliable for us), Equinix have provided a replacement, which I'm provisioning with Ubuntu 20.04 just now. It will come up as test-equinix-ubuntu2004-arm64-3 so that we can migrate off the unstable machine and leave it to them to analyse the fault.

sxa avatar Jun 16 '22 10:06 sxa

The second one (-2) was offline again. I've gone into the OOB console and exited the UEFI prompt.

richardlau avatar Jun 17 '22 12:06 richardlau

Rescued the second Altra again this morning.

richardlau avatar Jun 20 '22 10:06 richardlau

Looks to be down again. Let's not bring it back. I've got the playbook running at the moment, which will bring up the -3 machine with direct replacements (same names) for the containers on the defective -2 system.

(For anyone watching along, the firewall rules have been switched to replace -2 with -3 so there should be no risk of both machines connecting together)

sxa avatar Jun 20 '22 17:06 sxa

@sxa , @richardlau , Request you to delete the problematic Altra server (Mt Jade under WoA) that is not used so that there is no confusion when the Equinix support team reclaims it. We need that deleted and freed for further investigation. Currently, all the 3 Mt Jade servers are showing as provisioned and active. @sxa Please confirm via response to the email dated 27th Jun w/ subject " Node.js - Works On Arm Sponsored - Stability issue". Thnx WoA Program Team

pgmwoa avatar Jun 29 '22 19:06 pgmwoa

I've deleted the Altra that had ip address 139.178.85.13.

richardlau avatar Jun 30 '22 12:06 richardlau

Confirmed via email

sxa avatar Jun 30 '22 14:06 sxa

Looks like the first Altra restarted around 5 and a half hours ago and was stuck on the UEFI prompt. I've logged into the OOB console and exited.

richardlau avatar Jun 30 '22 15:06 richardlau

Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt. I saw this while the machine was booting (after the prompt was exited):

[    0.925839] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0x88500000-0x88500fff flags 0x201] vs 88500038 1000
[    1.011928] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.018605] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.025254] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.031897] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
[    1.039030] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.045686] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.052330] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.058972] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12

Ubuntu 20.04.4 LTS test-equinix-ubuntu2004-docker-arm64-1 ttyAMA0

test-equinix-ubuntu2004-docker-arm64-1 login:
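
For reference, the `ret: -12` in the cma_alloc lines is a negated errno value, and it can be decoded as in the short sketch below. The 4 KiB page size used to convert the request size is an assumption about this kernel's configuration, not something shown in the console output.

```python
import errno
import os
import re

CMA_FAIL = re.compile(
    r"cma_alloc: alloc failed, req-size: (?P<pages>\d+) pages, ret: (?P<ret>-?\d+)"
)
PAGE_SIZE = 4096  # assumption: 4 KiB pages

line = "[    1.011928] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12"
m = CMA_FAIL.search(line)
code = -int(m["ret"])                      # kernel functions return -errno
name = errno.errorcode[code]               # symbolic name for the errno value
request_kib = int(m["pages"]) * PAGE_SIZE // 1024
print(f"{name} ({os.strerror(code)}), request of {request_kib} KiB")
```

So each line reports an ENOMEM from the contiguous memory allocator; whether that is related to the restarts is not established here.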

richardlau avatar Jul 12 '22 14:07 richardlau

Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt. Same messages as before when booting:

[    0.892690] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0x88500000-0x88500fff flags 0x201] vs 88500038 1000
[    0.980799] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    0.987482] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    0.994141] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.000805] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
[    1.008286] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.014963] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.021617] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.028270] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12

Ubuntu 20.04.4 LTS test-equinix-ubuntu2004-docker-arm64-1 ttyAMA0

test-equinix-ubuntu2004-docker-arm64-1 login:

richardlau avatar Jul 18 '22 12:07 richardlau

The most recent jobs before the crash seem to have been centos7-arm64-gcc6 ones, although they were listed as SUCCESS (this is from the Jenkins server log):

2022-07-16 06:08:53:620 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42769 Started by upstream project "node-test-commit-arm" build number 42,769, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-16T10:02:12Z completed in 392437ms completed: SUCCESS
2022-07-17 06:09:04:086 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42781 Started by upstream project "node-test-commit-arm" build number 42,781, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-17T10:02:14Z completed in 400434ms completed: SUCCESS
2022-07-18 06:11:07:881 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42808 Started by upstream project "node-test-commit-arm" build number 42,808, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-18T10:02:16Z completed in 523189ms completed: SUCCESS

NOTES: The above comes from running egrep with the pattern "test-equinix-centos7_container-arm64-2|test-equinix-ubuntu2004_sharedlibs_container-arm64-2|test-equinix-ubuntu1804_sharedlibs_container-arm64-2|test-equinix-ubuntu2004_sharedlibs_container-arm64-1|test-equinix-ubuntu1804_container-arm64-1|test-equinix-centos8_container-arm64-1|test-equinix-rhel8_container-arm64-1|test-equinix-ubuntu2004_container-armv7l-1|test-equinix-centos7_container-arm64-1|test-equinix-ubuntu2004_sharedlibs_container-arm64-3|test-equinix-ubuntu1804_sharedlibs_container-arm64-1|test-equinix-ubuntu2004_container-arm64-1|test-equinix-debian10_container-armv7l-1|test-equinix-ubuntu1804_sharedlibs_container-arm64-3" against the Jenkins log, which picks out every entry for the containers on that host.
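
If it helps to go one step beyond the egrep filter, a sketch like the following pulls the job, node, start time and duration out of each AuditLog entry so the last job to finish before a crash is easy to spot. The field layout is inferred from the three lines quoted above, `parse_audit` is a hypothetical name, and `NODES` holds only a subset of the agent names for brevity.

```python
import re
from datetime import datetime, timedelta

# Subset of the agent names from the egrep pattern; extend with the full list.
NODES = [
    "test-equinix-centos7_container-arm64-2",
    "test-equinix-ubuntu2004_container-armv7l-1",
    "test-equinix-rhel8_container-arm64-1",
]
NODE_FILTER = re.compile("|".join(re.escape(n) for n in NODES))

# Layout assumed from the quoted AuditLog lines.
AUDIT = re.compile(
    r"» (?P<job>\S+) #(?P<build>\d+) .* on node (?P<node>\S+) "
    r"started at (?P<start>\S+) completed in (?P<ms>\d+)ms completed: (?P<result>\w+)"
)


def parse_audit(lines):
    """Yield one dict per audit entry that ran on a listed node."""
    for line in lines:
        m = AUDIT.search(line)
        # Unlike egrep, this matches the node field exactly, not anywhere in the line.
        if not m or not NODE_FILTER.fullmatch(m["node"]):
            continue
        start = datetime.strptime(m["start"], "%Y-%m-%dT%H:%M:%SZ")
        yield {
            "job": m["job"],
            "build": int(m["build"]),
            "node": m["node"],
            "start": start,
            "end": start + timedelta(milliseconds=int(m["ms"])),
            "result": m["result"],
        }


SAMPLE = [
    '2022-07-16 06:08:53:620 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42769 Started by upstream project "node-test-commit-arm" build number 42,769, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-16T10:02:12Z completed in 392437ms completed: SUCCESS',
    '2022-07-18 06:11:07:881 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42808 Started by upstream project "node-test-commit-arm" build number 42,808, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-18T10:02:16Z completed in 523189ms completed: SUCCESS',
]

entries = list(parse_audit(SAMPLE))
last = max(entries, key=lambda e: e["end"])
print(last["job"], last["build"], last["node"], last["end"].isoformat(), last["result"])
```

The computed end time is in UTC (from the `Z` suffix on the start time), so it needs care when compared with the leading log timestamp, which appears to be in the server's local time.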

In case there are any issues specific to centos7-arm64-gcc6, I'm going to run a few rebuilds of https://ci.nodejs.org/job/node-test-commit-arm/42880, which is ONLY building that one.

sxa avatar Jul 22 '22 13:07 sxa

Have taken the second centos7 container offline and am currently running the centos7 gcc6 job repeatedly on the "failing" Altra. I will also add the ubuntu2004-armv7l combination in future runs, as that is potentially more suspect than the others, and bring test-equinix-centos7_container-arm64-2 from the other machine offline for now too.

Running as builds https://ci.nodejs.org/job/node-test-commit-arm 42988 up to 43000 which is running:

  • https://ci.nodejs.org/job/node-test-commit-arm/nodes=centos7-arm64-gcc6 42988 up to 43000

And builds https://ci.nodejs.org/job/node-test-commit-arm 43001 up to 43010 which is running:

  • https://ci.nodejs.org/job/node-test-commit-arm/nodes=ubuntu2004-armv7l 43001 up to 43010

sxa avatar Jul 28 '22 12:07 sxa

It seems the issue is happening again (https://github.com/nodejs/build/issues/3022); it has been blocking the CI for a while.

joyeecheung avatar Aug 30 '22 05:08 joyeecheung

I've brought https://ci.nodejs.org/computer/test-equinix-ubuntu2004_container-armv7l-2/ back online to clear the backlog.

test-equinix-ubuntu2004-arm64-1 - 145.40.81.219 - had gone offline for the first time in a while so we'll need to re-evaluate what's going on here. That's the first outage we've had in a few weeks on that server. It's now back and so there are two executors for the ubuntu2004-armv7l jobs available again.

sxa avatar Aug 30 '22 09:08 sxa

Had to log into the OOB console for test-equinix-ubuntu2004-arm64-1 today to exit the UEFI prompt.

richardlau avatar Oct 17 '22 16:10 richardlau

Had to recover test-equinix-ubuntu2004-arm64-1 today in the usual way.

richardlau avatar Nov 01 '22 12:11 richardlau

test-equinix-ubuntu2004-arm64-1 had rebooted/was stuck again today 😞. I've recovered it.

richardlau avatar Nov 02 '22 12:11 richardlau

Have taken the second centos7 container offline

@sxa FYI I've brought back the second container to help process the job queue.

richardlau avatar Nov 02 '22 12:11 richardlau

test-equinix-ubuntu2004-arm64-1 was stuck again and has now been recovered.

richardlau avatar Nov 24 '22 21:11 richardlau