Identify why Ampere altras are restarting and not booting properly
This has happened multiple times recently. For some reason it's restarting itself and not coming back. We need to identify why it's rebooting (Error condition, patching, or something else) and then see why it's not coming back (Separate test - perhaps try rebooting in an idle time and see if it comes back)
The current recovery process is to connect to the out-of-band console (details in the Equinix UI) and exit from the Shell> prompt.
I thought the problematic one was ubuntu2004_docker-arm64-1? Refs: https://github.com/nodejs/build/issues/2820#issuecomment-986960037 Refs: https://github.com/nodejs/build/issues/2835#issuecomment-1058633378
And today it looks like test-equinix-ubuntu2004_docker-arm64-2 is down 😞. Logged into the out-of-band console and it was on the UEFI CLI. Typed exit at the prompt and then selected GNU/Linux at the GRUB menu and the machine booted.
Looks like test-equinix-ubuntu2004_docker-arm64-2 is down again. It was stuck on the UEFI CLI again -- I've exited it and it's booting.
And again test-equinix-ubuntu2004_docker-arm64-2 had restarted and was stuck on the UEFI CLI.
test-equinix-ubuntu2004_docker-arm64-2 had restarted again and was stuck on the UEFI CLI. Logged into the OOB console and exited the CLI.
Noticed the containers on test-equinix-ubuntu2004_docker-arm64-2 are all down again. Logged into the OOB console and exited the UEFI CLI again.
Containers on test-equinix-ubuntu2004_docker-arm64-2 are all offline again.
(Is it too optimistic to hope the planned maintenance in https://github.com/nodejs/build/issues/2948 makes a difference? 🙂)
I suspect so ;-)
I brought it back online earlier today and will contact WorksOnArm regarding the failures.
It seems to be throwing a few of these before it dies, although it manages to recover from quite a lot of them too:
May 13 17:56:46 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23448.790563] "node" (999554) uses deprecated CP15 Barrier instruction at 0x11a4a9c
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526304] {73}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526311] {73}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526314] {73}[Hardware Error]: event severity: corrected
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526317] {73}[Hardware Error]: Error 0, type: corrected
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526324] {73}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526326] {73}[Hardware Error]: section length: 0x30
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526332] {73}[Hardware Error]: 00000000: 40000003 00000000 00400000 00462030 ...@......@.0 F.
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526336] {73}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526338] {73}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666503] {74}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666509] {74}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666512] {74}[Hardware Error]: event severity: corrected
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666515] {74}[Hardware Error]: Error 0, type: corrected
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666522] {74}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666524] {74}[Hardware Error]: section length: 0x30
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666531] {74}[Hardware Error]: 00000000: 40010003 00000000 00400000 00462030 ...@......@.0 F.
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666534] {74}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666537] {74}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879202] {75}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879208] {75}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879211] {75}[Hardware Error]: event severity: corrected
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879214] {75}[Hardware Error]: Error 0, type: corrected
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879221] {75}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879222] {75}[Hardware Error]: section length: 0x30
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879229] {75}[Hardware Error]: 00000000: 40010003 00000000 00400000 00462030 ...@......@.0 F.
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879232] {75}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879235] {75}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326137] {76}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326145] {76}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326147] {76}[Hardware Error]: event severity: corrected
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326150] {76}[Hardware Error]: Error 0, type: corrected
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326157] {76}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326158] {76}[Hardware Error]: section length: 0x30
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326166] {76}[Hardware Error]: 00000000: 40010003 00000000 00400000 00462030 ...@......@.0 F.
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326169] {76}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326172] {76}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754400] {77}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754406] {77}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754408] {77}[Hardware Error]: event severity: corrected
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754411] {77}[Hardware Error]: Error 0, type: corrected
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754418] {77}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754419] {77}[Hardware Error]: section length: 0x30
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754427] {77}[Hardware Error]: 00000000: 40010003 00000000 00400000 00462030 ...@......@.0 F.
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754430] {77}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754433] {77}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069449] {78}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069456] {78}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069458] {78}[Hardware Error]: event severity: corrected
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069461] {78}[Hardware Error]: Error 0, type: corrected
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069470] {78}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069471] {78}[Hardware Error]: section length: 0x30
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069478] {78}[Hardware Error]: 00000000: 40010003 00000000 00400000 00462030 ...@......@.0 F.
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069481] {78}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069484] {78}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552450] {79}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552457] {79}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552460] {79}[Hardware Error]: event severity: corrected
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552463] {79}[Hardware Error]: Error 0, type: corrected
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552471] {79}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552473] {79}[Hardware Error]: section length: 0x30
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552480] {79}[Hardware Error]: 00000000: 40010003 00000000 00400000 00462030 ...@......@.0 F.
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552483] {79}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552486] {79}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123337] {80}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123344] {80}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123346] {80}[Hardware Error]: event severity: corrected
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123349] {80}[Hardware Error]: Error 0, type: corrected
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123356] {80}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123357] {80}[Hardware Error]: section length: 0x30
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123364] {80}[Hardware Error]: 00000000: 40010003 00000000 00400000 00462030 ...@......@.0 F.
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123367] {80}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123370] {80}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802232] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802239] {81}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802242] {81}[Hardware Error]: event severity: corrected
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802245] {81}[Hardware Error]: Error 0, type: corrected
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802253] {81}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802254] {81}[Hardware Error]: section length: 0x30
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802262] {81}[Hardware Error]: 00000000: 40010003 00000000 00400000 00462030 ...@......@.0 F.
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802265] {81}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802267] {81}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949286] {82}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949293] {82}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949295] {82}[Hardware Error]: event severity: corrected
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949298] {82}[Hardware Error]: Error 0, type: corrected
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949306] {82}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949307] {82}[Hardware Error]: section length: 0x30
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949315] {82}[Hardware Error]: 00000000: 40010003 00000000 00400000 00462030 ...@......@.0 F.
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949318] {82}[Hardware Error]: 00000010: 00000001 0000e000 00000000 00000000 ................
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949321] {82}[Hardware Error]: 00000020: 00000000 00000003 00000000 00000000 ................
May 16 11:54:43 test-equinix-ubuntu2004-docker-arm64-2 kernel: [ 0.000000] Booting Linux on physical CPU 0x0000120000 [0x413fd0c1]
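To see whether the corrected errors speed up before the machine dies, the log above can be reduced to one entry per report. This is only a rough sketch: the function names are made up for illustration, the regular expression is written against the lines pasted here rather than any documented format, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches only the first line of each report, e.g.
# "... kernel: [23832.526304] {73}[Hardware Error]: Hardware error from
#  APEI Generic Hardware Error Source: 5"
EVENT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: \[\s*[\d.]+\] "
    r"\{(?P<seq>\d+)\}\[Hardware Error\]: Hardware error from APEI "
    r"Generic Hardware Error Source: (?P<source>\d+)"
)


def summarize_apei_events(lines, year=2022):
    """Return (sequence number, timestamp, source id) for each report.

    The year defaults to 2022, which is an assumption: the syslog
    timestamps in the pasted log do not include one.
    """
    events = []
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
            events.append((int(m["seq"]), when, int(m["source"])))
    return events


def gaps_in_minutes(events):
    """Minutes between consecutive reports, rounded to one decimal place."""
    return [
        round((later[1] - earlier[1]).total_seconds() / 60, 1)
        for earlier, later in zip(events, events[1:])
    ]
```

Feeding it the lines of the syslog file (for example via `open(path)`) gives the sequence numbers and the spacing between reports, which makes it easier to compare one crash against another.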
Both machines were offline over the weekend, stuck on the UEFI CLI https://github.com/nodejs/build/issues/2959. I've logged into the OOB console on both and exited the CLI.
It looks like one of them may not have been started after the previous maintenance window. The other one has been unreliable for us, so Equinix have provided me with a replacement, which I'm provisioning with Ubuntu 20.04 just now. It will be up as test-equinix-ubuntu2004-arm64-3 so we can migrate off the unstable one and leave it to them to analyse the fault.
The second one (-2) was offline again. I've gone into the OOB console and exited the UEFI prompt.
Rescued the second Altra again this morning.
Looks to be down again. Let's not bring it back. I've got the playbook running at the moment which will bring up the -3 machine with direct replacements (same names) as the containers on the defective -2 system.
(For anyone watching along, the firewall rules have been switched to replace -2 with -3 so there should be no risk of both machines connecting together)
@sxa , @richardlau , Request you to delete the problematic Altra server (Mt Jade under WoA) that is not used so that there is no confusion when the Equinix support team reclaims it. We need that deleted and freed for further investigation. Currently, all the 3 Mt Jade servers are showing as provisioned and active. @sxa Please confirm via response to the email dated 27th Jun w/ subject " Node.js - Works On Arm Sponsored - Stability issue". Thnx WoA Program Team
I've deleted the Altra that had ip address 139.178.85.13.
Confirmed via email
Looks like the first Altra restarted around 5 and a half hours ago and was stuck on the UEFI prompt. I've logged into the OOB console and exited.
Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt. I saw this while the machine was booting (after the prompt was exited):
[ 0.925839] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0x88500000-0x88500fff flags 0x201] vs 88500038 1000
[ 1.011928] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.018605] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.025254] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.031897] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
[ 1.039030] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.045686] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.052330] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.058972] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
Ubuntu 20.04.4 LTS test-equinix-ubuntu2004-docker-arm64-1 ttyAMA0
test-equinix-ubuntu2004-docker-arm64-1 login:
Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt. Same messages as before when booting:
[ 0.892690] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0x88500000-0x88500fff flags 0x201] vs 88500038 1000
[ 0.980799] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 0.987482] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 0.994141] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.000805] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
[ 1.008286] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.014963] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.021617] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[ 1.028270] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
Ubuntu 20.04.4 LTS test-equinix-ubuntu2004-docker-arm64-1 ttyAMA0
test-equinix-ubuntu2004-docker-arm64-1 login:
The most recent jobs before the crash seem to have been centos7-arm64-gcc6 ones, although they were listed as SUCCESS (this is from the Jenkins server log):
2022-07-16 06:08:53:620 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42769 Started by upstream project "node-test-commit-arm" build number 42,769, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-16T10:02:12Z completed in 392437ms completed: SUCCESS
2022-07-17 06:09:04:086 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42781 Started by upstream project "node-test-commit-arm" build number 42,781, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-17T10:02:14Z completed in 400434ms completed: SUCCESS
2022-07-18 06:11:07:881 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42808 Started by upstream project "node-test-commit-arm" build number 42,808, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-18T10:02:16Z completed in 523189ms completed: SUCCESS
NOTES:
The above is the output of running egrep with the pattern "test-equinix-centos7_container-arm64-2|test-equinix-ubuntu2004_sharedlibs_container-arm64-2|test-equinix-ubuntu1804_sharedlibs_container-arm64-2|test-equinix-ubuntu2004_sharedlibs_container-arm64-1|test-equinix-ubuntu1804_container-arm64-1|test-equinix-centos8_container-arm64-1|test-equinix-rhel8_container-arm64-1|test-equinix-ubuntu2004_container-armv7l-1|test-equinix-centos7_container-arm64-1|test-equinix-ubuntu2004_sharedlibs_container-arm64-3|test-equinix-ubuntu1804_sharedlibs_container-arm64-1|test-equinix-ubuntu2004_container-arm64-1|test-equinix-debian10_container-armv7l-1|test-equinix-ubuntu1804_sharedlibs_container-arm64-3" against the Jenkins log, which picks out every entry for the containers on that host.
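The same filtering can be sketched in Python, which also pulls out the node, start time, duration and result of each build. This is an illustration only: `CONTAINERS` holds just two of the names from the pattern above, the function name is made up, and the AuditLog layout is assumed from the three lines quoted in this comment.

```python
import re

# Two of the container names from the egrep pattern; extend as needed.
CONTAINERS = [
    "test-equinix-centos7_container-arm64-2",
    "test-equinix-ubuntu2004_container-armv7l-1",
]

# Layout assumed from the AuditLog lines quoted above.
AUDIT_RE = re.compile(
    r"on node (?P<node>\S+) started at (?P<started>\S+) "
    r"completed in (?P<ms>\d+)ms completed: (?P<result>\w+)"
)


def builds_on_containers(log_lines, containers=CONTAINERS):
    """Yield (node, started, duration in ms, result) for matching lines."""
    # Escape each name so it is matched literally, then join the names
    # into one alternation, as the egrep pattern does.
    host_re = re.compile("|".join(re.escape(name) for name in containers))
    for line in log_lines:
        if not host_re.search(line):
            continue
        m = AUDIT_RE.search(line)
        if m:
            yield m["node"], m["started"], int(m["ms"]), m["result"]
```

Sorting the yielded tuples by start time and looking at the last few before an outage gives the same "most recent jobs" view without reading the raw log.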
In case there are any issues specific to centos7-arm64-gcc6, I'm going to run a few rebuilds of https://ci.nodejs.org/job/node-test-commit-arm/42880, which is ONLY building that one.
Have taken the second centos7 container offline and am currently running the centos7 gcc6 job repeatedly on the "failing" Altra. I will also add the ubuntu2004-armv7l combination in future runs, as that is potentially more suspect than the others, and take test-equinix-centos7_container-arm64-2 on the other machine offline for now too.
Running as builds https://ci.nodejs.org/job/node-test-commit-arm 42988 up to 43000 which is running:
- https://ci.nodejs.org/job/node-test-commit-arm/nodes=centos7-arm64-gcc6 42988 up to 43000
And builds https://ci.nodejs.org/job/node-test-commit-arm 43001 up to 43010 which is running:
- https://ci.nodejs.org/job/node-test-commit-arm/nodes=ubuntu2004-armv7l 43001 up to 43010
It seems the issue is happening again (https://github.com/nodejs/build/issues/3022) and it has been blocking the CI for a while.
I've brought https://ci.nodejs.org/computer/test-equinix-ubuntu2004_container-armv7l-2/ back online to clear the backlog.
test-equinix-ubuntu2004-arm64-1 - 145.40.81.219 - had gone offline for the first time in a while so we'll need to re-evaluate what's going on here. That's the first outage we've had in a few weeks on that server. It's now back and so there are two executors for the ubuntu2004-armv7l jobs available again.
Had to log into the OOB console for test-equinix-ubuntu2004-arm64-1 today to exit the UEFI prompt.
Had to recover test-equinix-ubuntu2004-arm64-1 today in the usual way.
test-equinix-ubuntu2004-arm64-1 had rebooted/was stuck again today 😞. I've recovered it.
Have taken the second centos7 container offline
@sxa FYI I've brought back the second container to help process the job queue.
test-equinix-ubuntu2004-arm64-1 was stuck again and has now been recovered.