build icon indicating copy to clipboard operation
build copied to clipboard

node-test-commit-arm is broken

Open tniessen opened this issue 3 years ago • 18 comments

It seems that node-test-commit-arm jobs never completes. This affects all PR CIs:

build time trend showing recent builds taking days

Unfortunately, the "Auto start CI" GitHub action is unaware of the issue and just silently removes the request-ci label while queuing even more jobs.

tniessen avatar Oct 02 '22 10:10 tniessen

I have brought one of the arm machines back online and its seems to be processing jobs.

mhdawson avatar Oct 02 '22 11:10 mhdawson

FWIW it looks like the bottleneck was ubuntu2004-armv7l. In addition to the machine that was offline, the container on the other machine appears to have gone down on 15 Sept:

root@test-equinix-ubuntu2004-docker-arm64-3:~# systemctl status jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
● jenkins-test-equinix-ubuntu2004_container-armv7l-2.service - Jenkins Slave in Docker for test-equinix-ubuntu2004_container-armv7l-2
     Loaded: loaded (/lib/systemd/system/jenkins-test-equinix-ubuntu2004_container-armv7l-2.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-07-26 12:27:04 UTC; 2 months 7 days ago
   Main PID: 3048135 (docker)
      Tasks: 113 (limit: 616176)
     Memory: 38.3M
     CGroup: /system.slice/jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
             └─3048135 /usr/bin/docker run --init --rm -v /home/iojs/test-equinix-ubuntu2004_container-armv7l-2/:/home/iojs -v /home/iojs/.ccache/:/home/io>

Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: Sep 15, 2022 12:46:45 AM hudson.slaves.ChannelPinger$1 onDead
Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: java.util.concurrent.TimeoutException: Ping started at 1663199205593 hasn't complet>
Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]:         at hudson.remoting.PingThread.ping(PingThread.java:134)
Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]:         at hudson.remoting.PingThread.run(PingThread.java:90)
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: Sep 15, 2022 1:46:45 AM hudson.slaves.ChannelPinger$1 onDead
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: java.util.concurrent.TimeoutException: Ping started at 1663202805597 hasn't complet>
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]:         at hudson.remoting.PingThread.ping(PingThread.java:134)
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]:         at hudson.remoting.PingThread.run(PingThread.java:90)

I've restarted the container, so we now have the two back online.

richardlau avatar Oct 02 '22 12:10 richardlau

I have brought one of the arm machines back online and its seems to be processing jobs.

I'm guessing this was another occurrence of https://github.com/nodejs/build/issues/2894?

richardlau avatar Oct 02 '22 12:10 richardlau

Thank you @mhdawson and @richardlau!

tniessen avatar Oct 02 '22 13:10 tniessen

the container on the other machine appears to have gone down on 15 Sept:

Hmmm that's interesting if it was just the container that failed.

sxa avatar Oct 03 '22 11:10 sxa

Looks like all ubuntu2004-armv7l containers are offline: image

richardlau avatar Oct 04 '22 08:10 richardlau

root@test-equinix-ubuntu2004-docker-arm64-3:~# systemctl status jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
● jenkins-test-equinix-ubuntu2004_container-armv7l-2.service - Jenkins Slave in Docker for test-equinix-ubuntu2004_container-armv7l-2
     Loaded: loaded (/lib/systemd/system/jenkins-test-equinix-ubuntu2004_container-armv7l-2.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2022-10-02 11:55:32 UTC; 1 day 20h ago
   Main PID: 2538938 (docker)
      Tasks: 104 (limit: 616176)
     Memory: 28.9M
     CGroup: /system.slice/jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
             └─2538938 /usr/bin/docker run --init --rm -v /home/iojs/test-equinix-ubuntu2004_container-armv7l-2/:/home/iojs -v /home/iojs/.ccache/:/home/io>

Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 9:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913.
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664786434284 hasn't completed by 1664790034284
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 10:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913.
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664790034286 hasn't completed by 1664793634287
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)

richardlau avatar Oct 04 '22 08:10 richardlau

root@test-equinix-ubuntu2004-docker-arm64-3:~# journalctl -u jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: WARNING: [JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913]
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.lang.NullPointerException
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.util.DirectByteBufferPool.acquire(DirectByteBuffe>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.IOHub.acquire(IOHub.java:164)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.ProtocolStack.acquire(ProtocolStack.java>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SS>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngi>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.j>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:13>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetw>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetwo>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.jav>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:126)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.lang.Thread.run(Thread.java:748)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 7:40:10 AM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: SEVERE: [JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913] Reader thr>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.lang.NullPointerException
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.util.DirectByteBufferPool.acquire(DirectByteBuffe>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.IOHub.acquire(IOHub.java:164)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.ProtocolStack.acquire(ProtocolStack.java>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SS>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngi>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.j>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:13>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetw>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetwo>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.jav>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:126)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.lang.Thread.run(Thread.java:748)
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 8:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664782834280 hasn't complet>
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 9:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664786434284 hasn't complet>
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 10:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664790034286 hasn't complet>
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)
lines 7614

richardlau avatar Oct 04 '22 08:10 richardlau

I've restarted the ubuntu2004-armv7l container on test-equinix-ubuntu2004-docker-arm64-3.

richardlau avatar Oct 04 '22 09:10 richardlau

Hmmm - so this is ONLY affecting the armv7l container on the host?

sxa avatar Oct 04 '22 09:10 sxa

AFAICT at the moment just the ubuntu2004 armv7l containers. The debian10 armv7l containers were still working.

richardlau avatar Oct 04 '22 09:10 richardlau

It looks like the NullPointerException has happened again and I'm not sure if the agent on that container is now broken again (https://ci.nodejs.org/job/node-test-commit-arm/nodes=ubuntu2004-armv7l/44015/console looks stuck).

I've restarted the container on the other Altra -- will see if the problem reoccurs while watching (virtually) the rest of this morning's NodeConf EU talks. Might trying updating the containers this afternoon to Java 11/17 (currently on 8).

richardlau avatar Oct 04 '22 09:10 richardlau

I'm seeing this immediately after resuming CI:

fatal: Couldn't find remote ref refs/heads/jenkins-node-test-commit-arm-fanned-feaab78d411c319cbbe153db7d04e9b46b2b48eb-binary-pi1p/cc-armv7

(Example build)

tniessen avatar Oct 05 '22 09:10 tniessen

@tniessen jenkins-node-test-commit-arm-fanned-feaab78d411c319cbbe153db7d04e9b46b2b48eb-binary-pi1p/cc-armv7 was deleted by https://ci.nodejs.org/job/git-delete-branches/58878/ because the https://ci.nodejs.org/job/node-test-commit-arm-fanned/27642/console job from https://ci.nodejs.org/job/node-test-pull-request/46975/ succeeded.

richardlau avatar Oct 05 '22 09:10 richardlau

@richardlau I am not sure I understand what I did wrong; this is happening in two PRs and I am pretty sure I only resumed/restarted failing builds. I guess I'll start new builds and see if that works.

tniessen avatar Oct 05 '22 09:10 tniessen

@tniessen Arm builds were backlogged heavily over the weekend due to the machines being offline and then catching up the already queued jobs. https://github.com/nodejs/node/pull/44849 appears to have at least two builds which ended up racing each other:

  • https://ci.nodejs.org/job/node-test-pull-request/46975/
  • https://ci.nodejs.org/job/node-test-pull-request/46992/ image

The arm build from https://ci.nodejs.org/job/node-test-pull-request/46975/ succeeded and caused the branch in the binary repo to be deleted while https://ci.nodejs.org/job/node-test-pull-request/46992/ was still in progress and causing the arm build there to fail. You've been resuming builds from https://ci.nodejs.org/job/node-test-pull-request/46992/ so the resumed builds think they need to resume the failed arm build but the branch no longer exists.

I could easily see this happening in other PRs during the backlog if builds had started for the same commit while earlier builds for that commit were still waiting for the offline arm machines to come back online.

richardlau avatar Oct 05 '22 10:10 richardlau

Ah, thanks for the clarification @richardlau! That makes a lot of sense :)

tniessen avatar Oct 05 '22 10:10 tniessen

I've restarted the container on the other Altra -- will see if the problem reoccurs while watching (virtually) the rest of this morning's NodeConf EU talks. Might trying updating the containers this afternoon to Java 11/17 (currently on 8).

Haven't seen the problem again. FWIW I've updated the containers to run on Java 17 now.

richardlau avatar Oct 05 '22 14:10 richardlau