[main branch] unexplained long hang/pauses in CI runners causing build timeouts
We detected what appears to be a hang in the test suite when running .ci/scripts/test.sh -b "release" -t "apollo-server-express" "8" :
https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406
Relevant log entries:
<snip>
[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-926)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-927)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-928)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-929)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-930)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-931)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-932)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-933)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-934)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-935)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-936)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-937)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-938)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-939)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-940)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-941)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-942)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-943)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-944)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-945)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-946)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-947)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-948)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-949)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-950)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-951)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-952)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-953)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-954)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-955)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-956)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406#step-4196-log-957)[2022-03-28T07:11:49.510Z] node_tests_1 | -- required packages ["graphql@^0.12.0","[email protected]"]
[2022-03-28T07:11:49.510Z] node_tests_1 | -- installing ["graphql@^0.12.0","[email protected]"]
[2022-03-28T07:11:59.500Z] node_tests_1 | -- running test "node test/instrumentation/modules/apollo-server-express.test.js" with apollo-server-express
[2022-03-28T07:12:01.408Z] node_tests_1 | -- running preinstall "npm uninstall express-graphql" for apollo-server-express
[2022-03-28T07:12:11.426Z] node_tests_1 | npm WARN nosup Unsupported engine for [email protected]: wanted: {"node":">=12.0"} (current: {"node":"8.17.0","npm":"6.13.4"})
[2022-03-28T07:12:11.427Z] node_tests_1 | npm WARN notsup Not compatible with your version of node/npm: [email protected]
[2022-03-28T07:12:11.427Z] node_tests_1 | npm WARN notsup Unsupported engine for [email protected]: wanted: {"node":">= 10.x"} (current: {"node":"8.17.0","npm":"6.13.4"})
[2022-03-28T07:12:11.427Z] node_tests_1 | npm WARN notsup Not compatible with your version of node/npm: [email protected]
[2022-03-28T07:12:11.427Z] node_tests_1 | npm WARN optional SKIPPING OPTIONAL DEPENDENCY: fsevents@~2.3.2 (node_modules/chokidar/node_modules/fsevents):
[2022-03-28T07:12:11.427Z] node_tests_1 | npm WARN notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for [email protected]: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"x64"})
[2022-03-28T07:12:11.427Z] node_tests_1 |
[2022-03-28T07:12:11.427Z] node_tests_1 | -- required packages ["graphql@^0.12.0","[email protected]"]
[2022-03-28T07:12:11.427Z] node_tests_1 | -- installing ["graphql@^0.12.0","[email protected]"]
[2022-03-28T07:12:23.643Z] node_tests_1 | -- running test "node test/instrumentation/modules/apollo-server-express.test.js" with apollo-server-express
[2022-03-28T07:12:24.213Z] node_tests_1 | -- running preinstall "npm uninstall express-graphql" for apollo-server-express
[2022-03-28T07:12:34.199Z] node_tests_1 | npm WARN notsup Unsupported engine for [email protected]: wanted: {"node":">=12.0"} (current: {"node":"8.17.0","npm":"6.13.4"})
[2022-03-28T07:12:34.199Z] node_tests_1 | npm WARN notsup Not compatible with your version of node/npm: [email protected]
[2022-03-28T07:12:34.199Z] node_tests_1 | npm WARN notsup Unsupported engine for [email protected]: wanted: {"node":">= 10.x"} (current: {"node":"8.17.0","npm":"6.13.4"})
[2022-03-28T07:12:34.199Z] node_tests_1 | npm WARN notsup Not compatible with your version of node/npm: [email protected]
[2022-03-28T07:12:34.199Z] node_tests_1 | npm WARN optional SKIPPING OPTIONAL DEPENDENCY: fsevents@~2.3.2 (node_modules/chokidar/node_modules/fsevents):
[2022-03-28T07:12:34.199Z] node_tests_1 | npm WARN notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for [email protected]: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"x64"})
[2022-03-28T07:12:34.199Z] node_tests_1 |
[2022-03-28T07:12:34.199Z] node_tests_1 | -- required packages ["graphql@^0.12.0","[email protected]"]
[2022-03-28T07:12:34.199Z] node_tests_1 | -- installing ["graphql@^0.12.0","[email protected]"]
[2022-03-28T07:12:46.409Z] node_tests_1 | -- running test "node test/instrumentation/modules/apollo-server-express.test.js" with apollo-server-express
[2022-03-28T07:12:47.798Z] node_tests_1 | -- ok
[2022-03-28T08:48:52.336Z] Sending interrupt signal to process
[2022-03-28T08:48:58.013Z] Stopping docker_node_tests_1 ...
[2022-03-28T08:48:58.014Z] Killing docker_node_tests_1 ...
[2022-03-28T08:48:58.014Z] Killing docker_node_tests_1 ... done
[2022-03-28T08:48:58.014Z] Gracefully stopping... (press Ctrl+C again to force)
[2022-03-28T08:48:58.331Z] Sending interrupt signal to process
[2022-03-28T08:48:58.353Z] script returned exit code 2
tl;dr: I think this is better explained as "unexpected hangs/pauses in CI runner VMs".
The 1h36m hang in the above log looks like a test hang on its own, however looking at some surrounding tests makes it feel like something else is going on.
- I cannot repro the hang when running locally:
[14:42:28 trentm@pink:~/el/apm-agent-nodejs (git:main)]
% time .ci/scripts/test.sh -b "release" -t "apollo-server-express" "8"
...
node_tests_1 |
node_tests_1 | -- required packages ["graphql@^0.12.0","[email protected]"]
node_tests_1 | -- installing ["graphql@^0.12.0","[email protected]"]
node_tests_1 | -- running test "node test/instrumentation/modules/apollo-server-express.test.js" with apollo-server-express
node_tests_1 | -- ok
docker_node_tests_1 exited with code 0
Aborting on container exit...
.ci/scripts/test.sh:268: NODE_VERSION=8
.ci/scripts/test.sh:268: docker-compose --no-ansi --log-level ERROR -f .ci/docker/docker-compose-node-test.yml down -v --remove-orphans
--no-ansi option is deprecated and will be removed in future versions. Use `--ansi never` instead.
Removing docker_node_tests_1 ...
Removing docker_node_tests_1 ... done
.ci/scripts/test.sh -b "release" -t "apollo-server-express" "8" 2.94s user 0.68s system 0% cpu 36:52.35 total
-
I cannot repro the hang in the Node v8 + apollo-server-express step in CI: https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/526 completed. However there is a different step in this build 96 that is hanging: the
.ci/scripts/test.sh -b "release" -t "bluebird" "12"step: https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458 -
Our "TAV Test" stage is the longest. It typically and reliably has been taking slightly over 1h to complete. However, starting some time after Mar 23rd, that time started getting unexpectedly longer. A screenshot of the current https://apm-ci.elastic.co/job/apm-agent-nodejs/job/apm-agent-nodejs-mbp/job/main/

The preceding build 94 actually passed (because the full build run completed in 2 hr 53 min, slightly under the 3h timeout set on the build), but took an abnormally long time. Using the dev util script to list the slowest Jenkins build steps it was the jade TAV test with Node v12 that took an abnormally long time (6104897ms = ~101 minutes):
% ./dev-utils/jenkins-build-slow-steps.sh https://apm-ci.elastic.co/job/apm-agent-nodejs/job/apm-agent-nodejs-mbp/job/main/94/ | tail
1311604 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "knex" "14"
1323594 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "pg" "12"
1327722 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "apollo-server-express" "8"
1330001 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "knex" "10"
1441287 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "graphql" "14"
1465226 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "graphql" "12"
1519466 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "knex" "12"
1836015 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "apollo-server-express" "12"
1901324 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "apollo-server-express" "14"
6104897 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "jade" "12"
From the log for that build step, there was an unexplained 1h40m pause during its npm install:
...
[2022-03-25T06:39:19.419Z] node_tests_1 | npm WARN deprecated [email protected]: https://github.com/lydell/resolve-url#deprecated
[2022-03-25T06:39:19.419Z] node_tests_1 | npm WARN deprecated [email protected]: See https://github.com/lydell/source-map-url#deprecated
[2022-03-25T06:39:19.419Z] node_tests_1 | npm WARN deprecated [email protected]: Please see https://github.com/lydell/urix#deprecated
[2022-03-25T08:20:15.644Z] node_tests_1 |
[2022-03-25T08:20:15.645Z] node_tests_1 | > [email protected] install /app/node_modules/dtrace-provider
[2022-03-25T08:20:15.645Z] node_tests_1 | > node-gyp rebuild || node suppress-error.js
...
@cachedout Any ideas? Do you know if other projects have seen slow/hung/timing-out test runner runs?
That build that I mentioned in point "2." above eventually took ~122 minutes and then did complete. Its log https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458 includes:
...
[2022-03-28T20:48:50.584Z] node_tests_1 | npm WARN deprecated [email protected]: https://github.com/lydell/resolve-url#deprecated[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458#step-3036-log-113)
[2022-03-28T20:48:50.584Z] node_tests_1 | npm WARN deprecated [email protected]: See https://github.com/lydell/source-map-url#deprecated[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458#step-3036-log-114)
[2022-03-28T20:48:50.584Z] node_tests_1 | npm WARN deprecated [email protected]: Please see https://github.com/lydell/urix#deprecated[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458#step-3036-log-115)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458#step-3036-log-116)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458#step-3036-log-117)[](https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458#step-3036-log-118)
[2022-03-28T22:29:31.259Z] node_tests_1 |
[2022-03-28T22:29:31.263Z] node_tests_1 | > [email protected] install /app/node_modules/dtrace-provider
[2022-03-28T22:29:31.263Z] node_tests_1 | > node-gyp rebuild || node suppress-error.js
[2022-03-28T22:29:31.263Z] node_tests_1 |
...
which shows a 1h40m delay during the initial npm install.
That seems suspiciously similar to the "there was an unexplained 1h40m pause during its npm install" from point "3." above! The time difference in ms between the log messages before and after the pause in both of these logs are:
> t2 = new Date('2022-03-28T22:29:31.259Z')
2022-03-28T22:29:31.259Z
> t1 = new Date('2022-03-28T20:48:50.584Z')
2022-03-28T20:48:50.584Z
> t2 - t1
6040675
>
>
> t2 = new Date('2022-03-25T08:20:15.644Z')
2022-03-25T08:20:15.644Z
> t1 = new Date('2022-03-25T06:39:19.419Z')
2022-03-25T06:39:19.419Z
> t2 - t1
6056225
Pretty close. They are also close to 100 minutes expressed in ms. That could be an error timeout somewhere... perhaps in the depths of npm?
This doesn't explain the original pause or hang in the original log (for the node 8 + apollo-server-express case).
I'm running https://apm-ci.elastic.co/job/apm-agent-nodejs/job/apm-agent-nodejs-mbp/job/main/97/ to try again to see if we get a similar pause somewhere.
@cachedout Any ideas? Do you know if other projects have seen slow/hung/timing-out test runner runs?
@trentm Nothing immediately comes to mind but I did just return from paternity leave so I'm still getting my head around the current state of affairs.
@v1v Do any changes in the above time frame ring a bell for you?
@v1v Do any changes in the above time frame ring a bell for you?
I wonder if a recent docker upgrade has happened in the CI workers, I recall I read something about a recent upgrade did affect something somewhere (sorry If I'm not precise but I don't recall exactly where I read it)
If those tests uses docker, maybe that's one of the reasons...
Our "TAV Test" stage is the longest. It typically and reliably has been taking slightly over 1h to complete. However, starting some time after Mar 23rd, that time started getting unexpectedly longer.
As far as I see docker has changed recently, see the below output from docker version in one of the CI workers:
Client: Docker Engine - Community
Version: 20.10.14
API version: 1.41
Built: Thu Mar 24 01:47:57 2022
...
Server: Docker Engine - Community
Engine:
Version: 20.10.14
API version: 1.41 (minimum version 1.12)
Built: Thu Mar 24 01:45:46 2022
If those tests uses docker,
They do use docker.
Refs: https://github.com/elastic/infra/issues/35413
For what it is worth this seems to have "gone away". Looking at recent apm-agent-nodejs build durations:

- build 93 had typical times
- build 94 had an abnormally slow "TAV Test", but just passed under the 3h timeout
- build 95 "TAV Test" was also slow and hit the 3h timeout
- build 96 "TAV Test" was also slow
- The "TAV Test" stage has been fine for subsequent tests. However, the "Nightly Test*" and "RC Test*" stages were longer for some reason (possibly a different reason) for build 98.
I'm closing this as "cannot repro". We can re-open if things start failing or looking slow again. I'm not sure how we could improve observability of the build steps from our end.
Interesting, I've just enabled https://github.com/elastic/apm-agent-nodejs/pull/2629 to run a few times with a pre-docker upgrade image.
I'll keep an eye for one day and then provide further details here before closing it.
crossposting -> https://github.com/elastic/apm-agent-nodejs/pull/2629#issuecomment-1085833914
THis is happening again:

For the latest one (build 114) it was the "Test / Node.js-14-async-hooks-false" build step that had an unexplained hang. From the log:
...
[2022-04-13T05:54:51.095Z] node_tests_1 | npm WARN deprecated [email protected]: https://github.com/lydell/resolve-url#deprecated
[2022-04-13T05:54:51.095Z] node_tests_1 | npm WARN deprecated [email protected]: See https://github.com/lydell/source-map-url#deprecated
[2022-04-13T05:54:51.095Z] node_tests_1 | npm WARN deprecated [email protected]: Please see https://github.com/lydell/urix#deprecated
[2022-04-13T07:35:47.210Z] node_tests_1 |
[2022-04-13T07:35:47.211Z] node_tests_1 | > [email protected] install /app/node_modules/dtrace-provider
[2022-04-13T07:35:47.211Z] node_tests_1 | > node-gyp rebuild || node suppress-error.js
...
That delay was again (see earlier comment https://github.com/elastic/apm-agent-nodejs/issues/2624#issuecomment-1081258866) approximately 100 minutes:
> t2 = new Date('2022-04-13T07:35:47.210Z')
2022-04-13T07:35:47.210Z
> t1 = new Date('2022-04-13T05:54:51.095Z')
2022-04-13T05:54:51.095Z
> t2 - t1
6056115
> (t2 - t1) / 1000 / 60
100.93525
and in the same part of npm install. Could it be the binary build step for the 'dtrace-provider' npm package? That consistent 100 minutes suggests some hardcoded timeout.