Node 20.3 Crashes all the time when executed inside docker
- Node 20.3.0 crashes on start in non-bullseye docker container
- In bullseye container node-gyp-build (yarn add bufferutil) fails with Text file busy
- In bullseye container + UV_USE_IO_URING=0 everything works
Another possible ref https://github.com/electron/rebuild/pull/1085
@nodejs/libuv
"Text file busy" means trying to write a shared object or binary that's already in use.
My hunch is that node-gyp has some race condition in reading/writing files that wasn't manifesting (much) when everything still went through the much slower thread pool, whereas io_uring is fast enough to make it much more visible.
Is there a way to disable ioring using a env variable as a temporary workaround when running node-gyp?
Yes, set UV_USE_IO_URING=0 in the environment. Use at your own risk: not a stable thing, will disappear again in the future.
test-srv3.hq.lan:~# docker run -it node:20.3.0 bash
root@6f6c66eb5077:/# UV_USE_IO_URING=0 yarn add bufferutil
node[8]: ../src/node_platform.cc:68:std::unique_ptr<long unsigned int> node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start(): Assertion `(0) == (uv_thread_create(t.get(), start_thread, this))' failed.
1: 0xc8e4a0 node::Abort() [node]
2: 0xc8e51e [node]
3: 0xd0a059 node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner(int) [node]
4: 0xd0a17c node::NodePlatform::NodePlatform(int, v8::TracingController*, v8::PageAllocator*) [node]
5: 0xc4bbc4 node::V8Platform::Initialize(int) [node]
6: 0xc49408 [node]
7: 0xc497db node::Start(int, char**) [node]
8: 0x7ff94ec5818a [/lib/x86_64-linux-gnu/libc.so.6]
9: 0x7ff94ec58245 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
10: 0xba9ade _start [node]
Aborted
This is actually worse than I thought. Node doesn't run at all with 20.3
test-srv3.hq.lan:~# docker run -it node:20.3.0 node
node[1]: ../src/node_platform.cc:68:std::unique_ptr<long unsigned int> node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start(): Assertion `(0) == (uv_thread_create(t.get(), start_thread, this))' failed.
1: 0xc8e4a0 node::Abort() [node]
2: 0xc8e51e [node]
3: 0xd0a059 node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner(int) [node]
4: 0xd0a17c node::NodePlatform::NodePlatform(int, v8::TracingController*, v8::PageAllocator*) [node]
5: 0xc4bbc4 node::V8Platform::Initialize(int) [node]
6: 0xc49408 [node]
7: 0xc497db node::Start(int, char**) [node]
8: 0x7f5c393be18a [/lib/x86_64-linux-gnu/libc.so.6]
9: 0x7f5c393be245 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
10: 0xba9ade _start [node]
test-srv3.hq.lan:~# docker run -it node:20.2.0 node
Welcome to Node.js v20.2.0.
Type ".help" for more information.
>
I can reproduce this, and this is quite critical.
cc @nodejs/tsc for visibility
FWIW on two systems I have access to (a Red Hat owned RHEL 8 machine and test-digitalocean-ubuntu1804-docker-x64-1 from the Build infra) docker run -it node:20.3.0 node is fine:
root@test-digitalocean-ubuntu1804-docker-x64-1:~# docker run -it node:20.3.0 node
Unable to find image 'node:20.3.0' locally
20.3.0: Pulling from library/node
bba7bb10d5ba: Pull complete
ec2b820b8e87: Pull complete
284f2345db05: Pull complete
fea23129f080: Pull complete
9063cd8e3106: Pull complete
4b4424ee38d8: Pull complete
0b4eb4cbb822: Pull complete
43443b026dcf: Pull complete
Digest: sha256:fc738db1cbb81214be1719436605e9d7d84746e5eaf0629762aeba114aa0c28d
Status: Downloaded newer image for node:20.3.0
Welcome to Node.js v20.3.0.
Type ".help" for more information.
>
I can reproduce the assertion failure on an Ubuntu 16.04 host with node:20.3.0 but not with node:20.3.0-bullseye:
root@infra-digitalocean-ubuntu1604-x64-1:~# docker run -it node:20.3.0 node
Unable to find image 'node:20.3.0' locally
20.3.0: Pulling from library/node
bba7bb10d5ba: Pull complete
ec2b820b8e87: Pull complete
284f2345db05: Pull complete
fea23129f080: Pull complete
9063cd8e3106: Pull complete
4b4424ee38d8: Pull complete
0b4eb4cbb822: Pull complete
43443b026dcf: Pull complete
Digest: sha256:fc738db1cbb81214be1719436605e9d7d84746e5eaf0629762aeba114aa0c28d
Status: Downloaded newer image for node:20.3.0
node[1]: ../src/node_platform.cc:68:std::unique_ptr<long unsigned int> node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start(): Assertion `(0) == (uv_thread_create(t.get(), start_thread, this))' failed.
1: 0xc8e4a0 node::Abort() [node]
2: 0xc8e51e [node]
3: 0xd0a059 node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner(int) [node]
4: 0xd0a17c node::NodePlatform::NodePlatform(int, v8::TracingController*, v8::PageAllocator*) [node]
5: 0xc4bbc4 node::V8Platform::Initialize(int) [node]
6: 0xc49408 [node]
7: 0xc497db node::Start(int, char**) [node]
8: 0x7f6e8486218a [/lib/x86_64-linux-gnu/libc.so.6]
9: 0x7f6e84862245 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
10: 0xba9ade _start [node]
root@infra-digitalocean-ubuntu1604-x64-1:~# docker run -it node:20.3.0-bullseye node
Unable to find image 'node:20.3.0-bullseye' locally
20.3.0-bullseye: Pulling from library/node
93c2d578e421: Already exists
c87e6f3487e1: Already exists
65b4d59f9aba: Already exists
d7edca23d42b: Already exists
25c206b29ffe: Already exists
599134452287: Pull complete
bd8a83c4c2aa: Pull complete
d11f4613ae42: Pull complete
Digest: sha256:ceb28814a32b676bf4f6607e036944adbdb6ba7005214134deb657500b26f0d0
Status: Downloaded newer image for node:20.3.0-bullseye
Welcome to Node.js v20.3.0.
Type ".help" for more information.
>
Our website build is actually broken running apt update with the default Node.js LTS image based on Debian 12 (bookworm); we've switched to the Debian 11 (bullseye) based image for now: https://github.com/nodejs/build/issues/3382
FWIW I opened an issue about this in the docker-node repo: https://github.com/nodejs/docker-node/issues/1918
TLDR: this is not a problem with Node.js itself, but with the default base OS used by the Docker image, which was upgraded for v20.3.0.
bullseye works for me as well
Now I also get the file busy error:
test-srv3.hq.lan:~# docker run -it node:20.3.0-bullseye bash
root@ed020dd3f80e:/# yarn add bufferutil
yarn add v1.22.19
info No lockfile found.
[1/4] Resolving packages...
[2/4] Fetching packages...
[3/4] Linking dependencies...
[4/4] Building fresh packages...
error /node_modules/bufferutil: Command failed.
Exit code: 126
Command: node-gyp-build
Arguments:
Directory: /node_modules/bufferutil
Output:
/bin/sh: 1: node-gyp-build: Text file busy
EDIT: Works with UV_USE_IO_URING=0
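For anyone who needs the workaround in the meantime, here is a minimal sketch of scoping the variable to a single command instead of exporting it for the whole shell. The helper name without_io_uring is made up for illustration; UV_USE_IO_URING=0 is the undocumented switch mentioned above and may go away again.

```shell
# Hypothetical helper: run one command with io_uring disabled in libuv,
# without exporting the variable to the rest of the shell session.
without_io_uring() {
  env UV_USE_IO_URING=0 "$@"
}

# Example (inside the container): without_io_uring yarn add bufferutil
```

Scoping it this way keeps the setting limited to the install step, so it is easy to drop once it is no longer needed.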
So to summarize:
- Node 20.3.0 crashes on start in non-bullseye docker container
- In bullseye container install fails with "node-gyp-build: Text file busy"
- In bullseye container + UV_USE_IO_URING=0 everything works
Should I split the uring problem into a separate issue?
Can someone post the result of strace -o trace.log -yy -f node app.js
when it crashes with that uv_thread_create check? I expect to see a failing clone/clone2/clone3 system call but it'd be good to confirm.
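Once a trace.log exists, something like the following could narrow it down to the calls in question. This is only a sketch: find_clone_calls is a hypothetical helper name, and it simply greps for lines that contain a clone, clone2 or clone3 call.

```shell
# Hypothetical helper: print (with line numbers) every clone/clone2/clone3
# call recorded in an strace log file.
find_clone_calls() {
  grep -nE 'clone[23]?\(' "$1"
}

# Example: find_clone_calls trace.log
```

A failing call would show a negative return value and an errno name at the end of the matching line.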
On the Ubuntu 16.04 infra machine I cannot run apt in the bookworm based node:20.3.0 or node:lts containers to install strace in them (it's not there by default).
Another datapoint, adding --security-opt=seccomp:unconfined makes this work on the Ubuntu 16.04 host:
root@infra-digitalocean-ubuntu1604-x64-1:~# docker run --security-opt=seccomp:unconfined -it node:20.3.0 node
Welcome to Node.js v20.3.0.
Type ".help" for more information.
>
Right, then I can predict with near 100% certainty what the problem is: docker doesn't know about the newish clone3 system call. Its seccomp filter rejects it with some bogus error and node consequently fails when it tries to start a new thread.
This docker seccomp thing is like clockwork, it always pops up when new system calls are starting to see broader use. It's quite possibly fixed in newer versions.
Updating docker to the latest version fixed it (v24.0.2) for me.
A few notes:
- Ubuntu 22.04 LTS ships with Docker v20.x, which does not support this.
- I did not test any version in between, and I couldn't quickly identify what releases of Docker fixed it. From various comments in issues, it seems runc (a dependency of Docker) fixed it in v1.0.2.
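A rough way to check a host against this, as a sketch only: the threshold below is just the one version reported working in this thread (24.0.2), not the first release that contains the fix, and version_ge is a hypothetical helper that relies on sort -V being available.

```shell
# Hypothetical helper: succeed if version $1 is >= version $2.
# Relies on "sort -V" (version sort), which is assumed to be available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: 24.0.2 is only the version confirmed working above;
# earlier releases may already contain the fix.
known_good="24.0.2"

if command -v docker >/dev/null 2>&1; then
  # Ask the daemon for its version; empty if the daemon is unreachable.
  host_version=$(docker version --format '{{.Server.Version}}' 2>/dev/null || true)
  if [ -z "$host_version" ]; then
    echo "could not determine the Docker server version"
  elif version_ge "$host_version" "$known_good"; then
    echo "Docker $host_version is at least $known_good"
  else
    echo "Docker $host_version is older than $known_good, the version reported working here"
  fi
fi
```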
Here is what I think we should do:
- document this error and the UV_USE_IO_URING=0 solution for v20
- disable io_uring when we backport libuv in LTS lines
This seems a future-proof solution while keeping the current functionality available.
document this error and the UV_USE_IO_URING=0 solution for v20
UV_USE_IO_URING is (intentionally) undocumented and going away again so don't do that.
@bnoordhuis Would you just document this as "if you are hit by this bug, update docker"?
I think there are two different things here. I'm not sure updating docker will help with the uring problem. Or does it? Please confirm.
If I'm reading this correctly there are 2 separate issues here.
- The crash in some instances. This seems to be directly related to a bug with docker and nothing to do with io_uring.
- The Text file busy error, which might or might not be io_uring related but, at least, seems to be exacerbated by the use of io_uring.
I think we should try to better understand the 2nd issue before disabling io_uring.
The same thing we are experiencing here: https://github.com/nodejs/docker-node/issues/1912#issuecomment-1594408113
I cannot reproduce the bufferutil issue with the latest docker.
@bnoordhuis Would you just document this as "if you are hit by this bug, update docker"?
Yes.
Tagging @nodejs/tsc for visibility.
@nodejs/docker are you comfortable with this approach?
In my opinion this will cause too much disturbance. Until enterprises have had more time to upgrade docker we should:
- make sure the docker node image stays on bullseye
- disable ioring
I don't see how we could have an LTS release with the current situation.
I have tested this on the latest release of Docker Desktop on Mac, and I have the same issue described here.