undici
`fetch` can make `worker_threads` stuck and process hang
Bug Description
When undici is run inside worker_threads, it may prevent the worker from terminating and make the whole process hang.
Reproducible By
This Node script runs undici in worker threads and makes four requests in each worker before exiting. If this does not reproduce on your (fast) environment, try increasing the number of requests and the ROUNDS constant.
"undici": "^5.21.0"
// Save as `undici.mjs` and run as `node undici.mjs`
import { isMainThread, parentPort, Worker } from "node:worker_threads";
import { fileURLToPath } from "node:url";
import { cpus } from "node:os";

const THREADS = cpus().length - 1;
const ROUNDS = 10;
const filename = fileURLToPath(import.meta.url);

if (isMainThread) {
  const stuckWorkers = [];

  const interval = setInterval(() => {
    if (stuckWorkers.length > 0) {
      console.log(`There are ${stuckWorkers.length} stuck workers`);
    }
  }, 1000);

  const timeout = setTimeout(() => {
    setTimeout(() => console.log("Still running"), 5_000); // Never logged, so process.exit does something, yet it's still unable to terminate the worker running undici
    console.log("Forcing process.exit on main thread");
    process.exit();
  }, 30_000);

  const task = async () => {
    const worker = new Worker(filename, {});

    return new Promise((resolve) => {
      worker.on("message", (message) => {
        if (message !== "DONE") return;

        const timer = setTimeout(() => {
          console.log("Unable to terminate");
          stuckWorkers.push(worker);
        }, 10_000);

        worker.terminate().then(() => {
          clearTimeout(timer);
          resolve();
        });
      });
    });
  };

  const pool = Array(ROUNDS).fill(task);

  async function execute() {
    const task = pool.shift();
    if (task) {
      await task();
      return execute();
    }
  }

  await Promise.all(
    Array(THREADS)
      .fill(execute)
      .map((task) => task())
  );

  console.log("All done!");
  clearTimeout(timeout);
  clearInterval(interval);
} else {
  const { fetch } = await import("undici");

  const request = async (pathWithParams) => {
    const url = new URL(pathWithParams, "https://api.spacexdata.com");

    const timer = setTimeout(
      () => console.log("Request taking more than 10s"), // Never logged, so the requests themselves are not stuck
      10_000
    );

    const output = await fetch(url.href, {
      method: "GET",
      headers: { "Content-Type": "application/json" },
    });

    clearTimeout(timer);
    return output;
  };

  await Promise.all([
    request("/v4/starlink").then((r) => r.json()),
    request("/v4/starlink/5eed770f096e59000698560d").then((r) => r.json()),
    request("/v4/starlink").then((r) => r.json()),
    request("/v4/starlink/5eed770f096e59000698560d").then((r) => r.json()),
  ]);

  parentPort.postMessage("DONE");
}
Expected Behavior
The package should be safe to run inside worker_threads. If it's not safe, it should detect such an environment and throw an error.
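For illustration, such a guard could be as small as the following sketch (purely hypothetical, not behavior undici actually implements):

// Hypothetical guard: refuse to run off the main thread until the hang is fixed.
import { isMainThread } from "node:worker_threads";

if (!isMainThread) {
  throw new Error("This module is not yet safe to use inside worker_threads");
}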
Logs & Screenshots
ari ~/repro $ node undici.mjs
Unable to terminate
There are 1 stuck workers
...
There are 1 stuck workers
Forcing process.exit on main thread
^Z
[1]+ Stopped node undici.mjs
ari ~/repro $ pgrep node
24498
ari ~/repro $ kill 24498
ari ~/repro $ pgrep node
24498
# ^^ Ok this one is really stuck
ari ~/repro $ kill -9 24498
[1]+ Killed: 9 node undici.mjs
ari ~/repro $ pgrep node
# no output, process killed
Environment
macOS 13.2.1, Node 18.14.0
Additional context
https://github.com/vitest-dev/vitest/issues/3077
Looks like the same minimal reproduction gets stuck even with Node's built-in fetch. Should this be moved to nodejs/node issues instead, or was Node using undici under the hood?
I'm unable to reproduce: undici 5.21 + Node v18.14.0 + ROUNDS = 50.
Yep. I can confirm this happens.
Can you repro with just undici.request? Is this a fetch-specific issue?
https://www.npmjs.com/package/why-is-node-running
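For reference, a dump like the one below can be produced with a sketch along these lines (assuming the package's documented default export):

// why-is-node-running should be imported before anything else so it can hook
// handle creation; calling the exported function prints the active handles.
import whyIsNodeRunning from "why-is-node-running";

setTimeout(() => {
  whyIsNodeRunning(); // logs every handle keeping the event loop alive
}, 5_000).unref(); // unref so this timer does not itself keep the process running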
There are 10 handle(s) keeping the process running
# TLSWRAP
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069 - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068 - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262 - connect(client)
# DNSCHANNEL
node:internal/async_hooks:202
node:internal/dns/utils:80
node:internal/dns/utils:75
node:internal/dns/utils:242
node:internal/dns/utils:252
node:internal/bootstrap/loaders:333
node:internal/bootstrap/loaders:364
# TLSWRAP
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069 - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068 - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262 - connect(client)
# TLSWRAP
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069 - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068 - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262 - connect(client)
# TLSWRAP
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069 - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068 - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262 - connect(client)
# ZLIB
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789 - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406 - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1
# ZLIB
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789 - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406 - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1
# ZLIB
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789 - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406 - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1
# ZLIB
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789 - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406 - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1
@jasnell How is createBrotliDecompress keeping a ref that stops shutdown?
@KhafraDev any ideas?
Can you repro with just undici.request? Is this a fetch-specific issue?
undici.request does not cause this issue. I tested by running 100 rounds with request() successfully. The fetch() case causes the issue after 10 rounds on my laptop.
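For reference, the tested variant was essentially the repro's worker body with fetch swapped for undici.request (a sketch; requestJson is an illustrative name, not from the repro):

// Sketch: worker-side helper using undici.request; its body is a Node stream
// that exposes .json() directly.
const { request } = await import("undici");

const requestJson = async (pathWithParams) => {
  const url = new URL(pathWithParams, "https://api.spacexdata.com");
  const { body } = await request(url.href, { method: "GET" });
  return body.json();
};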
And now I can't reproduce it. 😢
I was hoping this wouldn't come up 😐. I've had this issue in the past, with no idea how to debug it or even cause it. It might be an issue with web readable streams (i.e. something like https://github.com/nodejs/node/issues/44985?), but I'm unsure.
If it is an issue with zlib, that'd make sense at least. request doesn't decompress the body, so you'd only experience it in fetch (as in this issue).
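To make that difference concrete: with request, the caller has to wire up decompression manually, roughly like this (a sketch, not a recommendation):

// Sketch: undici.request hands back the raw body, so a br-encoded response
// must be piped through a decompressor by the caller; fetch does this internally.
import { request } from "undici";
import { createBrotliDecompress } from "node:zlib";

const { headers, body } = await request("https://api.spacexdata.com/v4/starlink");
const stream =
  headers["content-encoding"] === "br" ? body.pipe(createBrotliDecompress()) : body;

const chunks = [];
for await (const chunk of stream) chunks.push(chunk);
const json = JSON.parse(Buffer.concat(chunks).toString("utf8"));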
And now I can't reproduce it. 😢
This reproduction case is a bit flaky. You may need to add more parallel requests or increase the ROUNDS constant to get workers stuck. I guess it's also possible the API rate-limits requests if you run the script repeatedly. I think the issue comes up when CPU load is high enough, so maybe add some expensive task, like zipping a file, in the background. I'm using an Apple M2 Air and can reproduce the issue easily.
Are there any workarounds available for now? How could we make the main thread force the workers to terminate when worker.terminate() and process.exit() are not enough? worker.terminate() does something to the workers, since they no longer respond to worker.postMessage calls, but they are still stuck somewhere.
It should be noted that the worker.terminate docs explicitly say that it will "[s]top all JavaScript execution in the worker thread as soon as possible". There's no guarantee that it'll actually close the worker if there is a resource (seemingly a stream from the debugging above) still ref'd.
Do you notice this issue if you run tests sequentially? We run hundreds of fetch tests in workers w/o issue, but one at a time. We noticed some issues with sockets staying open (not because of fetch) when running in parallel, but this could be a similar issue?
[...] docs explicitly say that it will "[s]top all JavaScript execution in the worker thread as soon as possible". There's no guarantee that it'll actually close the worker if there is a resource (seemingly a stream from the debugging above) still ref'd.
Yes, worker.terminate is likely working correctly. I would expect that once await fetch() has resolved, it would not leave any active resources in the background. By reaching the end of the worker file, I would expect all JavaScript to have run. But fetch seems to have some side effect here.
Also, when I replace worker_threads with child_process, this issue does not happen. And when I use worker_threads and replace undici with node-fetch, this issue does not happen either.
Do you notice this issue if you run tests sequentially?
The issue comes up even when only a single worker is used. But if the worker fires fetch calls sequentially instead of in parallel with Promise.all, it does not come up. Those sequential fetch calls also seem to be much faster than the parallel ones.
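Concretely, the sequential variant that did not hang was along these lines (a sketch, reusing the request helper from the repro above):

// Sketch: the same four requests issued one at a time instead of via Promise.all.
const paths = [
  "/v4/starlink",
  "/v4/starlink/5eed770f096e59000698560d",
  "/v4/starlink",
  "/v4/starlink/5eed770f096e59000698560d",
];
for (const path of paths) {
  await request(path).then((r) => r.json());
}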
We run hundreds of fetch tests in workers w/o issue, but one at a time.
Here is a failing test case: https://github.com/nodejs/undici/pull/2032
That test case isn't failing on my machine (both Windows and Ubuntu, using Node v19.8.1).
Sure, the point was that it fails in CI: https://github.com/nodejs/undici/actions/runs/4551655590/jobs/8026203674#step:14:12832. CI should be used as the reference instead of developer machines; it's much easier to reproduce the issue consistently that way.
Failing, yes, but it's hard to fix an issue that can't be debugged locally.
Decompressing the body has a... pretty large number of issues that I can see. I wouldn't be surprised if createBrotliDecompress is causing issues, as it's the central cause of another issue that's hard to debug. Actually, looking at the code in #2032, spacexdata is sending us a br content-encoding header too.
If we take a look at https://github.com/node-fetch/node-fetch/blob/e87b093fd678a9ea39c5b17b2a1bdfc4691eedc7/src/index.js#L277-L366 (which is how node-fetch decompresses the body), it's more or less identical to what undici is doing in this instance. I'm unsure as to what exactly is causing it to hang here, and not node-fetch, unless of course it's not an issue with zlib, but something else?
I've been experiencing this as well: random hangs that only reproduce on Linux and never on macOS. The one time I observed it interactively in a Vagrant VM, I saw 200% CPU usage and the process was completely stuck somewhere and did not even accept SIGINT via keyboard.
There were no brotli-encoded responses involved; all responses were uncompressed and came from a local Node.js server.
I wouldn't be surprised if createBrotliDecompress is causing issues
I've been trying to reproduce this issue without fetch at all, using just node:stream and node:zlib with workers. So far I haven't been able to reproduce it that way. Below is a script that reads the API's raw response from the filesystem and passes it to createBrotliDecompress:
/*
  Fetch the fixture first:
  curl --raw -H 'Accept-Encoding: br' https://api.spacexdata.com/v4/starlink > raw-response
*/
import { createBrotliDecompress } from "node:zlib";
import { pipeline } from "node:stream";
import { createReadStream } from "node:fs";
import { isMainThread, parentPort, Worker } from "node:worker_threads";
import { fileURLToPath } from "node:url";
import { cpus } from "node:os";

const THREADS = cpus().length - 1;
const ROUNDS = 100;

if (isMainThread) {
  const timeout = setTimeout(() => {
    console.log("Forcing process.exit on main thread");
    process.exit();
  }, 60_000);

  const task = async () => {
    const worker = new Worker(fileURLToPath(import.meta.url), {});
    const [promise, resolve] = getPromise();
    worker.on("message", (m) => m === "DONE" && resolve());
    await promise;

    const timer = setTimeout(() => {
      console.log("Unable to terminate");
    }, 10_000);

    await worker.terminate();
    clearTimeout(timer);
  };

  const pool = Array(ROUNDS).fill(task);

  async function execute() {
    const task = pool.shift();
    if (task) {
      await task();
      return execute();
    }
  }

  await Promise.all(
    Array(THREADS)
      .fill(execute)
      .map((task) => task())
  );

  console.log("All done!");
  clearTimeout(timeout);
} else {
  const pool = Array(10).fill(decompress);
  await Promise.all(pool.map((task) => task()));
  parentPort.postMessage("DONE");
}

async function decompress() {
  const [promise, resolve, reject] = getPromise();

  const output = pipeline(
    createReadStream("./raw-response"),
    createBrotliDecompress(),
    (err) => err && reject(err)
  );

  const chunks = [];
  output.on("readable", () => {
    let chunk;
    while (null !== (chunk = output.read())) {
      chunks.push(chunk);
    }
  });
  output.on("end", resolve);
  output.on("error", reject);

  await promise;

  const data = chunks.join("");
  const json = JSON.parse(data);
  console.assert(json != null, `JSON unexpected: ${JSON.stringify(json)}`);
}

function getPromise() {
  let resolve, reject;
  const promise = new Promise((...args) => {
    [resolve, reject] = args;
  });
  return [promise, resolve, reject];
}
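(Aside: on recent Node versions, assuming Promise.withResolvers is available, the getPromise helper could be replaced with the built-in:)

// Equivalent to getPromise() above, using the built-in resolver helper:
const { promise, resolve, reject } = Promise.withResolvers();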
it's hard to fix an issue that can't be debugged locally
I'm happy to help by providing debugging logs or similar if there is anything you'd like to see. I'm able to reproduce the original issue consistently on my local machine.
I've been looking for hanging HTTP sockets with netstat when workers get stuck, but haven't spotted anything unusual. I'm running out of ideas about what to look for next.
https://github.com/vitest-dev/vitest/issues/2008#issuecomment-1337272807 may be relevant as well; there I managed to attach a Chrome debugger via node --inspect to such a hanging process, but I'm not sure if those traces are of any help, or even whether it's the exact same issue.
It would definitely help if we could get similar traces via node --inspect.
I wonder if this is the same as https://github.com/nodejs/node/issues/47228
I'm not sure. I did observe this with Node 18. Also, that hang occurs with an idle CPU, whereas this one was with a pegged CPU.
I wonder if this is the same as nodejs/node#47228
I don't think so. That issue mentions that it does not reproduce on Node 18, while this one does. Also, even when passing the keepalive: false option to fetch, I'm still able to get the worker_threads stuck with this issue's reproduction case.
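For clarity, the keepalive variant mentioned here is just the repro's fetch call with the option added (a sketch, reusing names from the repro):

// Sketch: the repro's fetch call with keep-alive disabled; the worker still hangs.
const output = await fetch(url.href, {
  method: "GET",
  keepalive: false,
  headers: { "Content-Type": "application/json" },
});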
Can confirm that this issue occurs on both Node v17.9.0 and v19.8.1 using undici v5.21.2, and not just with fetch. Here is a smaller example to help reproduce:
index.js:
import { Worker } from "worker_threads";
import path from "path";
import { fileURLToPath } from "url";
new Worker(path.join(path.dirname(fileURLToPath(import.meta.url)), "./worker.js"));
worker.js:
import { request } from "undici";
request("https://media.tenor.com/qy_WcGdRzfgAAAAC/xluna-high-five.gif").then(res => res.body.arrayBuffer()).then(buf => Buffer.from(buf))
.then(() => ({
buffer: Buffer.alloc(10),
fileExtension: "bin"
}))
.then(() => {
process.exit();
});
Now removing process.exit fixes it:
import { request } from "undici";
request("https://media.tenor.com/qy_WcGdRzfgAAAAC/xluna-high-five.gif").then(res => res.body.arrayBuffer()).then(buf => Buffer.from(buf))
.then(() => ({
buffer: Buffer.alloc(10),
fileExtension: "bin"
}))
.then(() => {
//// comment out process.exit and the worker would not exit anymore
// process.exit();
});
I think this might be a bug in Node.js after all. @addaleax, do you have an idea why process.exit() in a worker might cause the worker to freeze?
Even worse, adding NODE_DEBUG=worker makes the worker actually terminate correctly.
What's the simplest way to override all global references to fetch?
Presumably, if we can do that, then we should be able to avoid the bug.
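One hypothetical approach is to stub out the global early in startup so any accidental use fails loudly instead of hanging (a sketch; later comments in this thread instead swapped in node-fetch):

// Sketch: overwrite globalThis.fetch before any other module captures a reference.
globalThis.fetch = () => {
  throw new Error("global fetch is disabled here; use another HTTP client");
};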
It seems we are hit pretty hard by this, with every couple of test runs flaking as a result (https://github.com/vitest-dev/vitest/issues/4415).
That issue seems unrelated; undici does not allocate any FileHandle.
That issue seems unrelated; undici does not allocate any FileHandle.
We ended up removing all use of fetch from our codebase and restricting further use by overriding globalThis.fetch, and the issue disappeared. 🤷
Yep, running into this now with fetch. I had to put this at the top of our Vitest setupTests.ts:
import fetch from "node-fetch"
globalThis.fetch = fetch
Issue is now resolved.
Hi @AriPerkkio, is this fixed in the latest Node (v21.7.1)? A web streams fix (nodejs/node#51526) has landed; it seems to have fixed this issue.
Great work @tsctx and @jasnell (from https://github.com/nodejs/node/pull/51255, :wave:)!
This issue seems to be fixed. I ran these tests using [email protected] and the following Node versions:
- ❌ v18.19.1: stuck on first run
- ✅ v21.7.1: 10/10 runs without any worker hangs, even with stress
- ✅ v21.7.0: 10/10 runs without any worker hangs
- ✅ v21.6.2: 10/10 runs without any worker hangs
- ❌ v21.5.0: stuck on first run
Looks like the fix was originally included in v21.6.2 via https://github.com/nodejs/node/pull/51255, but reworked for v21.7.1 in https://github.com/nodejs/node/pull/51526.
This also seems to fix two other node:worker_threads hangs that were not using undici at all.
I just now tested both of these with v21.7.1 and they no longer hang. With v18.19.1 the worker hang is easily reproduced:
- dompurify: https://github.com/vitest-dev/vitest/issues/3077#issuecomment-1867306736
- fastify: https://github.com/vitest-dev/vitest/issues/4956#issuecomment-1908312966
Let's close this issue, as the fix on Node's side has landed and the root cause was not in undici.