undici
`fetch` can make `worker_threads` stuck and process hang
Bug Description
When undici is run inside worker_threads, it may prevent the worker from terminating and make the whole process hang.
Reproducible By
This Node script runs undici in worker threads and makes four requests in each worker before exiting. If this does not reproduce on your (fast) environment, try increasing the number of requests and the ROUNDS constant.
"undici": "^5.21.0"
// Save as `undici.mjs` and run as `node undici.mjs`
import { isMainThread, parentPort, Worker } from "node:worker_threads";
import { fileURLToPath } from "node:url";
import { cpus } from "node:os";

const THREADS = cpus().length - 1;
const ROUNDS = 10;
const filename = fileURLToPath(import.meta.url);

if (isMainThread) {
  const stuckWorkers = [];

  const interval = setInterval(() => {
    if (stuckWorkers.length > 0) {
      console.log(`There are ${stuckWorkers.length} stuck workers`);
    }
  }, 1000);

  const timeout = setTimeout(() => {
    setTimeout(() => console.log("Still running"), 5_000); // Never logged, so process.exit does something, yet it's still unable to terminate the worker running undici
    console.log("Forcing process.exit on main thread");
    process.exit();
  }, 30_000);

  const task = async () => {
    const worker = new Worker(filename, {});

    return new Promise((resolve) => {
      worker.on("message", (message) => {
        if (message !== "DONE") return;

        const timer = setTimeout(() => {
          console.log("Unable to terminate");
          stuckWorkers.push(worker);
        }, 10_000);

        worker.terminate().then(() => {
          clearTimeout(timer);
          resolve();
        });
      });
    });
  };

  const pool = Array(ROUNDS).fill(task);

  async function execute() {
    const task = pool.shift();
    if (task) {
      await task();
      return execute();
    }
  }

  await Promise.all(
    Array(THREADS)
      .fill(execute)
      .map((task) => task())
  );

  console.log("All done!");
  clearTimeout(timeout);
  clearInterval(interval);
} else {
  const { fetch } = await import("undici");

  const request = async (pathWithParams) => {
    const url = new URL(pathWithParams, "https://api.spacexdata.com");

    const timer = setTimeout(
      () => console.log("Request taking more than 10s"), // Never logged, so the requests themselves are not stuck
      10_000
    );

    const output = await fetch(url.href, {
      method: "GET",
      headers: { "Content-Type": "application/json" },
    });

    clearTimeout(timer);
    return output;
  };

  await Promise.all([
    request("/v4/starlink").then((r) => r.json()),
    request("/v4/starlink/5eed770f096e59000698560d").then((r) => r.json()),
    request("/v4/starlink").then((r) => r.json()),
    request("/v4/starlink/5eed770f096e59000698560d").then((r) => r.json()),
  ]);

  parentPort.postMessage("DONE");
}
Expected Behavior
The package should be safe to run inside worker_threads. If it's not safe, it should detect such an environment and throw an error.
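For illustration, such a guard could be as small as the following sketch (purely hypothetical, not behavior undici actually implements):

// Hypothetical guard: refuse to run off the main thread until the hang is fixed.
import { isMainThread } from "node:worker_threads";

if (!isMainThread) {
  throw new Error("This module is not yet safe to use inside worker_threads");
}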
Logs & Screenshots
ari ~/repro $ node undici.mjs
Unable to terminate
There are 1 stuck workers
...
There are 1 stuck workers
Forcing process.exit on main thread
^Z
[1]+ Stopped node undici.mjs
ari ~/repro $ pgrep node
24498
ari ~/repro $ kill 24498
ari ~/repro $ pgrep node
24498
# ^^ Ok this one is really stuck
ari ~/repro $ kill -9 24498
[1]+ Killed: 9 node undici.mjs
ari ~/repro $ pgrep node
# no output, process killed
Environment
macOS 13.2.1, Node 18.14.0
Additional context
https://github.com/vitest-dev/vitest/issues/3077
Looks like the same minimal reproduction gets stuck even with Node's built-in fetch. Should this be moved to nodejs/node issues instead, or was Node using undici under the hood?
I'm unable to reproduce: undici 5.21 + Node v18.14.0 + ROUNDS = 50.
Yep. I can confirm this happens.
Can you repro with just undici.request? Is this a fetch-specific issue?
https://www.npmjs.com/package/why-is-node-running
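For reference, a dump like the one below can be produced with a sketch along these lines (assuming the package's documented default export):

// why-is-node-running should be imported before anything else so it can hook
// handle creation; calling the exported function prints the active handles.
import whyIsNodeRunning from "why-is-node-running";

setTimeout(() => {
  whyIsNodeRunning(); // logs every handle keeping the event loop alive
}, 5_000).unref(); // unref so this timer does not itself keep the process running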
There are 10 handle(s) keeping the process running
# TLSWRAP
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069 - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068 - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262 - connect(client)
# DNSCHANNEL
node:internal/async_hooks:202
node:internal/dns/utils:80
node:internal/dns/utils:75
node:internal/dns/utils:242
node:internal/dns/utils:252
node:internal/bootstrap/loaders:333
node:internal/bootstrap/loaders:364
# TLSWRAP
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069 - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068 - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262 - connect(client)
# TLSWRAP
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069 - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068 - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262 - connect(client)
# TLSWRAP
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069 - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068 - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262 - connect(client)
# ZLIB
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789 - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406 - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1
# ZLIB
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789 - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406 - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1
# ZLIB
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789 - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406 - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1
# ZLIB
node:internal/async_hooks:202
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789 - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406 - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1
@jasnell How is createBrotliDecompress keeping a ref that stops shutdown?
@KhafraDev any ideas?
Can you repro with just undici.request? Is this a fetch-specific issue?
undici.request does not cause this issue. I tested by running 100 rounds with request() successfully. The fetch() case causes the issue after 10 rounds on my laptop.
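For reference, the tested variant was essentially the repro's worker body with fetch swapped for undici.request (a sketch; requestJson is an illustrative name, not from the repro):

// Sketch: worker-side helper using undici.request; its body is a Node stream
// that exposes .json() directly.
const { request } = await import("undici");

const requestJson = async (pathWithParams) => {
  const url = new URL(pathWithParams, "https://api.spacexdata.com");
  const { body } = await request(url.href, { method: "GET" });
  return body.json();
};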
And now I can't reproduce it. 😢
I was hoping this wouldn't come up 😐. I've had this issue in the past, with no idea how to debug it or even cause it. It might be an issue with web readable streams (i.e. something like https://github.com/nodejs/node/issues/44985?), but I'm unsure.
If it is an issue with zlib, that'd make sense at least. request doesn't decompress the body, so you'd only experience it in fetch (as in this issue).
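To make that difference concrete: with request, the caller has to wire up decompression manually, roughly like this (a sketch, not a recommendation):

// Sketch: undici.request hands back the raw body, so a br-encoded response
// must be piped through a decompressor by the caller; fetch does this internally.
import { request } from "undici";
import { createBrotliDecompress } from "node:zlib";

const { headers, body } = await request("https://api.spacexdata.com/v4/starlink");
const stream =
  headers["content-encoding"] === "br" ? body.pipe(createBrotliDecompress()) : body;

const chunks = [];
for await (const chunk of stream) chunks.push(chunk);
const json = JSON.parse(Buffer.concat(chunks).toString("utf8"));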
And now I can't reproduce it. 😢
This reproduction case is a bit flaky. You may need to add more parallel requests or increase the ROUNDS constant to get workers stuck. I guess it's also possible the API rate-limits requests if you run the script repeatedly. I think the issue comes up when CPU load is high enough, so maybe add some expensive task, like zipping a file, in the background. I'm using an Apple M2 Air and can reproduce the issue easily.
Are there any workarounds available for now? How could we make the main thread force the workers to terminate when worker.terminate() and process.exit() are not enough? worker.terminate() does something to the workers, since they no longer respond to worker.postMessage calls, but they are still stuck somewhere.
It should be noted that the worker.terminate docs explicitly say that it will "[s]top all JavaScript execution in the worker thread as soon as possible". There's no guarantee that it'll actually close the worker if there is a resource (seemingly a stream from the debugging above) still ref'd.
Do you notice this issue if you run tests sequentially? We run hundreds of fetch tests in workers w/o issue, but one at a time. We noticed some issues with sockets staying open (not because of fetch) when running in parallel, but this could be a similar issue?
[...] docs explicitly say that it will "[s]top all JavaScript execution in the worker thread as soon as possible". There's no guarantee that it'll actually close the worker if there is a resource (seemingly a stream from the debugging above) still ref'd.
Yes, worker.terminate is likely working correctly. I would expect that once await fetch() has resolved, it would not leave any active resources in the background. By reaching the end of the worker file, I would expect all JavaScript to have run. But fetch seems to have some side effect here.
Also, when I replace worker_threads with child_process, this issue does not happen. And when I use worker_threads and replace undici with node-fetch, this issue does not happen either.
Do you notice this issue if you run tests sequentially?
The issue comes up even when only a single worker is used. But if the worker fires fetch calls sequentially instead of in parallel with Promise.all, it does not come up. Those sequential fetch calls also seem to be much faster than the parallel ones.
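Concretely, the sequential variant that did not hang was along these lines (a sketch, reusing the request helper from the repro above):

// Sketch: the same four requests issued one at a time instead of via Promise.all.
const paths = [
  "/v4/starlink",
  "/v4/starlink/5eed770f096e59000698560d",
  "/v4/starlink",
  "/v4/starlink/5eed770f096e59000698560d",
];
for (const path of paths) {
  await request(path).then((r) => r.json());
}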
We run hundreds of fetch tests in workers w/o issue, but one at a time.
Here is a failing test case: https://github.com/nodejs/undici/pull/2032
That test case isn't failing on my machine (both Windows and Ubuntu, using Node v19.8.1).
Sure, the point was that it fails in CI: https://github.com/nodejs/undici/actions/runs/4551655590/jobs/8026203674#step:14:12832. CI should be used as the reference instead of developer machines; it's much easier to reproduce the issue consistently that way.
Failing, yes, but it's hard to fix an issue that can't be debugged locally.
Decompressing the body has a... pretty large number of issues that I can see. I wouldn't be surprised if createBrotliDecompress is causing issues, as it's the central cause of another issue that's hard to debug. Actually, looking at the code in #2032, spacexdata is sending us a br content-encoding header too.
If we take a look at https://github.com/node-fetch/node-fetch/blob/e87b093fd678a9ea39c5b17b2a1bdfc4691eedc7/src/index.js#L277-L366 (which is how node-fetch decompresses the body), it's more or less identical to what undici is doing in this instance. I'm unsure as to what exactly is causing it to hang here, and not node-fetch, unless of course it's not an issue with zlib, but something else?
I've been experiencing this as well: random hangs that only reproduce on Linux and never on macOS. The one time I observed it interactively in a Vagrant VM, I saw 200% CPU usage and the process was completely stuck somewhere and did not even accept SIGINT via keyboard.
There were no brotli-encoded responses involved; all responses were uncompressed and came from a local Node.js server.
I wouldn't be surprised if createBrotliDecompress is causing issues
I've been trying to reproduce this issue without fetch at all, using just node:stream and node:zlib with workers. So far I haven't been able to reproduce it that way. Below is a script that reads the API's raw response from the filesystem and passes it to createBrotliDecompress:
/*
  Fetch the fixture first:
  curl --raw -H 'Accept-Encoding: br' https://api.spacexdata.com/v4/starlink > raw-response
*/
import { createBrotliDecompress } from "node:zlib";
import { pipeline } from "node:stream";
import { createReadStream } from "node:fs";
import { isMainThread, parentPort, Worker } from "node:worker_threads";
import { fileURLToPath } from "node:url";
import { cpus } from "node:os";

const THREADS = cpus().length - 1;
const ROUNDS = 100;

if (isMainThread) {
  const timeout = setTimeout(() => {
    console.log("Forcing process.exit on main thread");
    process.exit();
  }, 60_000);

  const task = async () => {
    const worker = new Worker(fileURLToPath(import.meta.url), {});
    const [promise, resolve] = getPromise();
    worker.on("message", (m) => m === "DONE" && resolve());
    await promise;

    const timer = setTimeout(() => {
      console.log("Unable to terminate");
    }, 10_000);

    await worker.terminate();
    clearTimeout(timer);
  };

  const pool = Array(ROUNDS).fill(task);

  async function execute() {
    const task = pool.shift();
    if (task) {
      await task();
      return execute();
    }
  }

  await Promise.all(
    Array(THREADS)
      .fill(execute)
      .map((task) => task())
  );

  console.log("All done!");
  clearTimeout(timeout);
} else {
  const pool = Array(10).fill(decompress);
  await Promise.all(pool.map((task) => task()));
  parentPort.postMessage("DONE");
}

async function decompress() {
  const [promise, resolve, reject] = getPromise();

  const output = pipeline(
    createReadStream("./raw-response"),
    createBrotliDecompress(),
    (err) => err && reject(err)
  );

  const chunks = [];
  output.on("readable", () => {
    let chunk;
    while (null !== (chunk = output.read())) {
      chunks.push(chunk);
    }
  });
  output.on("end", resolve);
  output.on("error", reject);

  await promise;

  const data = chunks.join("");
  const json = JSON.parse(data);
  console.assert(json != null, `JSON unexpected: ${JSON.stringify(json)}`);
}

function getPromise() {
  let resolve, reject;
  const promise = new Promise((...args) => {
    [resolve, reject] = args;
  });
  return [promise, resolve, reject];
}
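(Aside: on recent Node versions, assuming Promise.withResolvers is available, the getPromise helper could be replaced with the built-in:)

// Equivalent to getPromise() above, using the built-in resolver helper:
const { promise, resolve, reject } = Promise.withResolvers();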
it's hard to fix an issue that can't be debugged locally
I'm happy to help by providing debugging logs or similar if there is anything you'd like to see. I'm able to reproduce the original issue consistently on my local machine.
I've been looking for hanging HTTP sockets with netstat when workers get stuck, but haven't spotted anything unusual. I'm running out of ideas about what to look for next.
https://github.com/vitest-dev/vitest/issues/2008#issuecomment-1337272807 may be relevant as well; there I managed to attach a Chrome debugger via node --inspect to such a hanging process, but I'm not sure if those traces are of any help, or even whether it's the exact same issue.
It would definitely help if we could get similar traces via node --inspect.
I wonder if this is the same as https://github.com/nodejs/node/issues/47228
I'm not sure. I did observe this with Node 18. Also, that hang occurs with an idle CPU, whereas this one was with a pegged CPU.
I wonder if this is the same as nodejs/node#47228
I don't think so. That issue mentions that it does not reproduce on Node 18, while this one does. Also, even when passing the keepalive: false option to fetch, I'm still able to get the worker_threads stuck with this issue's reproduction case.
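For clarity, the keepalive variant mentioned here is just the repro's fetch call with the option added (a sketch, reusing names from the repro):

// Sketch: the repro's fetch call with keep-alive disabled; the worker still hangs.
const output = await fetch(url.href, {
  method: "GET",
  keepalive: false,
  headers: { "Content-Type": "application/json" },
});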
Can confirm that this issue occurs on both Node v17.9.0 and v19.8.1 using undici v5.21.2, and not just with fetch. Here is a smaller example to help reproduce:
index.js:
import { Worker } from "worker_threads";
import path from "path";
import { fileURLToPath } from "url";
new Worker(path.join(path.dirname(fileURLToPath(import.meta.url)), "./worker.js"));
worker.js:
import { request } from "undici";
request("https://media.tenor.com/qy_WcGdRzfgAAAAC/xluna-high-five.gif").then(res => res.body.arrayBuffer()).then(buf => Buffer.from(buf))
.then(() => ({
buffer: Buffer.alloc(10),
fileExtension: "bin"
}))
.then(() => {
process.exit();
});
Now removing process.exit fixes it:
import { request } from "undici";
request("https://media.tenor.com/qy_WcGdRzfgAAAAC/xluna-high-five.gif").then(res => res.body.arrayBuffer()).then(buf => Buffer.from(buf))
.then(() => ({
buffer: Buffer.alloc(10),
fileExtension: "bin"
}))
.then(() => {
//// comment out process.exit and the worker would not exit anymore
// process.exit();
});
I think this might be a bug in Node.js after all. @addaleax, do you have an idea why process.exit() in a worker might cause the worker to freeze?
Even worse, adding NODE_DEBUG=worker makes the worker actually terminate correctly.
What's the simplest way to override all global references to fetch?
Presumably, if we can do that, then we should be able to avoid the bug.
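One hypothetical approach is to stub out the global early in startup so any accidental use fails loudly instead of hanging (a sketch; later comments in this thread instead swapped in node-fetch):

// Sketch: overwrite globalThis.fetch before any other module captures a reference.
globalThis.fetch = () => {
  throw new Error("global fetch is disabled here; use another HTTP client");
};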
It seems we are hit pretty hard by this, with every couple of test runs flaking as a result (https://github.com/vitest-dev/vitest/issues/4415).
That issue seems unrelated; undici does not allocate any FileHandle.
That issue seems unrelated; undici does not allocate any FileHandle.
We ended up removing all use of fetch from our codebase and restricting further use by overriding globalThis.fetch, and the issue disappeared. 🤷
Yep, running into this now with fetch. I had to put this at the top of our Vitest setupTests.ts:
import fetch from "node-fetch"
globalThis.fetch = fetch
Issue is now resolved.
Hi @AriPerkkio, is this fixed in the latest Node (v21.7.1)? A web streams fix (nodejs/node#51526) has landed; it seems to have fixed this issue.
Great work @tsctx and @jasnell (from https://github.com/nodejs/node/pull/51255, :wave:)!
This issue seems to be fixed. I ran these tests using [email protected] and the following Node versions:
- ❌ v18.19.1: stuck on first run
- ✅ v21.7.1: 10/10 runs without any worker hangs, even with stress
- ✅ v21.7.0: 10/10 runs without any worker hangs
- ✅ v21.6.2: 10/10 runs without any worker hangs
- ❌ v21.5.0: stuck on first run
Looks like the fix was originally included in v21.6.2 via https://github.com/nodejs/node/pull/51255, but reworked for v21.7.1 in https://github.com/nodejs/node/pull/51526.
This also seems to fix two other node:worker_threads hangs that were not using undici at all.
I just now tested both of these with v21.7.1 and they no longer hang. With v18.19.1 the worker hang is easily reproduced:
- dompurify: https://github.com/vitest-dev/vitest/issues/3077#issuecomment-1867306736
- fastify: https://github.com/vitest-dev/vitest/issues/4956#issuecomment-1908312966
Let's close this issue, as the fix on Node's side has landed and the root cause was not in undici.