
`Readable.toWeb` seems to load file contents to memory

Dzieni opened this issue 2 years ago • 20 comments

Version

v19.5.0

Platform

Darwin dzieni 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:35 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T8101 arm64

Subsystem

stream

What steps will reproduce the bug?

  • run REPL
  • create a file stream: const nodeStream = fs.createReadStream('/some/large/file')
  • check process.memoryUsage()
  • convert it to a web stream: const webStream = (require('stream').Readable).toWeb(nodeStream)
  • check process.memoryUsage()

How often does it reproduce? Is there a required condition?

At all times

What is the expected behavior?

Memory usage does not grow significantly.

What do you see instead?

Memory usage (specifically the arrayBuffers section) grows by a few orders of magnitude. It appears to be correlated with the file size.

Additional information

No response

Dzieni • Jan 25 '23

Can you post a complete, ready-to-run test case? Maybe you've hit a bug, maybe you haven't, but it's impossible to tell right now.

bnoordhuis • Jan 31 '23

@bnoordhuis

// since it's ESM, save it as .mjs

import fs from 'node:fs'
import process from 'node:process'
import {Readable} from 'node:stream'

// we initialize a stream, but don't start consuming it yet
const randomNodeStream = fs.createReadStream('/dev/urandom')
// after 10 seconds, it'll get converted to a web stream
let randomWebStream

// we check memory usage every second
// since it's a stream, it shouldn't be higher than the chunk size
const reportMemoryUsage = () => {
	const {arrayBuffers} = process.memoryUsage()
	console.log(
		`Array buffers memory usage is ${Math.round(
			arrayBuffers / 1024 / 1024
		)} MiB`
	)
	if (arrayBuffers > 256 * 1024 * 1024) {
		// streaming should not lead to such a memory increase
		// therefore, if it happens => bail
		console.log('Over 256 MiB taken, exiting')
		process.exit(0)
	}
}
setInterval(reportMemoryUsage, 1000)

// after 10 seconds we use Readable.toWeb
// memory usage should stay pretty much the same since it's still a stream
setTimeout(() => {
	console.log('converting node stream to web stream')
	randomWebStream = Readable.toWeb(randomNodeStream)
}, 10000)

// after 15 seconds we start consuming the stream
// memory usage will grow, but the old chunks should be garbage-collected pretty quickly
setTimeout(async () => {
	console.log('reading the chunks')
	for await (const chunk of randomWebStream) {
		// do nothing, just let the stream flow
	}
}, 15000)

This produces the same behavior on macOS and Linux:

michal@dzieni ~ % node test.mjs
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 617 MiB
Over 256 MiB taken, exiting

Immediately after using Readable.toWeb, the memory usage spikes. Since we use /dev/urandom (a file that never ends), the usage grows indefinitely.

You can compare it to the third-party library readable-stream-node-to-web, which works properly:

// since it's ESM, save it as .mjs

import fs from 'node:fs'
import process from 'node:process'
import nodeToWebStream from 'readable-stream-node-to-web'

// we initialize a stream, but don't start consuming it yet
const randomNodeStream = fs.createReadStream('/dev/urandom')
// after 10 seconds, it'll get converted to a web stream
let randomWebStream

// we check memory usage every second
// since it's a stream, it shouldn't be higher than the chunk size
const reportMemoryUsage = () => {
	const {arrayBuffers} = process.memoryUsage()
	console.log(
		`Array buffers memory usage is ${Math.round(
			arrayBuffers / 1024 / 1024
		)} MiB`
	)
	if (arrayBuffers > 256 * 1024 * 1024) {
		// streaming should not lead to such a memory increase
		// therefore, if it happens => bail
		console.log('Over 256 MiB taken, exiting')
		process.exit(0)
	}
}
setInterval(reportMemoryUsage, 1000)

// after 10 seconds we use nodeToWebStream
// memory usage should stay pretty much the same since it's still a stream
setTimeout(() => {
	console.log('converting node stream to web stream')
	randomWebStream = nodeToWebStream(randomNodeStream)
}, 10000)

// after 15 seconds we start consuming the stream
// memory usage will grow, but the old chunks should be garbage-collected pretty quickly
setTimeout(async () => {
	console.log('reading the chunks')
	for await (const chunk of randomWebStream) {
		// do nothing, just let the stream flow
	}
}, 15000)

In that case, the memory usage is fine:

michal@dzieni ~ % node src/test.mjs 
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
reading the chunks
Array buffers memory usage is 1 MiB
Array buffers memory usage is 6 MiB
Array buffers memory usage is 9 MiB
Array buffers memory usage is 15 MiB
Array buffers memory usage is 5 MiB
Array buffers memory usage is 9 MiB
Array buffers memory usage is 11 MiB
Array buffers memory usage is 4 MiB
Array buffers memory usage is 19 MiB
Array buffers memory usage is 16 MiB
Array buffers memory usage is 1 MiB
Array buffers memory usage is 30 MiB
Array buffers memory usage is 24 MiB
Array buffers memory usage is 6 MiB
Array buffers memory usage is 4 MiB
Array buffers memory usage is 2 MiB
Array buffers memory usage is 1 MiB

Dzieni • Feb 02 '23

Thanks, I see what you mean. The stream starts reading before there's something consuming it.

@nodejs/whatwg-stream this is a legitimate bug. You can see it even more clearly when you switch from /dev/urandom to /dev/zero.

edit: bug also exists in v19.6.0.

bnoordhuis • Feb 03 '23

Hello @Dzieni, I think this issue is solved if you pass a queuing strategy while converting from the Node stream, something like:

randomWebStream = Readable.toWeb(randomNodeStream, {
  strategy: new CountQueuingStrategy({ highWaterMark: 100 }),
});
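
(With CountQueuingStrategy, each chunk counts as 1 regardless of its byte size, so a highWaterMark of 100 combined with the default 64 KiB fs chunks bounds the internal queue at roughly 100 × 64 KiB, i.e. about 6.25 MiB.)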

Tried this on my local machine and memory doesn't seem to be overflowing. (Screenshot attached in the original comment.)

I think the behavior here is somewhat expected: when the readable stream is created, the pull function described at https://nodejs.org/api/webstreams.html#new-readablestreamunderlyingsource--strategy is called continuously as soon as the ReadableStream exists. Given how the adapters from webstreams are defined here https://github.com/nodejs/node/blob/23effb255efe3eb0dc935e3a430d80b41ea1e660/lib/internal/webstreams/adapters.js#L462, this means that if no strategy is defined, streamReadable.resume() will try to consume the whole file, causing the memory overflow; passing a strategy ensures backpressure is applied.
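
To illustrate the pull()/highWaterMark interaction outside of the adapter code, here is a small standalone sketch (this is not Node internals; the 64 KiB chunk size is just an assumption mirroring the fs default). pull() keeps being invoked until the queue's total size, as measured by the queuing strategy, reaches highWaterMark, even when no reader is attached, so a count-based strategy with a huge highWaterMark lets that many chunks pile up:

let pulls = 0

const rs = new ReadableStream(
	{
		pull(controller) {
			pulls += 1
			// pretend this is one 64 KiB chunk coming from fs.createReadStream
			controller.enqueue(new Uint8Array(65536))
		},
	},
	// each chunk counts as 1; try highWaterMark: 65536 to watch the queue balloon
	new CountQueuingStrategy({highWaterMark: 2})
)

setTimeout(() => {
	// with no reader attached, pull() stops once the queue holds highWaterMark chunks
	console.log(`pull() was called ${pulls} times, queue holds ${pulls * 64} KiB`)
}, 100)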

Maybe the docs at https://nodejs.org/api/stream.html#streamreadabletowebstreamreadable-options could be updated to cover this scenario. @bnoordhuis

debadree25 • Feb 03 '23

I think this is a bug. The whole point of streams is to manage the flow of data.

mcollina • Feb 03 '23

I think this is a bug. The whole point of streams is to manage the flow of data.

So something like a default highWaterMark while doing toWeb()?

debadree25 • Feb 03 '23

No, I think there is an actual bug somewhere. Instead of resume(), this should call read().

mcollina • Feb 03 '23

@debadree25 Sounds like a workaround that is good enough for me, thanks!

Dzieni • Feb 06 '23

Hey guys and gals, could you please confirm my findings? I've looked into this briefly, and I believe I have traced it to /lib/internal/webstreams/readablestream.js.

Now, I am not entirely sure if it's by design, but neither readableByteStreamControllerShouldCallPull nor readableStreamDefaultControllerShouldCallPull seems to check for the presence of a reader: https://github.com/nodejs/node/blob/main/lib/internal/webstreams/readablestream.js#L2458-L2481 https://github.com/nodejs/node/blob/main/lib/internal/webstreams/readablestream.js#L2223-L2240

Refusing to pull when a stream has no reader seems to alleviate the issue. Something like this seems to do the trick: https://github.com/nodejs/node/pull/46643/commits/1f0e2493f159d61f9974ea416a0efc34df6ad426 (https://github.com/nodejs/node/pull/46643)
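
For readers following along, here is a conceptual sketch of that idea in isolation. It is not the actual patch from the PR; the object shapes and the hasReader() helper are hypothetical, purely to show where the extra guard sits in the "should we call pull()?" decision:

// hypothetical stream/controller shapes, for illustration only
const hasReader = (stream) => stream.reader !== undefined

function shouldCallPull(stream, controller) {
	if (stream.state !== 'readable') return false
	if (!controller.started) return false
	// the extra guard: with no reader attached there is nobody to deliver
	// chunks to, so refusing to pull keeps the queue from growing unboundedly
	if (!hasReader(stream)) return false
	// otherwise keep pulling while the queue is below the high-water mark
	return controller.strategyHWM - controller.queueTotalSize > 0
}

// an unread stream should not trigger pulls, no matter the high-water mark
console.log(shouldCallPull(
	{state: 'readable', reader: undefined},
	{started: true, strategyHWM: 65536, queueTotalSize: 0},
)) // false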

Testing against @Dzieni's benchmark (slightly modified to convert to the web stream sooner), here's the output before the change is applied:

Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 109 MiB
Array buffers memory usage is 206 MiB
Array buffers memory usage is 336 MiB
Over 256 MiB taken, exiting

Here's after the change is applied:

Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB

For completeness, I'm attaching @Dzieni's benchmark with the slight modification that I mentioned along with this message.

// since it's ESM, save it as .mjs

import fs from 'node:fs'
import process from 'node:process'
import {Readable} from 'node:stream'

// we initialize a stream, but don't start consuming it yet
const randomNodeStream = fs.createReadStream('/dev/urandom')
// in a few seconds, it'll get converted to a web stream
let randomWebStream

// we check memory usage every second
// since it's a stream, it shouldn't be higher than the chunk size
const reportMemoryUsage = () => {
	const {arrayBuffers} = process.memoryUsage()
	console.log(
		`Array buffers memory usage is ${Math.round(
			arrayBuffers / 1024 / 1024
		)} MiB`
	)
	if (arrayBuffers > 256 * 1024 * 1024) {
		// streaming should not lead to such a memory increase
		// therefore, if it happens => bail
		console.log('Over 256 MiB taken, exiting')
		process.exit(0)
	}
}
setInterval(reportMemoryUsage, 1000)

// after 3 seconds we use Readable.toWeb
// memory usage should stay pretty much the same since it's still a stream
setTimeout(() => {
	console.log('converting node stream to web stream')
	randomWebStream = Readable.toWeb(randomNodeStream)
}, 3000)

// after 30 seconds we start consuming the stream
// memory usage will grow, but the old chunks should be garbage-collected pretty quickly
setTimeout(async () => {
	console.log('reading the chunks')
	for await (const chunk of randomWebStream) {
		// do nothing, just let the stream flow
	}
}, 30000)

[Edit, Feb 14]: updated the draft solution link to point to a specific commit.

lilsweetcaligula • Feb 14 '23

@jasnell @KhafraDev could you take a look?

mcollina • Feb 17 '23

It looks like Node implements ReadableByteStreamControllerShouldCallPull and ReadableStreamDefaultControllerShouldCallPull correctly, which makes me think it's not an issue with them. I don't know enough about the WHATWG Streams spec to give a definitive answer though.

KhafraDev • Feb 17 '23

@debadree25 I've been looking into this on and off and need a clarification. In the code snippet below, should the highWaterMark of randomWebStream's default controller default to 65536 bytes, or to 65536 chunks of 65536 bytes each?

const randomNodeStream = fs.createReadStream('/dev/urandom')
const randomWebStream = Readable.toWeb(randomNodeStream)

lilsweetcaligula • Feb 19 '23

From what I understand, the high water mark would be 65536 "chunks", since here https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/adapters.js#L424 we don't set any size function, and by default the size function just returns 1 (ref: https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/util.js#L73), so it would be 65536 "chunks", with each chunk regarded as size 1.

The comment at https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/adapters.js#L420-L423 describes ByteLengthQueuingStrategy as unnecessary, but maybe it is needed after all?
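
For what it's worth, here is a sketch of what counting bytes instead of chunks could look like at the call site, using the documented options.strategy parameter of Readable.toWeb (the 1 MiB threshold is an arbitrary choice for illustration):

import fs from 'node:fs'
import {Readable} from 'node:stream'

const nodeStream = fs.createReadStream('/dev/urandom')
// queue at most ~1 MiB of data, measured in bytes rather than "1 per chunk"
const webStream = Readable.toWeb(nodeStream, {
	strategy: new ByteLengthQueuingStrategy({highWaterMark: 1024 * 1024}),
})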

debadree25 • Feb 19 '23

@debadree25 Thank you for your reply. As I was inspecting the code yesterday, I found it somewhat odd, hence I wanted to clarify it with someone.

Another odd thing I found: if you have code like this:

const randomNodeStream = fs.createReadStream('/dev/urandom', {
	highWaterMark: 5556
})

const randomWebStream = Readable.toWeb(randomNodeStream)

Upon inspection with a debugger, I found that randomWebStream's controller's queue (!) was eventually filled with 5556 chunks (i.e. queueTotalSize === 5556), with seemingly each (!) chunk having a size of 5556 bytes. Changing 5556 to 5557 in the code snippet gives a similar result: 5557 chunks, each with a size of 5557 bytes.

That means that if a user does not explicitly pass a highWaterMark to fs.createReadStream, the resulting stream will have a highWaterMark of 65536, which upon conversion to a web stream results in 65536 chunks of 65536 bytes each, for a total of 4 GiB.
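
The arithmetic behind that figure, spelled out (assuming the default fs.createReadStream chunk size):

// count-based queueing: the default size() returns 1 per chunk
const hwmChunks = 65536   // chunks allowed in the web stream's queue
const chunkBytes = 65536  // default fs.createReadStream highWaterMark in bytes
console.log(`${(hwmChunks * chunkBytes) / 1024 ** 3} GiB`) // 4 GiB may sit in the queue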

Is this a bug or am I misunderstanding it?

lilsweetcaligula • Feb 19 '23

This isn't a "good first issue". I don't think "streams" and "good first issue" really mix, except for tests/docs.

benjamingr • Feb 19 '23

@lilsweetcaligula To be fair, even I am confused here; this would need deeper investigation 😅😅

debadree25 • Feb 19 '23

@benjamingr I think this might be as simple as it gets for streams.

The problem is that in https://github.com/nodejs/node/blob/0093fd3ca85b35b8bb2f4ff9d97082a71b23a124/lib/internal/webstreams/adapters.js#L462 we call resume(), and in https://github.com/nodejs/node/blob/0093fd3ca85b35b8bb2f4ff9d97082a71b23a124/lib/internal/webstreams/adapters.js#L436-L437 we call pause() only under certain conditions (which I think are never met, or similar). This is somewhat dangerous and can cause exactly what we are seeing here.

Note that this should be calling .read() and listening for 'readable' instead.
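
A rough sketch of that approach, for illustration only (this is not the actual adapter code, and error handling is reduced to the essentials). The key point is that data is only read from the Node stream inside pull(), i.e. when the web stream actually wants more, instead of resume()-ing the source unconditionally:

function nodeToWeb(streamReadable) {
	return new ReadableStream({
		start(controller) {
			streamReadable.on('error', (err) => controller.error(err))
			streamReadable.on('end', () => controller.close())
		},
		pull(controller) {
			// read() returns null when the internal buffer is empty; in that case
			// wait for the next 'readable' event instead of resuming the stream
			const chunk = streamReadable.read()
			if (chunk !== null) {
				controller.enqueue(chunk)
				return
			}
			return new Promise((resolve) => {
				streamReadable.once('readable', () => {
					const next = streamReadable.read()
					if (next !== null) controller.enqueue(next)
					resolve()
				})
			})
		},
		cancel(reason) {
			streamReadable.destroy(reason)
		},
	})
}

// usage sketch: const webStream = nodeToWeb(fs.createReadStream('/some/large/file'))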

mcollina • Feb 21 '23

Quoting @debadree25's earlier comment:

From what I understand the high watermark would be 65536 "chunks" since here https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/adapters.js#L424 we don't set any size function, and by default the size function just returns 1 (ref: https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/util.js#L73), so it would be 65536 "chunks", each chunk regarded as size 1. The comment in lines https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/adapters.js#L420-L423 mentions ByteLengthQueuingStrategy as unnecessary, but maybe it indeed is?

Hey @mcollina

I think @debadree25 is right in his comment that we should use the ByteLengthQueuingStrategy.

Nevertheless, I did try using the 'readable' event and it worked; I'll create a PR shortly.

rluvaton • Feb 26 '23

Any updates on this?

karimfromjordan • Sep 13 '23

In the meantime I found another workaround: ReadableStream.from(randomNodeStream) seems to do the job just right, without having to set the queuing strategy explicitly.
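
For reference, a minimal sketch of that workaround (ReadableStream.from is available in recent Node versions; it accepts any async iterable, and a Node Readable is async iterable, so chunks are only pulled as the consumer reads them):

import fs from 'node:fs'

const nodeStream = fs.createReadStream('/dev/urandom')
const webStream = ReadableStream.from(nodeStream)

for await (const chunk of webStream) {
	console.log(`got ${chunk.byteLength} bytes`)
	break // data only flows once something consumes the stream
}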

Dzieni • May 07 '24

I had the same issue with the memory leak; after installing Node v22.2 it mostly went away, but not entirely. I still get a massive slowdown/memory leak when I use the ReadableStream in fetch (which many probably try with this toWeb function):

await fetch(url, {
  method: 'POST',
  body: Readable.toWeb(data),
  ...({ duplex: "half" })
});

Unfortunately, I had to switch to got to remove the memory leak.

marcj • Jun 07 '24

I want to note that I'm still experiencing a serious memory leak here as of Node 22.9.0. My use case is the same as the one mentioned by @marcj: I'm basically running a proxy, passing the request as the fetch body via Readable.toWeb(request). When uploading a file of several GB, memory utilization quickly jumps to ~4 GB before crashing.

The ReadableStream.from(request) approach mentioned above also experiences the same issue.

I refactored to use the got library streaming interface, and the memory overhead completely goes away. Of course, using native fetch with web streams would be preferable.

async function requestHandler(request, response) {
	const upstreamUrl = decodeURIComponent(request.query.url);
	const upstreamResponse = await fetch(upstreamUrl, {
		method: request.method,
		duplex: 'half',
		// Both approaches cause memory leaks
		body: Readable.toWeb(request),
		// body: ReadableStream.from(request),
	});

	await upstreamResponse.body.pipeTo(Writable.toWeb(response));
}
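
For comparison, here is a sketch of the got-based streaming variant mentioned above (assumed shape; error handling omitted). got.stream() returns a duplex stream, so the incoming request body is piped upstream and the upstream response is piped back to the client without buffering whole payloads:

import {pipeline} from 'node:stream/promises'
import got from 'got'

async function requestHandler(request, response) {
	const upstreamUrl = decodeURIComponent(request.query.url)
	// request body -> upstream, upstream response -> client response
	await pipeline(request, got.stream(upstreamUrl, {method: request.method}), response)
}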

imathews • Sep 25 '24