`Readable.toWeb` seems to load file contents into memory
Version
v19.5.0
Platform
Darwin dzieni 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:35 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T8101 arm64
Subsystem
stream
What steps will reproduce the bug?
- run REPL
- create a file stream: `const nodeStream = fs.createReadStream('/some/large/file')`
- check `process.memoryUsage()`
- convert it to a web stream: `const webStream = (require('stream').Readable).toWeb(nodeStream)`
- check `process.memoryUsage()`
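Put together (and typed into the REPL one line at a time), the steps above look roughly like this; the path is a placeholder for any sufficiently large file:

const fs = require('fs')

const nodeStream = fs.createReadStream('/some/large/file') // placeholder path
process.memoryUsage().arrayBuffers // still small at this point

const webStream = require('stream').Readable.toWeb(nodeStream)
process.memoryUsage().arrayBuffers // a moment later, this has grown roughly in line with the file size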
How often does it reproduce? Is there a required condition?
At all times
What is the expected behavior?
Memory usage does not grow significantly.
What do you see instead?
Memory usage (precisely the `arrayBuffers` section) grows by a few orders of magnitude. It seems to be correlated with the file size.
Additional information
No response
Can you post a complete, ready-to-run test case? Maybe you've hit a bug, maybe you haven't, but it's impossible to tell right now.
@bnoordhuis
// since it's ESM, save it as .mjs
import fs from 'node:fs'
import process from 'node:process'
import {Readable} from 'node:stream'

// we initialize a stream, but don't start consuming it yet
const randomNodeStream = fs.createReadStream('/dev/urandom')

// after 10 seconds, it'll get converted to a web stream
let randomWebStream

// we check memory usage every second;
// since it's a stream, it shouldn't be higher than the chunk size
const reportMemoryUsage = () => {
  const {arrayBuffers} = process.memoryUsage()
  console.log(
    `Array buffers memory usage is ${Math.round(
      arrayBuffers / 1024 / 1024
    )} MiB`
  )

  if (arrayBuffers > 256 * 1024 * 1024) {
    // streaming should not lead to such a memory increase,
    // therefore, if it happens => bail
    console.log('Over 256 MiB taken, exiting')
    process.exit(0)
  }
}
setInterval(reportMemoryUsage, 1000)

// after 10 seconds we use Readable.toWeb;
// memory usage should stay pretty much the same since it's still a stream
setTimeout(() => {
  console.log('converting node stream to web stream')
  randomWebStream = Readable.toWeb(randomNodeStream)
}, 10000)

// after 15 seconds we start consuming the stream;
// memory usage will grow, but the old chunks should be garbage-collected pretty quickly
setTimeout(async () => {
  console.log('reading the chunks')
  for await (const chunk of randomWebStream) {
    // do nothing, just let the stream flow
  }
}, 15000)
This produces the same behavior on macOS and Linux:
michal@dzieni ~ % node test.mjs
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 617 MiB
Over 256 MiB taken, exiting
Immediately after using `Readable.toWeb`, the memory usage spikes. Since we use `/dev/urandom` (a file that never ends), the usage will grow indefinitely.

You can compare it with a third-party library, `readable-stream-node-to-web`, where it works properly:
// since it's ESM, save it as .mjs
import fs from 'node:fs'
import process from 'node:process'
import nodeToWebStream from 'readable-stream-node-to-web'

// we initialize a stream, but don't start consuming it yet
const randomNodeStream = fs.createReadStream('/dev/urandom')

// after 10 seconds, it'll get converted to a web stream
let randomWebStream

// we check memory usage every second;
// since it's a stream, it shouldn't be higher than the chunk size
const reportMemoryUsage = () => {
  const {arrayBuffers} = process.memoryUsage()
  console.log(
    `Array buffers memory usage is ${Math.round(
      arrayBuffers / 1024 / 1024
    )} MiB`
  )

  if (arrayBuffers > 256 * 1024 * 1024) {
    // streaming should not lead to such a memory increase,
    // therefore, if it happens => bail
    console.log('Over 256 MiB taken, exiting')
    process.exit(0)
  }
}
setInterval(reportMemoryUsage, 1000)

// after 10 seconds we use nodeToWebStream;
// memory usage should stay pretty much the same since it's still a stream
setTimeout(() => {
  console.log('converting node stream to web stream')
  randomWebStream = nodeToWebStream(randomNodeStream)
}, 10000)

// after 15 seconds we start consuming the stream;
// memory usage will grow, but the old chunks should be garbage-collected pretty quickly
setTimeout(async () => {
  console.log('reading the chunks')
  for await (const chunk of randomWebStream) {
    // do nothing, just let the stream flow
  }
}, 15000)
In that case, the memory usage is fine:
michal@dzieni ~ % node src/test.mjs
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
reading the chunks
Array buffers memory usage is 1 MiB
Array buffers memory usage is 6 MiB
Array buffers memory usage is 9 MiB
Array buffers memory usage is 15 MiB
Array buffers memory usage is 5 MiB
Array buffers memory usage is 9 MiB
Array buffers memory usage is 11 MiB
Array buffers memory usage is 4 MiB
Array buffers memory usage is 19 MiB
Array buffers memory usage is 16 MiB
Array buffers memory usage is 1 MiB
Array buffers memory usage is 30 MiB
Array buffers memory usage is 24 MiB
Array buffers memory usage is 6 MiB
Array buffers memory usage is 4 MiB
Array buffers memory usage is 2 MiB
Array buffers memory usage is 1 MiB
Thanks, I see what you mean. The stream starts reading before there's something consuming it.
@nodejs/whatwg-stream this is a legitimate bug. You can see it even more clearly when you switch from /dev/urandom to /dev/zero.
edit: bug also exists in v19.6.0.
Hello @Dzieni, I think this issue is solved if you pass a strategy while converting the Node stream, something like:

randomWebStream = Readable.toWeb(randomNodeStream, {
  strategy: new CountQueuingStrategy({ highWaterMark: 100 }),
});

I tried this on my local machine, and memory doesn't seem to be overflowing.
I think the behavior here is somewhat expected: when the readable stream is created, the pull function described at https://nodejs.org/api/webstreams.html#new-readablestreamunderlyingsource--strategy is called continuously as soon as the `ReadableStream` exists. Given how the adapters for web streams are defined here https://github.com/nodejs/node/blob/23effb255efe3eb0dc935e3a430d80b41ea1e660/lib/internal/webstreams/adapters.js#L462, this means that if no strategy is defined, `streamReadable.resume()` will try to consume the whole file, causing the memory overflow; but when we pass a strategy, it ensures backpressure is applied.

Maybe the docs here could be updated to include this scenario: https://nodejs.org/api/stream.html#streamreadabletowebstreamreadable-options @bnoordhuis
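For illustration, a byte-based variant of that workaround might look like the sketch below (`ByteLengthQueuingStrategy` is a global in recent Node.js versions; the 64 KiB `highWaterMark` is an arbitrary example value):

import fs from 'node:fs'
import {Readable} from 'node:stream'

// Sketch only: cap the web stream's queue by bytes rather than by chunk count,
// so backpressure kicks in once roughly 64 KiB is buffered.
const nodeStream = fs.createReadStream('/dev/urandom')
const webStream = Readable.toWeb(nodeStream, {
  strategy: new ByteLengthQueuingStrategy({ highWaterMark: 64 * 1024 }),
})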
I think this is a bug. The whole point of streams is to manage the flow of data.
> I think this is a bug. The whole point of streams is to manage the flow of data.

So something like a default `highWaterMark` while doing `toWeb()`?
No, I think there is an actual bug somewhere. Instead of `resume()`, this should call `read()`.
@debadree25 Sounds like a workaround that is good enough for me, thanks!
Hey guys and gals, could you please confirm my findings? I've looked into this briefly, and I believe I have traced it to `/lib/internal/webstreams/readablestream.js`.

Now, I am not entirely sure if it's by design, but neither `readableByteStreamControllerShouldCallPull` nor `readableStreamDefaultControllerShouldCallPull` seems to check for the presence of a reader:

https://github.com/nodejs/node/blob/main/lib/internal/webstreams/readablestream.js#L2458-L2481
https://github.com/nodejs/node/blob/main/lib/internal/webstreams/readablestream.js#L2223-L2240
Refusing to pull when a stream has no reader seems to alleviate the issue. Something like this seems to do the trick: https://github.com/nodejs/node/pull/46643/commits/1f0e2493f159d61f9974ea416a0efc34df6ad426 (https://github.com/nodejs/node/pull/46643)
Testing against @Dzieni's benchmark (slightly modified to convert to the web stream sooner), here's the output before the change is applied:
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 109 MiB
Array buffers memory usage is 206 MiB
Array buffers memory usage is 336 MiB
Over 256 MiB taken, exiting
Here's after the change is applied:
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
For completeness, I'm attaching @Dzieni's benchmark with the slight modification that I mentioned:
// since it's ESM, save it as .mjs
import fs from 'node:fs'
import process from 'node:process'
import {Readable} from 'node:stream'

// we initialize a stream, but don't start consuming it yet
const randomNodeStream = fs.createReadStream('/dev/urandom')

// in a few seconds, it'll get converted to a web stream
let randomWebStream

// we check memory usage every second;
// since it's a stream, it shouldn't be higher than the chunk size
const reportMemoryUsage = () => {
  const {arrayBuffers} = process.memoryUsage()
  console.log(
    `Array buffers memory usage is ${Math.round(
      arrayBuffers / 1024 / 1024
    )} MiB`
  )

  if (arrayBuffers > 256 * 1024 * 1024) {
    // streaming should not lead to such a memory increase,
    // therefore, if it happens => bail
    console.log('Over 256 MiB taken, exiting')
    process.exit(0)
  }
}
setInterval(reportMemoryUsage, 1000)

// after 3 seconds we use Readable.toWeb;
// memory usage should stay pretty much the same since it's still a stream
setTimeout(() => {
  console.log('converting node stream to web stream')
  randomWebStream = Readable.toWeb(randomNodeStream)
}, 3000)

// after 30 seconds we start consuming the stream;
// memory usage will grow, but the old chunks should be garbage-collected pretty quickly
setTimeout(async () => {
  console.log('reading the chunks')
  for await (const chunk of randomWebStream) {
    // do nothing, just let the stream flow
  }
}, 30000)
[Edit, Feb 14]: updated the draft solution link to point to a specific commit.
@jasnell @KhafraDev could you take a look?
It looks like Node implements `ReadableByteStreamControllerShouldCallPull` and `ReadableStreamDefaultControllerShouldCallPull` correctly, which makes me think it's not an issue with them. I don't know enough about the web streams spec to give a definitive answer though.
@debadree25 I've been looking into this on and off, and I may need a clarification. Could you please clarify for me whether, in the code snippet below, the `highWaterMark` value of `randomWebStream`'s default controller should default to 65536 bytes, or to 65536 chunks of 65536 bytes each?
const randomNodeStream = fs.createReadStream('/dev/urandom')
const randomWebStream = Readable.toWeb(randomNodeStream)
From what I understand, the high watermark would be 65536 "chunks", since here https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/adapters.js#L424 we don't set any size function, and by default the size function just returns 1 (ref: https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/util.js#L73), so it would be 65536 "chunks", with each chunk regarded as size 1.

The comment in lines https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/adapters.js#L420-L423 mentions `ByteLengthQueuingStrategy` as unnecessary, but maybe it is needed after all?
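To make that size accounting concrete, here is a small illustration (both strategy classes are globals in recent Node.js versions; the 64 KiB chunk below is just an example):

// With count-based accounting (matching the default size function described above),
// every chunk counts as 1, so a highWaterMark of 65536 allows 65536 chunks
// regardless of how large each one is.
const countBased = new CountQueuingStrategy({ highWaterMark: 65536 })
console.log(countBased.size(Buffer.alloc(65536))) // 1

// With byte-based accounting, the queue is capped by total bytes instead.
const byteBased = new ByteLengthQueuingStrategy({ highWaterMark: 65536 })
console.log(byteBased.size(Buffer.alloc(65536))) // 65536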
@debadree25 Thank you for your reply. As I was inspecting the code yesterday, I just found it somewhat odd, hence I had to clarify it with someone.

Another odd thing I found: if you have code like this:
const randomNodeStream = fs.createReadStream('/dev/urandom', {
  highWaterMark: 5556
})
const randomWebStream = Readable.toWeb(randomNodeStream)
Upon inspection with a debugger, I found that `randomWebStream`'s controller's queue (!) was eventually filled with 5556 chunks (i.e. `queueTotalSize === 5556`), with seemingly each (!) chunk having a size of 5556 bytes. Changing 5556 to 5557 in the code snippet gives a similar result: 5557 chunks, each 5557 bytes in size.
That means that if a user does not explicitly pass a highWaterMark to `fs.createReadStream`, the resulting stream will have a highWaterMark of 65536, which upon conversion to a web stream allows 65536 chunks of 65536 bytes each, for a total of 4 GiB.

Is this a bug, or am I misunderstanding it?
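A quick back-of-the-envelope check of that figure:

// 65536 chunks, each up to 65536 bytes, as described above
const chunks = 65536
const bytesPerChunk = 65536
console.log((chunks * bytesPerChunk) / 1024 ** 3) // 4 (GiB)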
This isn't a "good first issue", I don't think "streams" and "good first issue" really mix except for tests/docs
@lilsweetcaligula tbf even I am confused here, it would need deeper investigation 😅😅
@benjamingr I think this might be as simple as it gets for streams.
The problem is that in https://github.com/nodejs/node/blob/0093fd3ca85b35b8bb2f4ff9d97082a71b23a124/lib/internal/webstreams/adapters.js#L462 we call `resume()`, and in https://github.com/nodejs/node/blob/0093fd3ca85b35b8bb2f4ff9d97082a71b23a124/lib/internal/webstreams/adapters.js#L436-L437 we call `pause()` only under certain conditions (which I think are never met, or similar). This is somewhat dangerous and can cause exactly what we are seeing here.

Note that this should be calling `.read()` and listening for `'readable'` instead.
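A minimal sketch of that pull-based approach, written here as a hand-rolled adapter rather than the actual `adapters.js` code (names and error handling are simplified; treat it as an illustration of the idea, not as the proposed patch):

// Illustrative only: chunks are read from the Node stream on demand,
// when the web stream's consumer asks for them, so backpressure is preserved.
function nodeReadableToWeb(streamReadable) {
  return new ReadableStream({
    start(controller) {
      streamReadable.once('end', () => controller.close())
      streamReadable.once('error', (err) => controller.error(err))
    },
    pull(controller) {
      const chunk = streamReadable.read()
      if (chunk !== null) {
        controller.enqueue(chunk)
        return
      }
      // nothing buffered right now: wait for the next 'readable' event
      return new Promise((resolve) => {
        streamReadable.once('readable', () => {
          const next = streamReadable.read()
          if (next !== null) controller.enqueue(next)
          resolve()
        })
      })
    },
  })
}

// usage, for example:
// const webStream = nodeReadableToWeb(fs.createReadStream('/dev/urandom'))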
> From what I understand, the high watermark would be 65536 "chunks", since here https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/adapters.js#L424 we don't set any size function, and by default the size function just returns 1 (ref: https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/util.js#L73), so it would be 65536 "chunks", each chunk regarded as size 1. The comment in lines https://github.com/nodejs/node/blob/132c383b1872d0114e00722fe0610745f7f09cab/lib/internal/webstreams/adapters.js#L420-L423 mentions `ByteLengthQueuingStrategy` as unnecessary, but maybe it is needed after all?
Hey @mcollina, I think @debadree25 is right in his comment that we should use the `ByteLengthQueuingStrategy`; nevertheless, I did try using the `'readable'` event and it worked. I'll create a PR shortly.
Any updates on this?
In the meantime I found another workaround: `ReadableStream.from(randomNodeStream)` seems to do the job just right, without having to set the queuing strategy explicitly.
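For reference, that workaround looks like this (`ReadableStream` is a global, and `ReadableStream.from` is available in recent Node.js versions):

import fs from 'node:fs'

// Wrap the Node.js readable (an async iterable) with ReadableStream.from
// instead of converting it via Readable.toWeb.
const randomNodeStream = fs.createReadStream('/dev/urandom')
const randomWebStream = ReadableStream.from(randomNodeStream)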
I had the same issue with the memory leak, and after installing Node v22.2 it mostly went away, but not entirely. I still get a massive slowdown/memory leak when I use the `ReadableStream` in `fetch` (which many probably try with this `toWeb` function):

await fetch(url, {
  method: 'POST',
  body: Readable.toWeb(data),
  ...({ duplex: "half" })
});

I had to switch to `got` to remove the memory leak, unfortunately.
I want to note that I'm still experiencing a serious memory leak here as of Node 22.9.0. My use case is the same as the one mentioned by @marcj: I'm basically running a proxy, passing the request in the fetch body via `Readable.toWeb(request)`. When uploading a file of several GB, memory utilization quickly jumps to ~4 GB before crashing.

The `ReadableStream.from(request)` approach mentioned above also suffers from the same issue.
I refactored to use the `got` library's streaming interface, and the memory overhead completely goes away. Of course, using native `fetch` with web streams would be preferable.
async function requestHandler(request, response) {
  const upstreamUrl = decodeURIComponent(request.query.url);

  const upstreamResponse = await fetch(upstreamUrl, {
    method: request.method,
    duplex: 'half',
    // Both approaches cause memory leaks
    body: Readable.toWeb(request),
    // body: ReadableStream.from(request),
  });

  await upstreamResponse.body.pipeTo(Writable.toWeb(response));
}
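For comparison, a sketch of the `got`-based variant described above (this assumes `got`'s stream API and the same Express-style `request`/`response` objects; treat it as an outline, not a drop-in replacement):

import got from 'got'
import { pipeline } from 'node:stream/promises'

// Sketch only: proxy the incoming request body to the upstream URL and stream
// the upstream response back to the client, staying on Node streams end to end.
async function requestHandler(request, response) {
  const upstreamUrl = decodeURIComponent(request.query.url)
  await pipeline(
    request,
    got.stream(upstreamUrl, { method: request.method }),
    response
  )
}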