Unable to process a zip containing many files
Issue
I have been monitoring Node's memory usage, and as more files are processed the heap grows steadily; the garbage collector seems unable to keep up. After a couple of minutes the process crashes with an out-of-memory error like this:
<--- Last few GCs --->
11629672 ms: Mark-sweep 1174.6 (1426.5) -> 1172.4 (1418.3) MB, 659.9 / 0 ms [allocation failure] [GC in old space requested].
11630371 ms: Mark-sweep 1172.4 (1418.3) -> 1172.4 (1411.3) MB, 698.9 / 0 ms [allocation failure] [GC in old space requested].
11631105 ms: Mark-sweep 1172.4 (1411.3) -> 1172.4 (1389.3) MB, 733.5 / 0 ms [last resort gc].
11631778 ms: Mark-sweep 1172.4 (1389.3) -> 1172.4 (1368.3) MB, 673.6 / 0 ms [last resort gc].
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0x3d1d329c9e59 <JS Object>
1: SparseJoinWithSeparatorJS(aka SparseJoinWithSeparatorJS) [native array.js:~84] [pc=0x3629ef689ad0] (this=0x3d1d32904189 <undefined>,w=0x2b690ce91071 <JS Array[241627]>,L=241627,M=0x3d1d329b4a11 <JS Function ConvertToString (SharedFunctionInfo 0x3d1d3294ef79)>,N=0x7c953bf4d49 <String[4]\: ,\n >)
2: Join(aka Join) [native array.js:143] [pc=0x3629ef616696] (this=0x3d1d32904189 <undefin...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: node::Abort() [/usr/bin/node]
2: 0xe2c5fc [/usr/bin/node]
3: v8::Utils::ReportApiFailure(char const*, char const*) [/usr/bin/node]
4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/usr/bin/node]
5: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/usr/bin/node]
6: v8::internal::Runtime_SparseJoinWithSeparator(int, v8::internal::Object**, v8::internal::Isolate*) [/usr/bin/node]
7: 0x3629ef50961b
I don't think this sample code, which does nothing with each entry, can produce that memory leak on its own, so I assume it is coming from this library.
How to reproduce
Unzipper version: 0.10.11
Take a very large zip, say 40 GB containing about 1 million files, and run this code:
import fs from 'fs'
import stream from 'stream'
import unzipper, { Entry } from 'unzipper'
const readStream = fs.createReadStream('xl.zip')
const voidStream = new stream.Writable({
  objectMode: true,
  write: (chunk: Entry, encoding, next) => {
    chunk.autodrain()
    next()
  },
})
readStream.pipe(unzipper.Parse()).pipe(voidStream)
Update 08/06/2022
The issue seems to be correlated with the number of files in the archive.
@MathieuVeber are you sure it works well with any zip containing more than one compressed file? I've just opened an issue because I wasn't able to get it to work with zip files containing more than one file.
I made this small example to demonstrate my memory concern, but I've been using this library with zips containing more than one file for a while now. However, maybe that is an issue with a newer version, which is why I added my version (0.10.11) in this one.
@MathieuVeber I found out that it sometimes works with zip files containing more than one file. Maybe it depends on the file contents, I don't know.
It seems that the problem is recursion through promises: for zip files with many entries, the promise chain keeps growing until it exceeds Node's memory limit.
In Parse#_readRecord, a signature is read and, depending on its value, various specialized _read... functions are called. Many of these in turn call _readRecord again.
Each call happens inside the same promise chain, calling .then and returning a new promise from the handler. For an archive with many entries this results in an ever-growing promise chain.
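To illustrate the pattern, here is a minimal sketch with made-up names (readRecord and readEntry are hypothetical stand-ins, not the actual unzipper internals): because each step returns the promise of the next step from inside .then, the outermost promise can only settle once the deepest one has, so every intermediate promise and its closure stays reachable at the same time.
// Minimal sketch, hypothetical names -- NOT the real unzipper code.
// The outermost promise cannot settle until the deepest one does, so for
// N entries roughly N pending promises (plus closures) are alive at once.
function readEntry(i) {
  // stand-in for one specialized _read... step parsing a single record
  return Promise.resolve(i)
}

function readRecord(i, total) {
  if (i >= total) return Promise.resolve()
  return readEntry(i).then(() => readRecord(i + 1, total))
}

readRecord(0, 1_000_000).then(() => console.log('done'))
Running a sketch like this with a small heap (e.g. node --max-old-space-size=10) should show the same kind of steady growth, since nothing in the chain can be collected before the last entry has been reached.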
I tested with the following setup:
I ran the following code to create a folder data full of 10,000 small files:
const fs = require("fs/promises");
const crypto = require("crypto");
(async function() {
for(let i = 0; i < 10000; i++) {
const content = crypto.randomBytes(4).toString("hex");
await fs.writeFile(`${__dirname}/data/${content}.txt`, content, "utf8");
}
})();
Then I used Windows to zip the entire folder into data.zip.
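If you prefer to script that zipping step instead of doing it by hand, here is a convenience sketch that assumes the third-party archiver package is installed (it is not part of unzipper or of this issue); it should produce an equivalent data.zip:
// Optional convenience script -- assumes `npm install archiver` (third-party).
const fs = require("fs");
const archiver = require("archiver");

const output = fs.createWriteStream(`${__dirname}/data.zip`);
const archive = archiver("zip", { zlib: { level: 9 } });

output.on("close", () => console.log(`data.zip written, ${archive.pointer()} bytes`));
archive.on("error", (err) => { throw err; });

archive.pipe(output);
// add the contents of ./data under a top-level "data/" folder inside the zip
archive.directory(`${__dirname}/data/`, "data");
archive.finalize();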
I ran @MathieuVeber's script from above, but with data.zip as the input file, while setting the Node.js memory limit to 10 MB:
node --max-old-space-size=10 largefile-memory-leak.js
This gave me the memory limit error that @MathieuVeber printed above.
data.zip is only around 1 MB in size; in contrast, a zip that was 1 GB but contained only 100 files worked fine under the same conditions.
I have added a PR that fixes this issue: https://github.com/ZJONSSON/node-unzipper/pull/257
For anyone looking for a quick workaround, you can use this repo:
https://github.com/diverta/node-unzipper-multifile
Hopefully this issue has been resolved with https://github.com/ZJONSSON/node-unzipper/pull/257 (thanks @hypesystem). Please reopen if this is still an issue. The latest published version is v0.12.0.