
Unable to process a zip containing many files

Open · MathieuVeber opened this issue 3 years ago · 6 comments

Issue

I have been monitoring Node's memory usage, and as we process more files the heap grows steadily; the garbage collector seems unable to keep up. Within a couple of minutes the process dies with a memory error like this:

<--- Last few GCs --->

11629672 ms: Mark-sweep 1174.6 (1426.5) -> 1172.4 (1418.3) MB, 659.9 / 0 ms [allocation failure] [GC in old space requested].
11630371 ms: Mark-sweep 1172.4 (1418.3) -> 1172.4 (1411.3) MB, 698.9 / 0 ms [allocation failure] [GC in old space requested].
11631105 ms: Mark-sweep 1172.4 (1411.3) -> 1172.4 (1389.3) MB, 733.5 / 0 ms [last resort gc].
11631778 ms: Mark-sweep 1172.4 (1389.3) -> 1172.4 (1368.3) MB, 673.6 / 0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x3d1d329c9e59 <JS Object>
1: SparseJoinWithSeparatorJS(aka SparseJoinWithSeparatorJS) [native array.js:~84] [pc=0x3629ef689ad0] (this=0x3d1d32904189 <undefined>,w=0x2b690ce91071 <JS Array[241627]>,L=241627,M=0x3d1d329b4a11 <JS Function ConvertToString (SharedFunctionInfo 0x3d1d3294ef79)>,N=0x7c953bf4d49 <String[4]\: ,\n  >)
2: Join(aka Join) [native array.js:143] [pc=0x3629ef616696] (this=0x3d1d32904189 <undefin...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/usr/bin/node]
 2: 0xe2c5fc [/usr/bin/node]
 3: v8::Utils::ReportApiFailure(char const*, char const*) [/usr/bin/node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/usr/bin/node]
 5: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/usr/bin/node]
 6: v8::internal::Runtime_SparseJoinWithSeparator(int, v8::internal::Object**, v8::internal::Isolate*) [/usr/bin/node]
 7: 0x3629ef50961b

I don't think this sample code, which does nothing with each entry, can produce that memory leak on its own, so I assume it is coming from this library.

How to reproduce

Unzipper version: 0.10.11

Take a very large zip, say 40 GB containing about 1 million files, and run this code:

import fs from 'fs'
import stream from 'stream'
import unzipper, { Entry } from 'unzipper'

const readStream = fs.createReadStream('xl.zip')

// A writable sink that immediately discards every entry it receives.
const voidStream = new stream.Writable({
	objectMode: true,
	write: (chunk: Entry, encoding, next) => {
		chunk.autodrain() // discard the entry's content without reading it
		next()
	},
})

readStream.pipe(unzipper.Parse()).pipe(voidStream)
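
For reference, the heap growth can be watched by adding a small logger next to the pipeline. This is my own sketch, not part of the original script:

// Log heap usage every second to observe the growth described above.
// unref() keeps this timer from holding the process open at the end.
setInterval(() => {
	const { heapUsed, heapTotal } = process.memoryUsage()
	console.log(`heap: ${(heapUsed / 1048576).toFixed(1)} / ${(heapTotal / 1048576).toFixed(1)} MB`)
}, 1000).unref()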

Update 08/06/2022

The issue seems to be correlated with the number of files in the archive.

MathieuVeber avatar Jun 03 '22 08:06 MathieuVeber

@MathieuVeber are you sure it works well with any zip containing more than one compressed file? I've just opened an issue because I wasn't able to get it to work with zip files containing more than one file.

mstmustisnt avatar Jun 03 '22 11:06 mstmustisnt

I made this small example to demonstrate my memory concern, but I've been using this library with zips containing more than one file for a while now. Maybe this is an issue with a newer version, which is why I noted mine (0.10.11) in this report.

MathieuVeber avatar Jun 07 '22 12:06 MathieuVeber

@MathieuVeber I found out that sometimes it works with zip files containing more than one file. Maybe it depends on the file contents, I don't know.

mstmustisnt avatar Jun 07 '22 12:06 mstmustisnt

It seems that the problem is recursion through promises: for zip files with lots of entries, the ever-growing promise chain eventually exceeds Node's memory limits.

In Parse#_readRecord, a signature is read and, depending on its value, one of several specialized _read... functions is called. Many of these in turn call _readRecord again.

Each call happens inside the same promise chain: the handler passed to .then returns another promise, so the chain grows by one link per record, and none of it can be collected until the whole archive has been parsed. A minimal sketch of the pattern follows below.
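
This is my own illustration, not unzipper's real code; the actual record reads are replaced by a simple counter:

let remaining = 100000; // stand-in for the number of zip entries

function readRecord() {
    return Promise.resolve(remaining--).then((n) => {
        if (n <= 0) return; // end of archive: the whole chain settles now
        // Returning another readRecord() promise links it to the outer
        // one, so none of these promises can be collected until the
        // recursion bottoms out; memory grows by one link per entry.
        return readRecord();
    });
}

readRecord().then(() => console.log("done"));

Run with a large enough count, this toy version shows the same unbounded heap growth, because the outer promise of every readRecord call stays pending until the innermost one settles.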

I tested with the following setup:

I ran the following code to create a data folder full of 10,000 small files:

const fs = require("fs/promises");
const crypto = require("crypto");

(async function() {
    // Make sure the target folder exists before writing into it.
    await fs.mkdir(`${__dirname}/data`, { recursive: true });
    for (let i = 0; i < 10000; i++) {
        // Name each file after its own random 8-hex-character content.
        const content = crypto.randomBytes(4).toString("hex");
        await fs.writeFile(`${__dirname}/data/${content}.txt`, content, "utf8");
    }
})();

Then I used Windows to zip the entire folder into data.zip.
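
(On macOS or Linux, an equivalent archive can presumably be built with zip -r data.zip data using the Info-ZIP tool; the exact archive layout may differ slightly from the Windows-built one.)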

I ran @MathieuVeber's script from above, but with data.zip as the input file and the Node.js old-space memory limit set to 10 MB:

node --max-old-space-size=10 largefile-memory-leak.js

This gave me the memory limit error that @MathieuVeber printed above.

data.zip is only around 1 MB; in contrast, a 1 GB zip containing only 100 files worked fine under the same conditions.

hypesystem avatar Jun 08 '22 13:06 hypesystem

I have opened a PR that fixes this issue: https://github.com/ZJONSSON/node-unzipper/pull/257

hypesystem avatar Jun 08 '22 18:06 hypesystem

For anyone looking for a quick workaround, this repo can be used:

https://github.com/diverta/node-unzipper-multifile

aayusharyan avatar Oct 28 '22 08:10 aayusharyan

Hopefully, this issue has been resolved with https://github.com/ZJONSSON/node-unzipper/pull/257 (thanks @hypesystem). Please reopen if this is still an issue. The latest published version is v0.12.0.

ZJONSSON avatar Jun 08 '24 20:06 ZJONSSON