
unexpected behavior using append(<stream>)

Open FranckFreiburger opened this issue 5 years ago • 26 comments

In the following example:

const fs = require('fs')
const { spawn } = require('child_process');
const { Readable } = require('stream');
const archiver = require('archiver');

const archive = archiver('zip');

archive.pipe(fs.createWriteStream('test.zip'));

for ( let i = 0; i < 10; ++i ) {

	let child = spawn('dir', [], {
		shell: true,
		stdio: ['ignore', 'pipe', 'ignore'],
	});

	archive.append(child.stdout, { name: 'test'+i+'.txt' });
}

archive.finalize();

The file test.zip contains 10 files, and only the first one has content (the others are 0 bytes).

archiver v3.0.0, Node.js v11.3.0, OS: Windows 7

FranckFreiburger avatar Jan 20 '19 17:01 FranckFreiburger

Same trouble here, on Linux (Fedora 29), Node.js v10, archiver v3, when appending an HTTP stream (IncomingMessage).

The content is randomly truncated in a completely non-deterministic way, i.e. the very same streams in the exact same order may succeed 10% of the time and fail 90% of the time, with half of the file truncated or even a length of zero bytes.

I have tried appending a stream only once the previous one has finished; it doesn't seem to improve the result.

Any news on this?

cronvel avatar Feb 12 '19 15:02 cronvel

I am having the same issue

robertellam avatar Feb 28 '19 20:02 robertellam

Same for me. Node 10.7.0, Ubuntu 18.10, archiver 3.0.0.

eric-basley avatar Mar 14 '19 16:03 eric-basley

I noticed that when files get truncated, more than one entry event has fired between append calls. I tried waiting on the entry event before appending the next file; this fixed most of the files, but in some cases the first file now gets truncated. Instead, I tried buffering the stdout stream through a PassThrough stream, which seems to work around the issue.

const fs = require('fs')
const child_process = require('child_process');
const { PassThrough } = require('stream');
const archiver = require('archiver');

issue_364();

async function issue_364() {
   await verify(wait_entry);
   await verify(pass_through);
   await verify(append_stream);
}

async function verify(generate_archive) {
   console.log(generate_archive.name);
   const archive = archiver('zip');
   const out = fs.createWriteStream('test.zip');
   archive.pipe(out);

   let done = new Promise(resolve => {
      out.on('finish', () => {
         verify_files();
         resolve();
      });
   });
   await generate_archive(archive);  
   archive.finalize();
   await done;
}

// fails all but first file
function append_stream(archive) {
   for ( let i = 0; i < 10; ++i ) {
      let child = spawn();
      append(child.stdout, i, archive);
   }
}

// all files pass
function pass_through(archive) {
   for ( let i = 0; i < 10; ++i ) {
      let pass = new PassThrough();
      let child = spawn();
      child.stdout.pipe(pass);
      append(pass, i, archive);
   }
}

// the first file still gets truncated if this is run first
// if this is run after another test it will pass
async function wait_entry(archive) {
   for ( let i = 0; i < 10; ++i ) {
      let child = spawn();
      let entry = new Promise(resolve => {
         archive.on('entry', resolve);
      });
      append(child.stdout, i, archive);
      await entry;
   }
}

// 13 bytes on stdout
function spawn() {
   return child_process.spawn('echo', ['hello world!']);
}

function append(stream, index, archive) {
   archive.append(stream, { name: `foo/test${index}.txt` });
}

function verify_files() {
   child_process.execSync('unzip test.zip');
   for(let file of fs.readdirSync('foo')) {
      let size = fs.statSync(`foo/${file}`).size;
      if(size != 13) {
         console.log(`${file} expected 13 bytes got ${size}`);
      }
   }
   child_process.execSync('rm -r foo test.zip');
} 

brownjt avatar May 15 '19 14:05 brownjt

For anyone having the same problem and having trouble finding the workaround, the short answer is:

const { PassThrough } = require('stream')
archive.append(file.pipe(new PassThrough()), { name })

I tested this with network requests and child processes; it seems to work every time.
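
For illustration, a minimal sketch of this workaround applied to the child-process repro from the top of this issue (same command and file names as the original report; this is a sketch of the suggestion above, not official archiver usage):

const fs = require('fs');
const { spawn } = require('child_process');
const { PassThrough } = require('stream');
const archiver = require('archiver');

const archive = archiver('zip');
archive.pipe(fs.createWriteStream('test.zip'));

for (let i = 0; i < 10; ++i) {
   const child = spawn('dir', [], {
      shell: true,
      stdio: ['ignore', 'pipe', 'ignore'],
   });

   // Pipe stdout into a fresh PassThrough immediately, in the same tick the
   // child is spawned, and hand the PassThrough to archiver instead of stdout.
   archive.append(child.stdout.pipe(new PassThrough()), { name: 'test' + i + '.txt' });
}

archive.finalize();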

jntesteves avatar Oct 24 '19 23:10 jntesteves

This works for me, and it runs smoothly.

const bufs = []
child.stdout.on('data', d => bufs.push(d))
child.stdout.on('end', () => archive.append(Buffer.concat(bufs), { name: 'test'+i+'.txt' }))

zhujun24 avatar Dec 08 '19 09:12 zhujun24

@zhujun24 This issue does not apply to your example, since you are appending a Buffer, and the problem only happens when appending a Stream.

To anyone interested, the fix for this issue has already been accepted upstream in archiver-utils with the merged PR 17. Hopefully it'll be in the next release of archiver.

jntesteves avatar Dec 09 '19 22:12 jntesteves

I am having a similar issue, but with bigger files. Sample code:

const https = require("https");
// `archive` and `promiseAllLimit` are defined elsewhere (not shown).
const urls = [
  { fileName: "abc", url: "https://abcd.com/q?abc.mp4" },
  { fileName: "xyz", url: "https://abcd.com/q?xyz.mp4" }
];
while (urls.length) {
  await Promise.all(
    urls.splice(0, promiseAllLimit).map(elem => {
      return new Promise((resolve, reject) => {
        const fileName = elem.fileName;
        const url = elem.url;
        if (url) {
          https.get(url, data => {
            archive.append(data, { name: fileName });
            data.on("close", () => {
              console.log("Closed : ", fileName);
              resolve("done");
            });
          });
        }
      });
    })
  );
}
archive.finalize();

This code works fine for small files, but whenever I try to zip large files it does not seem to work. Say I try archiving 3 files of 500 MB each; after archiving is complete, only 1 of the three files is complete and the other two are a few KB. It seems like finalize() finishes before the read streams have been completely consumed.

Also, if I use buffers instead of streams when appending to archiver, it works fine. But buffers take memory, and if the files are very large that uses a lot of memory, which is not the best option. Please help!

raghuchahar007 avatar Jan 13 '20 04:01 raghuchahar007

@raghuchahar007 No wonder that doesn't work; I'm surprised you can even finalize at all. That code will fail whenever the VM hits its memory limit. Your problem is not related to this issue, since this issue is about streaming only, which you definitely should be doing, but you're not.

I don't know what you're using for the HTTP requests, but you should take a look at request's documentation, especially the Streaming part, and follow from there. Then you can use the solution I posted above for each append(). It's important to pipe through a PassThrough() stream on the same iteration of the event loop in which the stream was initialized.

A full example of streaming a file with request is in this PR to archiver-utils, which landed the solution to this issue.

I hope that helps. I know Node streams are quite a complex subject.

[EDIT]: Reading your code again, I guess the data argument in the callback to https.get() might actually be a stream, in which case the only thing you're missing is piping it through a PassThrough() stream until the fix lands in a future version of archiver.
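
To make that timing point concrete, a rough sketch with plain https.get (archive, url, and fileName are assumed to exist in the surrounding code; this illustrates the advice above rather than a tested fix):

const https = require('https');
const { PassThrough } = require('stream');

https.get(url, (res) => {
   // Pipe into the PassThrough synchronously, in the same event-loop iteration
   // in which the response stream is obtained, then append the PassThrough.
   archive.append(res.pipe(new PassThrough()), { name: fileName });
});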

jntesteves avatar Jan 13 '20 16:01 jntesteves

@jntesteves Yes, the data provided in the callback by the https.get method is actually a readable stream, and I have already tried both of your suggestions, i.e. using the request module and using a PassThrough before appending to archiver. The request module behaved similarly.

Using PassThrough:

const { PassThrough } = require("stream");
const https = require("https");
// `archive` and `promiseAllLimit` are defined elsewhere (not shown).
const urls = [
  { fileName: "abc", url: "https://abcd.com/q?abc.mp4" },
  { fileName: "xyz", url: "https://abcd.com/q?xyz.mp4" }
];
while (urls.length) {
  await Promise.all(
    urls.splice(0, promiseAllLimit).map(elem => {
      return new Promise((resolve, reject) => {
        const fileName = elem.fileName;
        const url = elem.url;
        if (url) {
          https.get(url, data => {
            archive.append(data.pipe(new PassThrough()), { name: fileName });
            data.on("close", () => {
              console.log("Closed : ", fileName);
              resolve("done");
            });
          });
        }
      });
    })
  );
}
archive.finalize();

Am I doing something wrong here?

Also, I tried piping the same streams to the local file system in addition to appending them to the archive, i.e.:

archive.append(data.pipe(new PassThrough()), { name: fileName }); // data is a readable stream
data.pipe(require("fs").createWriteStream(fileName));

In this case, the files saved to my file system were fine, but the files I received in the ZIP were incomplete.

The strange thing is that one file out of all those large files in the ZIP is always complete.

raghuchahar007 avatar Jan 14 '20 05:01 raghuchahar007

@raghuchahar007 Just to make sure it's not a simple mistake: I suggested this same workaround to a colleague at work, and he had trouble because he had npm-installed stream and some other dependencies that are actually native to Node.js; in doing so he replaced the native modules with whatever someone had published to npm under the same name (possibly even malware).

That would be the simple mistake; I'll follow this message with a much more elaborate answer to your problem. I wish this feature could always be as simple as archiver's API tries to make it, but I have had to do it the more elaborate way in the past, and I still haven't figured out whether it's actually possible to make archiver's simple implementation solve this use case.

jntesteves avatar Jan 14 '20 13:01 jntesteves

Internally, archiver uses an async queue to queue up files to be added to the archive. Due to zip's streaming nature, the files must be encoded one at a time, which means queued files will not be processed until previous files have finished encoding. The async queue does that for you, but for Node streams there's a catch: if your streams are already flowing when added to the queue, backpressure must be managed, which in the case of long-running tasks like these means pausing the stream for potentially minutes while a file downloads. The PassThrough() stream (and all native Node streams) is quite smart and will do that automatically for you, but it can only signal to the producer that it needs to pause; it can't force the producer to stop producing chunks, and its internal in-memory buffer is very limited. If the HTTP request keeps sending chunks after the buffer is full, those bytes will be lost.
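
As an archiver-independent sketch of that backpressure signal (the tiny highWaterMark is only there to make the buffer fill immediately; this illustrates the mechanism, not archiver's internals):

const { PassThrough } = require('stream');

const pass = new PassThrough({ highWaterMark: 16 }); // deliberately tiny buffer

// write() returns false once the internal buffer exceeds highWaterMark: that
// is the "please pause" signal. pipe() honours it automatically, but a remote
// producer that keeps sending cannot be forced to stop.
console.log(pass.write(Buffer.alloc(64))); // false

pass.on('drain', () => console.log('buffer flushed, producer may resume'));
pass.resume(); // consume (and discard) the buffered data so 'drain' can fire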

Given this scenario, archiver's internal queue becomes inappropriate, and the best solution I've found so far is implementing my own queuing for these jobs. It's quite simple, though. Internally, archiver uses a lower-level module called zip-stream. It does the exact same thing archiver does, except for the queuing part. You can manually queue your HTTP requests so that each one starts only after the previous file finishes encoding, this way creating the producer streams only when the consumer stream is ready to start consuming them. I usually queue with a sequence of chained promises, but any mechanism will do.

I've just set up a repository with my experiments in breaking archiver. There you'll find a test file with a scenario using zip-stream to download gigabytes of files in a sequence of promises. This is the only test that passes with a list of such big files.

Sorry for the overly long text. I hope that helps anyone with the same problem.

jntesteves avatar Jan 14 '20 20:01 jntesteves

Thanks, @jntesteves. Yes, I was also searching for some end/finish/close event signalling the completion of archiver's append (not finalize) so that I could queue my files accordingly, but I didn't find any. It would be good if archiver itself provided some flexibility around its append method so that we could wait for the previous file before adding others.

As per your suggestion about using zip-stream directly, I checked your code and modified it to my requirements, and it now seems to work fine.

Here is the new sample:

const packer = require("zip-stream");
const fs = require("fs");
const https = require("https");
const http = require("http");

const archive = new packer();

const output = fs.createWriteStream("./ZipTest.zip");
archive.pipe(output);

const urls = [
  {
    fileName: "1.png",
    url: "https://homepages.cae.wisc.edu/~ece533/images/airplane.png"
  },
  {
    fileName: "2.png",
    url: "https://homepages.cae.wisc.edu/~ece533/images/boat.png"
  }
];

//urls must be https
const handleEntries = elem => {
  return new Promise((resolve, reject) => {
    const fileName = elem.fileName;
    const url = elem.url;
    console.log("Downloading : ", fileName);
    https.get(url, data => {
      archive.entry(data, { name: fileName }, (error, result) => {
        if (!error) {
          console.log(`File : ${fileName} appended.`);
          resolve(result);
        } else {
          console.error(`Error appending file : ${fileName} url : ${url}.`);
          reject(error);
        }
      });
      data.on("close", () => {
        console.log("Int : closed ,", fileName);
      });
    });
  });
};

const testZip = async () => {
  for (const elem of urls) {
    await handleEntries(elem);
  }
  archive.finish();
};

testZip();

Please correct me if anything is wrong here. Also, why do we need zlib.createGunzip() before piping?

raghuchahar007 avatar Jan 15 '20 04:01 raghuchahar007

@raghuchahar007 Yes, your code seems fine.

zlib.createGunzip() is there because if the response is gzipped by the server, request doesn't gunzip it automatically on the response event.

Regarding the append() method, having a callback for when a file finishes processing would mean you're doing the queuing yourself, and since the only thing archiver does is queuing on top of zip-stream and tar-stream, I think that kind of defeats the point of using archiver in the first place. Archiver is a queuing helper, nothing more. I think the documentation just needs a little more clarification on that, and on which use cases each lib supports best. And zip-stream's documentation definitely needs some love. I'll try to contribute that when I have some time.

jntesteves avatar Jan 15 '20 11:01 jntesteves

@jntesteves Thanks for your suggestions

zlib.createGunzip() is there because if the response is gzipped by the server, the request doesn't gunzip it automatically on the response event.

Actually, in my case I don't need to unzip the gzipped file, i.e. if I encounter a gzipped file, I append it to the zip-stream directly, which results in a zip containing gzipped files.

raghuchahar007 avatar Jan 15 '20 13:01 raghuchahar007

@raghuchahar007 That part is actually about the transparent gzipping the HTTP protocol supports when you send an Accept-Encoding: gzip header, where the web server gzips the payload before sending it over the wire. It's usually used for text files, so you probably don't need that.
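
For reference, a small sketch of how that decoding could look in the zip-stream sample above, where the response stream is the data argument of https.get (the content-encoding check is an assumption about the server's response headers):

const zlib = require("zlib");

// Only gunzip when the server actually applied Content-Encoding: gzip;
// otherwise append the response stream as-is.
const body = data.headers["content-encoding"] === "gzip"
  ? data.pipe(zlib.createGunzip())
  : data;

archive.entry(body, { name: fileName }, (error, result) => { /* as in the sample above */ });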

jntesteves avatar Jan 15 '20 20:01 jntesteves

@jntesteves I am getting a "network failed" error in Chrome while downloading large files. The above code worked for files of 600 MB each, but it fails for 1 GB and more. Please help!

raghuchahar007 avatar Jan 20 '20 12:01 raghuchahar007

@raghuchahar007 This is the kind of problem that can't be solved by just looking at code; it's going to take some debugging. I also haven't ever sent payloads this big using Node.js, so you might want to check whether you can even do it without going through your middlewares first, to rule out a limitation in some more basic piece of your infrastructure, like Node.js, Express, or a proxy. It could be a bug outside your code; not many projects have automated tests with large files. Check the VM's memory consumption. Run Express in debug mode. Check the logs of the VM, the proxy, and any other infrastructure in the middle. Try other browsers and other versions of Node.js. Best of luck.

jntesteves avatar Jan 20 '20 12:01 jntesteves

@jntesteves The problem is that I am unable to reproduce this locally; it only happens when my app is deployed. My local machine has 12 GB of RAM and 4 cores, while the pod where my app is deployed has 300 MB of RAM and 0.3 CPU cores. But the task I am running doesn't seem to use more than 100 MB of RAM and 0.1 of a CPU core. I am not getting any error, my server doesn't stop working, and Chrome's dev tools don't show anything. After downloading a few MB/GB it suddenly gives a network error.

Should I go more native and try something else?

raghuchahar007 avatar Jan 21 '20 08:01 raghuchahar007

Switching directly to zip-stream resolves all the issues for very large archives when using streams. Thanks @raghuchahar007 for a great example!

nazwa avatar Feb 26 '20 10:02 nazwa

@nazwa Is it working on a live project? I mean, I am getting some issues (network failed error) after deploying it on Google K8s.

raghuchahar007 avatar Feb 27 '20 17:02 raghuchahar007

@raghuchahar007 I only took the "zipping & file adding" parts out of your code and modified them to our needs. It seems to work great on lambdas creating 1+ GB zips.

Using node-archiver with S3 streams would "end" after about 100MB 👾

nazwa avatar Feb 27 '20 17:02 nazwa

@nazwa I tried it for a 20+ GB zip. It worked fine locally but started giving a network error after deployment. It also worked fine after I increased the server timeout on K8s to 1 hour, but increasing the timeout might not be a good solution.

raghuchahar007 avatar Feb 27 '20 17:02 raghuchahar007

I have the same issue, but when passing a buffer to append. The code is the same except that a buffer is passed to the append function. I push ~5 KB files to a zip archive. For a small number of files (~100) everything is OK. For bigger numbers, 7-Zip reports an unexpected end of archive (even though it extracts all the files correctly), and Windows Explorer reports that the file is invalid. I would appreciate your suggestions on how to resolve this issue.

It seems like it's related to #414

Described the details here #476

JekRock avatar Nov 24 '20 11:11 JekRock

This just happened to me. Several files end up with 0 bytes; switching to tar doesn't even finish the process.

WoLfulus avatar Dec 10 '20 22:12 WoLfulus

@WoLfulus can you share the code you were using?

JekRock avatar Dec 11 '20 08:12 JekRock