node-unzipper
file.stream(...).autodrain is not a function
I am trying to process a zip from S3 and need to ignore some of the files in it. At the moment I am skipping a few files without calling the autodrain function, and I am getting unexpected results while processing: when reading an entry's content, I sometimes get the content of another file that I ignored. I suspect this is because I am not calling autodrain.
As per the documentation I should call autodrain so the stream doesn't halt, so I tried that, and now it gives me an error (file.stream(...).autodrain is not a function). This is my code:
const aws = require("aws-sdk");
const s3Client = new aws.S3();
const unzipper = require("unzipper");

const directory = await unzipper.Open.s3(s3Client, {
  Bucket: "somebucket",
  Key: "somezip.zip"
});
const files = directory.files;

for (const file of files) {
  if (file.path.includes("some/path")) {
    // do something with the stream
  } else {
    file.stream().autodrain(); // this gives autodrain is not a function
  }
}
If I look at the types, and according to IntelliSense, the stream() method returns an Entry object which has an autodrain() function, so I am not sure why it says autodrain is not a function. See screenshot here.
Any help is really appreciated.
The Open methods are random access and there is no reason to drain, i.e. if you ignore an entry then there is no harm. Whenever you call .stream or .buffer you start reading the zip file at the precise location of the entry.
The concept of autodrain comes from the legacy Parse method, which basically reads through the entire zip file from start to finish and emits entries along the way. Each entry has to be read for the reader to be able to continue to the next entry (or the end of the file). If you want to skip an entry, you have to call autodrain. Again, with the Open methods, autodrain is not applicable or needed.
Perhaps we should make autodrain a NOOP here to ensure the entry definitions match between Open and Parse.
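For illustration, a minimal sketch contrasting the two approaches (the archive.zip path, the some/path filter and the /tmp output paths are placeholders, and the Open part assumes it runs inside an async function):

const fs = require("fs");
const unzipper = require("unzipper");

// Legacy Parse: the zip is read sequentially from start to finish,
// so every entry must be consumed or autodrained before the parser
// can move on to the next one.
fs.createReadStream("archive.zip")
  .pipe(unzipper.Parse())
  .on("entry", entry => {
    if (entry.path.includes("some/path")) {
      entry.pipe(fs.createWriteStream(`/tmp/${entry.path.split("/").pop()}`));
    } else {
      entry.autodrain(); // required here, otherwise the stream stalls
    }
  });

// Open methods: random access via the central directory, so entries
// you never call .stream() or .buffer() on are simply never read.
const directory = await unzipper.Open.file("archive.zip");
for (const file of directory.files) {
  if (!file.path.includes("some/path")) continue; // just skip it, no drain needed
  file.stream().pipe(fs.createWriteStream(`/tmp/${file.path.split("/").pop()}`));
}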
Thanks @ZJONSSON for the quick reply. This makes sense, but I am not sure why the behavior is like this. My zip has a structure like this:
- somefile.txt
- data
  - file1.csv
  - file2.csv
  - ....
  - file86.csv // this is the file from where the problem occurs
  - ....
  - file450.csv
- resources
  - header.csv
I have a check in my code to only read files from the data folder. It works fine until CSV no. 86, but after that it picks up the content of somefile.txt.
This is my full code:
const directory = await unzipper.Open.s3(s3Client, { Bucket: 'ocal', Key: 'CSV.zip' });
const files = directory.files;

for (const file of files) {
  if (file.path.includes("data/file")) {
    console.time(file.path);
    const collection = mongo.db().collection(recordIdentifiersMapping[file.path]);
    await file.stream()
      .pipe(etl.csv({ "skipLines": 1, "headers": getHeaders(file.path) }))
      .pipe(etl.collect(1000))
      .pipe(etl.map(res => {
        res.forEach(item => {
          item._id = uuid();
        });
        return res;
      }))
      .pipe(etl.mongo.upsert(collection, ["_id"]))
      .promise()
      .then(() => {
        console.log("finished", file.path);
        console.timeEnd(file.path);
      });
  }
}
Any idea what I might be doing wrong?
Just to give an update: instead of opening the zip from S3 using unzipper.Open.s3, this time I downloaded and extracted the zip and then ran my code against the extracted files (basically using fs.createReadStream()), and it seems to be working fine so far. Not sure if this will help.
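For reference, a rough sketch of that workaround, assuming the archive has already been downloaded and extracted to a local directory; the extractedDir path and the processCsvStream helper (standing in for the etl pipeline above) are placeholders:

const fs = require("fs");
const path = require("path");

// Assumes the zip was already downloaded from S3 and extracted
// to extractedDir by a separate step.
const extractedDir = "/tmp/CSV";

async function processExtracted(processCsvStream) {
  const dataDir = path.join(extractedDir, "data");
  for (const name of fs.readdirSync(dataDir)) {
    if (!name.startsWith("file")) continue; // only the data/fileNN.csv entries
    // Same ETL pipeline as before, but fed from a plain file stream
    // instead of unzipper's S3-backed entry stream.
    const stream = fs.createReadStream(path.join(dataDir, name));
    await processCsvStream(stream, `data/${name}`);
  }
}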
Hi @ZJONSSON, did you get a chance to look into my updated comments and the behavior I am facing? It is now consistent: if I download the first 60-70 files, it works fine, but after that it returns the first file for all of the remaining files, as I mentioned in my previous comments. Do you have a suggestion here, or is there something I might be doing wrong?
I've got the exact same problem. For some reason, it randomly populates the file data with the root file, which in my case is license.txt; the console-logged file paths are the correct ones, but the content is that of the root file.
For reference:
const directory = await unzipper_1.default.Open.s3(s3, { Bucket: utilities_1.getEnv("BUCKET"), Key: key });
try {
  const filesToExtract = directory.files.filter(file => files.includes(file.path));
  console.log('-----------------', JSON.stringify(filesToExtract));
  await Promise.all(filesToExtract.map(async (file) => {
    console.log('---------------extracting-------------', file.path);
    const fileKey = `extracted/${file.path}`;
    return await s3.upload({ Bucket: utilities_1.getEnv("BUCKET"), Key: fileKey, Body: file.stream() }, {
      queueSize: 2
    }).promise();
  }));
} catch (err) {
  console.log(err);
}