Input file buffers retained in memory after a file's compression has finished
Context
In a browser web worker, zipping a directory containing 2000 files totalling 2GB. 80% of the files are under 100KB, about 10 are 10-50MB, and the rest are in between. Compression takes just under 2 minutes (initial sync implementation), and the resulting zip file is 1.8GB.
How to reproduce
In principle:
const zip = new fflate.Zip();
const zipOutputStream = fflToRS(zip); // https://github.com/101arrowz/fflate/wiki/Guide:-Modern-(Buildless)
zipOutputStream.pipeTo(targetFileStream);

// https://developer.mozilla.org/en-US/docs/Web/API/FileSystemDirectoryHandle#return_handles_for_all_files_in_a_directory
for await (const fileHandle of getTreeFileHandles(sourceDirHandle)) {
  const relativePath = await sourceDirHandle.resolve(fileHandle);
  const compressionStream = new fflate.ZipDeflate(relativePath.join('/'));
  zip.add(compressionStream);

  const file = await fileHandle.getFile();
  for await (const chunk of file.stream()) {
    compressionStream.push(chunk);
  }
  compressionStream.push(new Uint8Array(), true); // final (empty) chunk ends this entry
}

zip.end();
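For completeness, neither helper above ships with fflate. fflToRS follows the linked wiki guide's pattern of bridging fflate's callback-style ondata stream into a WHATWG ReadableStream, and getTreeFileHandles is a recursive generator along the lines of the linked MDN example. Both sketches below are assumptions, not verbatim copies:

// Bridge an fflate stream (here, Zip) into a ReadableStream, per the wiki guide's pattern.
const fflToRS = (ffl) =>
  new ReadableStream({
    start(controller) {
      ffl.ondata = (err, chunk, final) => {
        if (err) controller.error(err);
        else {
          controller.enqueue(chunk);
          if (final) controller.close();
        }
      };
    }
  });

// Recursively yield every file handle under a directory, per the MDN example.
async function* getTreeFileHandles(dirHandle) {
  for await (const handle of dirHandle.values()) {
    if (handle.kind === 'file') yield handle;
    else yield* getTreeFileHandles(handle);
  }
}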
The problem
Renderer process memory usage grows to over 2GB during compression.
Since the output is streamed chunk by chunk to a high-performance disk (origin private file system via a SyncAccessHandle), this isn't expected: chunks should be read, compressed, and written, without any data hanging around.
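(For reference, targetFileStream in the snippet above is a WritableStream over the OPFS output file; a minimal sketch, with the file name and structure assumed:)

// Worker-only: a SyncAccessHandle gives fast synchronous writes to OPFS.
const root = await navigator.storage.getDirectory();
const outFile = await root.getFileHandle('archive.zip', { create: true });
const access = await outFile.createSyncAccessHandle();
let offset = 0;
const targetFileStream = new WritableStream({
  write(chunk) {
    offset += access.write(chunk, { at: offset }); // write() returns bytes written
  },
  close() {
    access.flush();
    access.close();
  }
});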
Looking at the allocation timeline of the worker in DevTools a few seconds into the compression, I can see 500MB of JSArrayBuffer data being retained. Most buffers are of size 98,304 (Uint8Array) or 2,097,152 (Uint16Array), and are retained by Deflate objects held in the u array of the Zip instance. They are buffers and other structures used during compression. It doesn't seem necessary for these to be retained in memory once a file has finished compressing.
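For orientation, this is the shape I believe is being retained per entry; the field names are fflate's minified internals as seen in the heap snapshot, so treat them as observations rather than documented API:

// Observed retention per entry in zip.u, after that file has finished:
//
//   zip.u[i].d   // Deflate stream created by zip.add(); owns the 98,304-byte
//                // Uint8Array buffers and 2,097,152-byte Uint16Array tables
//   zip.u[i].f   // encoded filename bytes; these ARE still needed, because
//                // zip.end() writes them into the central directory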
Workaround
Discard all references to the d Deflate object after the final compressed chunk has been emitted:
const ondata = compressionStream.ondata;
compressionStream.ondata = (error, data, final) => {
  ondata(error, data, final);
  if (final) {
    compressionStream.d = null;
    zip.u.at(-1).d = null; // Object created in `zip.add()`
  }
};
With this in place, my scenario uses 100-500MB of renderer memory, depending on when Chrome garbage collects.
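If it helps anyone, the pattern can be wrapped in a small helper. Note that capturing the zip.u entry at wrap time, rather than calling zip.u.at(-1) inside the callback, avoids relying on the entry still being last when the final chunk arrives. This is a sketch built on the same undocumented internals as the workaround above:

// Hypothetical helper: call immediately after zip.add(compressionStream).
function releaseWhenDone(zip, compressionStream) {
  const entry = zip.u.at(-1); // internal entry just created by zip.add()
  const ondata = compressionStream.ondata;
  compressionStream.ondata = (error, data, final) => {
    ondata(error, data, final);
    if (final) {
      compressionStream.d = null; // drop the Deflate state on the stream
      entry.d = null;             // ...and the reference held by the Zip
    }
  };
}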
Thanks for taking the time to diagnose the issue here! This looks like a good change, I'll make it for the next release.
Great, thanks.
Of course the workaround from client code is hacky, so I'm sure there'll be a better way to do it.
Hey @101arrowz, I ran into the same issue and ended up doing a similar patch as @robatwilliams. Just wanted to check if you've had a chance to work on it or if you'd like some help.
I've been experimenting with removing the .f (filename u8) from the u array and building the central directory header as each entry is added. It seems to work fine with all the zip files I've tested so far, but I'm still getting the hang of zip structures. I'll run some benchmarks to see if it's worth it, at least for large zip files; memory was growing to 7GB in my case before crashing, with over 2000 entries (300-500KB each).
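For anyone following along: this is possible because everything a central directory record needs (CRC-32, sizes, filename, local-header offset) is known once an entry's final chunk has been emitted, so the record can be serialized immediately and the filename bytes dropped. A rough sketch of serializing one record, with the entry shape assumed and timestamps, extra fields, attributes, and zip64 handling omitted:

// Serialize one central directory record (field layout per the ZIP app note, 4.3.12).
function centralDirectoryRecord(entry /* { crc, csize, usize, offset, nameBytes } */) {
  const buf = new Uint8Array(46 + entry.nameBytes.length); // unset fields stay zero
  const dv = new DataView(buf.buffer);
  dv.setUint32(0, 0x02014b50, true);              // central file header signature
  dv.setUint16(4, 20, true);                      // version made by
  dv.setUint16(6, 20, true);                      // version needed to extract
  dv.setUint16(8, 0x0808, true);                  // flags: data descriptor + UTF-8 name
  dv.setUint16(10, 8, true);                      // compression method: deflate
  dv.setUint32(16, entry.crc, true);              // CRC-32
  dv.setUint32(20, entry.csize, true);            // compressed size
  dv.setUint32(24, entry.usize, true);            // uncompressed size
  dv.setUint16(28, entry.nameBytes.length, true); // filename length
  dv.setUint32(42, entry.offset, true);           // offset of the local header
  buf.set(entry.nameBytes, 46);                   // filename
  return buf;
}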
Sorry that work on this has been slow; I tend to make changes in batches. I don't think this is a huge or difficult change, but it is a big flaw in the current design. I'll try to get this done somewhat soon.
Hi all 👋 and big thanks to @101arrowz for this lib and to @robatwilliams for your workaround.
I was able to reduce the memory footprint with the workaround by another 30%:
const ondata = compressionStream.ondata;
compressionStream.ondata = (error, data, final) => {
  ondata(error, data, final);
  if (final) {
    compressionStream.d = null;
    const internal = zip.u.at(-1), // Object created in `zip.add()`
      f = internal.f;
    internal.d = null;
    internal.f = { length: f.length }; // another array, holding some seemingly discardable data
  }
};
Edit: You shouldn't use this "improvement", as it messes up the central directory (the filename bytes in f are still needed when zip.end() writes the central directory records), rendering the archive invalid for the Windows-native zip tool. Other tools still work, even on Windows...
Edit2: Ok, this is what @marcosc90 was talking about :)
As a side note: I would love to work on a PR, but the abbreviated source code makes it impossible for me to get into. Why did you choose not to spell out any variable names or use newlines or consts? Is it just a preference for tiny source size?
Sorry, this might be opinionated, but I was just disappointed when I read the sources.
Again: still a very big thanks to you, @101arrowz. If you can handle it, I guess it's fine.