zig icon indicating copy to clipboard operation
zig copied to clipboard

ability to fetch from zip files

Open andrewrk opened this issue 10 months ago • 16 comments

zig fetch foo.zip should work.

Relevant logic is here:

https://github.com/ziglang/zig/blob/f7bc55c0136b91805bd046a8cc8ea745d7e7567d/src/Package/Fetch.zig#L946-L1024

andrewrk avatar Oct 05 '23 22:10 andrewrk

I happen to have been working with ZIP files in zig anyway, so I might take a shot at this if it's ok. I have a branch to start with:

https://github.com/ziglang/zig/compare/master...MFAshby:build_zip_support

I'll figure out how to build & test it before submitting a PR.

ZIP files are a little different from TAR archives in that they might have data at the start of the file which is not part of the ZIP at all, (a fact taken advantage of by things like self-extracting ZIPs, and the https://redbean.dev/ web server). They also store the metadata preceding the file data, and repeated in a 'central directory' at the end of the file.

Probably the more reliable way to extract the ZIP is to use the central directory, which would need a little code restructuring so that the unpackZip function can seek to the end of the file first. This branch doesn't do that because it only gets a reader for the ZIP contents, which it can't necessarily seek back and forth.

MFAshby avatar Oct 06 '23 12:10 MFAshby

I know enough about zip files to confidently say that the End of Central Directory Record should certainly be used as the source of truth, rather than trying to scan a stream for local headers which may actually just be file contents. I suggest to wait until #17392 lands because the logic is much more straightforward and will not need any restructuring. The function is passed a temporary directory and a reader stream of a remote zip file. So it just needs to pipe the stream into a file in the temporary directory, then unzip all the files into the temporary directory.

andrewrk avatar Oct 06 '23 19:10 andrewrk

A while back I rewrote @jorangreef's pure in Zig, and was about to refactor it to use a Reader before I got tied up with work. I think this would be a better base for a ZIP reader, as pure is a a static-analysis tool to protect against Zip bombs more than a generic Zip library. Of course I would need some helping hands if this were ever to land in the stdlib.

The-King-of-Toasters avatar Oct 07 '23 05:10 The-King-of-Toasters

But one might want to wait until #17392 is done to start working on this.

The package manager rework landed in f7bc55c0136b91805bd046a8cc8ea745d7e7567d. I updated the "relevant logic" snippet above.

andrewrk avatar Oct 09 '23 20:10 andrewrk

Alright I started on this and have a working implementation of a ZIP reader in 400 lines of pure Zig with an Iterator / pipeToFileSystem API just like std.tar :)

Code is at https://github.com/nDimensional/zig-zip for now but I'd like to PR this into lib/std/zip.zig and get package fetching working.

I have a couple questions for you @andrewrk if you have input

  • I'm using std.posix.mmap to read the ZIP file since extraction involves jumping around a lot. Is this ok / sufficiently portable for the package manager?
  • What's the best way to test ZIP extraction? I see that there's a suite of tar files in lib/std/tar/testdata, but given recent events I would feel uncomfortable checking more compressed binary blobs into the repo 🙃. Plus, as far as I can tell, ZIP doesn't have an authoritative RFC or test suite like tar.

joeltg avatar Apr 05 '24 20:04 joeltg

What's the best way to test ZIP extraction? I see that there's a suite of tar files in lib/std/tar/testdata, but given recent events I would feel uncomfortable checking more compressed binary blobs into the repo 🙃. Plus, as far as I can tell, ZIP doesn't have an authoritative RFC or test suite like tar.

libzip has a regression testing suite. I think APPNOTE.txt is the spec from the original developer of the zip format, it does not come with a test suite though.

MFAshby avatar Apr 06 '24 15:04 MFAshby

Okay here's my current idea. My implementation has a small test suite that uses the system zip command (via std.ChildProcess) to create archives, extract them using zip.pipeToFileSystem, and then verify the result.

My understanding is that this type of testing is appropriate for https://github.com/ziglang/contrib-testing - so maybe I open a PR to create lib/std/zip.zig without tests, and then move zip_test.zig into its own repo and add it to ziglang/contrib-testing?

joeltg avatar Apr 18 '24 00:04 joeltg

no, a std.zip implementation would need tests in the main repo that do not rely on any system tools to run. there's multiple existing suites with files and known contents that can be adapted for our use.

eg test blocks that reference files, read them in, and use std.testing to compare the resulting structures

nektro avatar Apr 18 '24 00:04 nektro

Okay happy to do that too, just wondered if there was a way to avoid checking more binary blobs into the repo :)

joeltg avatar Apr 18 '24 01:04 joeltg

I'm using std.posix.mmap to read the ZIP file since extraction involves jumping around a lot. Is this ok / sufficiently portable for the package manager?

Here's an mmap that should work on posix and windows: https://github.com/marler8997/zigx/blob/master/MappedFile.zig

marler8997 avatar Apr 20 '24 21:04 marler8997

I'm using std.posix.mmap to read the ZIP file since extraction involves jumping around a lot. Is this ok / sufficiently portable for the package manager?

No, mmap is problematic because instead of convenient error codes from I/O you get signals which are nearly impossible to handle correctly.

Mmap should be avoided for this use case.

andrewrk avatar Apr 20 '24 21:04 andrewrk

I was thinking of taking a stab at this and wanted to clarify my understanding of the solutions. Does this breakdown look correct?

Solution 1: Process zip as a stream (front to back)

No. zip files cannot be interpreted correctly front-to-back, they must be done back to front starting from the central directory header/record (https://games.greggman.com/game/zip-rant/).

Solution 2: Read entire zip contents from reader into memory

No. A zip file could be too large to fit into memory. Let's write it to a file first instead, then we can load only the parts we need into memory as we go.

Solution 3: Use MMAP after writing it to disk

No. Nearly impossible to handle IO errors correctly

Solution 4: Read/Process zip file in chunks after writing it to disk using normal reads/writes

Yes

marler8997 avatar Apr 21 '24 15:04 marler8997

Here's my version that avoids mmap based on Joel's implementation:

https://github.com/marler8997/zig-zip/blob/3bcdd558410ae0c78068e9d6635e177fcce4f206/src/zip.zig

I'll see if I can get a PR together for it.

marler8997 avatar Apr 21 '24 18:04 marler8997

Nice! Although there's still more I need to work into my impl, at least Zig64 support and executable bit preservation

joeltg avatar Apr 21 '24 18:04 joeltg

FYI, I added a bufferedReader when passing the file stream to the inflate decompressor...and performance went from about 3 to 4 times slower than the mmap version to about the same performance. (i.e. decompressing zig-windows-x86_64-0.12.0.zip went from 29 seconds to 8 seconds on Windows and from 4.5 to 1.5 seconds on linux).

marler8997 avatar Apr 21 '24 19:04 marler8997

Here's the branch I'm currently working on: https://github.com/ziglang/zig/compare/master...marler8997:zig:zip

marler8997 avatar Apr 21 '24 20:04 marler8997