
📁 haxe.zip.Reader - algorithm for reading ZIP files in std is incorrect

Open c-g-dev opened this issue 7 months ago • 2 comments

I had a discussion with @tobil4sk about this in the discord but there probably needs to be an issue raised.

The current algorithm in haxe.zip.Reader reads the data size/compressed size from the Local File Headers and/or their data descriptors. However, this is not actually a valid way to read entry sizes -- the ZIP standard does not guarantee that these values are populated there. The only authoritative source for entry sizes is the Central Directory Record.

As per wikipedia:

A ZIP file is correctly identified by the presence of an end of central directory record which is located at the end of the archive structure in order to allow the easy appending of new files. If the end of central directory record indicates a non-empty archive, the name of each file or directory within the archive should be specified in a central directory entry, along with other metadata about the entry, and an offset into the ZIP file, pointing to the actual entry data. This allows a file listing of the archive to be performed relatively quickly, as the entire archive does not have to be read to see the list of files. The entries within the ZIP file also include this information, for redundancy, in a local file header. Because ZIP files may be appended to, only files specified in the central directory at the end of the file are valid. Scanning a ZIP file for local file headers is invalid (except in the case of corrupted archives), as the central directory may declare that some files have been deleted and other files have been updated.
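The structure described above can be sketched concretely. The following is a minimal illustration, in Python rather than Haxe only because its stdlib zipfile module makes it easy to build a throwaway archive: it reverse-scans for the end-of-central-directory signature and lists file names from the central directory alone, without touching any Local File Header.

```python
import io, struct, zipfile

# Build a small test archive in memory (file names are arbitrary examples).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", b"hello")
    z.writestr("b.txt", b"world")
data = buf.getvalue()

# 1. Reverse-scan for the End Of Central Directory signature (PK\x05\x06).
#    It sits at the very end of the file, possibly followed by a comment.
eocd_pos = data.rfind(b"PK\x05\x06")
assert eocd_pos != -1

# EOCD layout: total entry count at offset 10, CD size at 12, CD offset at 16.
cd_count, cd_size, cd_offset = struct.unpack(
    "<HII", data[eocd_pos + 10 : eocd_pos + 20])

# 2. Walk the central directory entries (signature PK\x01\x02) to list names.
names = []
pos = cd_offset
for _ in range(cd_count):
    assert data[pos : pos + 4] == b"PK\x01\x02"
    name_len, extra_len, comment_len = struct.unpack(
        "<HHH", data[pos + 28 : pos + 34])
    names.append(data[pos + 46 : pos + 46 + name_len].decode())
    pos += 46 + name_len + extra_len + comment_len

print(names)  # ['a.txt', 'b.txt']
```

Note that this listing never reads any entry's content bytes, which is exactly why it stays fast and correct even for appended-to archives.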

In practice, whenever I make a ZIP file from my Win11 machine, it does not work with haxe.zip.Reader. I don't know whether Win11 uses a different algorithm to write .zips, but the files work in every other application I use them with -- the only thing they don't work with is haxe.zip.Reader. Looking into their data, I found that they do not store sizes in the Local File Headers. The sizes MIGHT exist in the data descriptor, but you cannot actually find the data descriptor without the sizes being in the header: you have to skip the content bytes to reach the descriptor, and if you don't know the size of the content, you don't know how far to skip.
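This situation is easy to reproduce with any writer that emits data descriptors: when an archive is written to a non-seekable stream, general-purpose flag bit 3 is set and the local header sizes are left as zero, with the real sizes pushed into the data descriptor after the content. A small sketch (again Python, using zipfile only to generate such an archive; I can't confirm this is exactly what the Win11 writer does, but it produces the same local-header shape):

```python
import io, struct, zipfile

class NonSeekable(io.RawIOBase):
    """A write-only stream, like a socket: forces data-descriptor output."""
    def __init__(self):
        self.buf = bytearray()
    def writable(self):
        return True
    def write(self, b):
        self.buf += b
        return len(b)

stream = NonSeekable()
with zipfile.ZipFile(stream, "w") as z:
    z.writestr("a.txt", b"hello")
data = bytes(stream.buf)

# Local File Header layout: flags at offset 6,
# compressed size at offset 18, uncompressed size at offset 22.
assert data[:4] == b"PK\x03\x04"
(flags,) = struct.unpack("<H", data[6:8])
comp_size, uncomp_size = struct.unpack("<II", data[18:26])

print(flags & 0x08, comp_size, uncomp_size)  # 8 0 0
# Bit 3 is set and both sizes are zero: a forward-only reader has no way
# to know how many content bytes to skip to reach the data descriptor.
```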

tl;dr: we need to change haxe.zip.Reader to read entry metadata from the Central Directory Record rather than seeking through the Local File Headers.

I would gladly make a candidate PR for this if someone would sign off on the approach.

c-g-dev avatar Apr 23 '25 19:04 c-g-dev

Hi! Since speaking on discord, we've also realised that the way zip files are currently read is very memory inefficient, which has severe performance implications for the haxelib server. Listing file names in a zip file should only require reading the central directory, and reading a single file from the archive into memory should not require reading all the files into memory.

We probably need a new class that takes in a sys.io.FileInput instead of haxe.io.Input, so we can seek the correct position in the file rather than reading it like a stream. This would allow us to locate the central directory records first before going to individual file headers as needed.

For the purpose of haxelib, it would be useful to have api functions that allow us to:

  • List file names, perhaps in some Central Directory Record object
  • Load the file content for that given Central Directory Record object
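That two-step shape resembles what other standard libraries already expose. Python's zipfile is shown here purely as an illustration of the API contract (list cheaply from the central directory, then load one member on demand), not as a suggested implementation:

```python
import io, zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("haxelib.json", b'{"name": "example"}')
    z.writestr("src/Main.hx", b"class Main {}")

with zipfile.ZipFile(buf) as z:
    # Step 1: list names -- only the central directory is parsed.
    print(z.namelist())            # ['haxelib.json', 'src/Main.hx']
    # Step 2: load one member on demand -- the reader seeks to that
    # entry's local header and reads just its bytes.
    print(z.read("haxelib.json"))  # b'{"name": "example"}'
```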

tobil4sk avatar Apr 24 '25 09:04 tobil4sk

> We probably need a new class that takes in a sys.io.FileInput instead of haxe.io.Input, so we can seek the correct position in the file rather than reading it like a stream. This would allow us to locate the central directory records first before going to individual file headers as needed.

Yes, I have been working on an implementation of this fix and I was surprised to learn that you really do NEED to be able to reverse-scan to be fully .zip spec-compliant...

Would it be appropriate, for backwards compatibility's sake, to do something like:

var unsafeRead:Bool = false; // config: whether to try to trust the Local File Headers

if (input is sys.io.FileInput) {
    // seekable: locate the central directory, then seek to entries as needed
} else if (unsafeRead) {
    // fall back to the current streaming logic (trusts Local File Headers)
} else {
    // read the entire input into memory so we can seek manually
}
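A rough cross-language analogue of that branching, sketched in Python (`open_zip` is a hypothetical name, not proposed API, and the `unsafeRead` streaming branch is omitted since zipfile has no equivalent): seekable inputs are used directly, anything else is buffered fully into memory so the central-directory path still works.

```python
import io, zipfile

def open_zip(stream):
    """Hypothetical helper: use the stream directly if it is seekable,
    otherwise buffer everything into memory so we can still start from
    the central directory. Assumes the stream exposes seekable()/read()."""
    if stream.seekable():
        return zipfile.ZipFile(stream)                     # seek-based path
    return zipfile.ZipFile(io.BytesIO(stream.read()))      # buffered fallback

# Demo with a seekable in-memory archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", b"hi")
buf.seek(0)
print(open_zip(buf).namelist())  # ['a.txt']
```

The memory-buffering fallback preserves correctness for pure streams at the cost of peak memory, which seems like the right trade-off given that streaming reads of data-descriptor archives are unreliable anyway.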

c-g-dev avatar May 05 '25 12:05 c-g-dev