Issue trying to read stats on file in zip files

Open • sgmoore opened this issue 3 years ago • 1 comment

I am only starting to learn Nim and I want to read the stats of files inside a zip file. Zipfiles.nim does not expose this information, so I am trying to extend it with a new iterator like this:

iterator walkFileStats*(z: var ZipArchive): ZipStat =
  ## Walks over all files in the archive `z` and yields each file's stats.
  var i = 0'i32
  let num = zip_get_num_files(z.w)
  while i < num:
    # Look up the entry name, then ask libzip for that entry's stats.
    let name = zip_get_name(z.w, i, 0'i32)
    var stat: ZipStat
    let r = zip_stat(z.w, name, ZIP_FL_UNCHANGED, addr(stat))
    if r == 0:
      yield stat
    inc(i)

But this does not return the correct results. On 64-bit Windows the name and date look correct, but the size and compressed size do not. I need the final program to run on some 32-bit Windows machines, and when I compile for 32-bit the name is still correct, but the date is garbage and the sizes are also wrong.

I got a little further by looking at the definition of ZipStat in libzip.nim, which declares mtime to be of type Time. If I change this to int32, then fromUnix returns the correct time for both the 32-bit and 64-bit builds. The sizes are then correct for the 32-bit build, but not for the 64-bit build.

Since I don't need a 64-bit version of the program, this gives me what I need, but I would prefer to fix the issue properly.

Has anyone any ideas?

sgmoore, Aug 07 '20 11:08

I had a chance to look at this further.

It seems that the zip_stat structure in libzip has changed considerably since version 0.9, on which libzip_all.c was based. The Unix build links against the installed libzip.so by default, so the structure it sees differs from the one used on Windows, or on Unix when useLibzipSrc is defined (both of which compile the bundled 0.9 source). More importantly, the actual layout may depend on which version of libzip.so is installed, which may not be known at compile time. Hence using this structure could cause problems.
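A minimal sketch of why a mismatched declaration produces plausible-looking but wrong numbers. The two object types below are hypothetical stand-ins, not the real zip_stat layouts: one plays the role of what the C library writes, the other the role of an out-of-date declaration on the Nim side. Reading the same bytes through the narrower declaration yields halves of the wrong fields; which halves depends on the machine's byte order, so no particular output is claimed here.

```nim
type
  WideStat = object      # hypothetical: fields as a newer C library might write them
    size: uint64
    compSize: uint64
  NarrowStat = object    # hypothetical: an out-of-date declaration on the Nim side
    size: uint32
    compSize: uint32

var written = WideStat(size: 1000, compSize: 400)

# Reinterpret the same bytes through the narrower declaration,
# which is effectively what happens when the binding and the library disagree.
let seen = cast[ptr NarrowStat](addr written)[]

echo "declared sizes: ", sizeof(WideStat), " vs ", sizeof(NarrowStat), " bytes"
echo "size read back: ", seen.size, ", compSize read back: ", seen.compSize
```

No error is raised in this situation; the fields simply come back with values taken from the wrong offsets, which matches the symptom of a correct name alongside incorrect sizes.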

For that reason, I suspect this issue should probably be closed as Won't fix.

That said, it proved useful and accurate for me, so in case anyone else really needs this, I will report my conclusions.

Firstly, it makes more sense to use zip_stat_index, which looks entries up by index rather than by name, and hence define walkFileStats as:

iterator walkFileStats*(z: var ZipArchive): ZipStat =
  ## Walks over all files in the archive `z` and yields each file's stats.
  var i = 0'i32
  let num = zip_get_num_files(z.w)
  while i < num:
    var stat: ZipStat
    let r = zip_stat_index(z.w, i, ZIP_FL_UNCHANGED, addr(stat))
    if r == 0:
      yield stat
    inc(i)

I changed the ZipStat structure to be:

ZipStat* = object                ## the 'zip_stat' struct
  when defined(unix) and not defined(useLibzipSrc):
    valid*: int
    name*: cstring               ## name of the file
    index*: int                  ## index within archive
    size*: uint64                ## size of file (uncompressed)
    compSize*: uint64            ## size of file (compressed)
    mtime*: uint                 ## modification time
    crc*: uint32                 ## crc of file data
    compMethod*: int16           ## compression method used
    encryptionMethod*: int16     ## encryption method used
    flags*: uint32
  else:
    name*: cstring               ## name of the file
    index*: int32                ## index within archive
    crc*: uint32                 ## crc of file data
    mtime*: int                  ## modification time
    when defined(windows):
      size*: uint32              ## size of file (uncompressed)
      compSize*: uint32          ## size of file (compressed)
    else:
      size*: uint64              ## size of file (uncompressed)
      compSize*: uint64          ## size of file (compressed)
    compMethod*: int16           ## compression method used
    encryptionMethod*: int16     ## encryption method used

I have tested this against the output from InfoZip V6 on over 1400 zip files that I have on my system.

Tested using 32-bit Windows, 64-bit Windows, 64-bit Linux (standard build, which uses libzip.so), and 64-bit Linux with useLibzipSrc defined.

I was unable to get a 32-bit Linux version to compile, so that configuration is untested.

Observations

name: The only differences occur when a filename has unusual characters, e.g. below 32 or above 127. The number of names with differences varied between the builds.

mtime

  • There are lots of timezone issues; for example, dates differ by 1 hour or by 8 hours.
  • A number of files reported times which differ by 1 minute.
  • There are differences in handling dates before 1980.
  • I have a few files with dates past 2038, which don't fit in a 32-bit signed integer and hence were not reported correctly by the 32-bit Windows build.
  • I also have one file inside one zip file that shows valid but completely different dates. I'm not sure what the cause is, but it is not restricted to Nim. For example, Windows Explorer agrees with Nim whereas 7z agrees with InfoZip.

It would appear that such issues with handling dates occur with other zip libraries.
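Some of the timestamp differences listed above are consistent with how the classic zip header stores modification times: as an MS-DOS date and time pair in local time, counted from 1980, with seconds stored in 2-second steps and no timezone. The sketch below decodes such a pair; the proc and type names are made up for illustration and are not part of zipfiles.nim or libzip. It is only a possible explanation: tools may also read other timestamp fields when present, which could account for tools disagreeing with each other.

```nim
type
  DosStamp = tuple[year, month, day, hour, minute, second: int]

proc decodeDosDateTime(dosDate, dosTime: uint16): DosStamp =
  ## Unpack the MS-DOS date and time words (hypothetical helper).
  ## Date: 7 bits year since 1980, 4 bits month, 5 bits day.
  ## Time: 5 bits hour, 6 bits minute, 5 bits seconds divided by 2.
  result.year   = int(dosDate shr 9) + 1980
  result.month  = int((dosDate shr 5) and 0x0F'u16)
  result.day    = int(dosDate and 0x1F'u16)
  result.hour   = int(dosTime shr 11)
  result.minute = int((dosTime shr 5) and 0x3F'u16)
  result.second = int(dosTime and 0x1F'u16) * 2

# 20743 and 22799 are the packed forms of 2020-08-07 and 11:08:30.
let stamp = decodeDosDateTime(20743'u16, 22799'u16)
echo stamp
```

Because the stored value carries no timezone, whatever converts it to a Unix time has to pick one, which is one way two tools can disagree by a whole number of hours. The 1980 base also means earlier dates cannot be represented in this field at all.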

compSize: Matches for non-encrypted files. The zip spec says each encrypted file has an extra 12 bytes stored at the start of the data area, defining the encryption header for that file. InfoZip does not count these 12 bytes, so for encrypted files its figure is always 12 lower than compSize.
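When comparing against InfoZip output, the 12-byte difference can be removed with a small adjustment. This is a sketch only: payloadSize and its tradEncrypted parameter are hypothetical names, and it assumes the entry uses the traditional encryption scheme with the 12-byte header described above. Other encryption schemes may add a different amount of overhead.

```nim
const tradEncryptionHeaderLen = 12'u64   # the per-file header described above

proc payloadSize(compSize: uint64, tradEncrypted: bool): uint64 =
  ## Compressed size without the encryption header (hypothetical helper).
  ## Leaves the value alone for unencrypted entries, and for sizes too
  ## small to contain a header at all.
  if tradEncrypted and compSize >= tradEncryptionHeaderLen:
    compSize - tradEncryptionHeaderLen
  else:
    compSize

echo payloadSize(112'u64, true)    # encrypted entry: header removed
echo payloadSize(100'u64, false)   # unencrypted entry: unchanged
```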

crc, size and compMethod all match.

encryptionMethod is not output by InfoZip, but since the 12-byte compSize difference shows up exactly where it is set, it is probably accurate.

Conclusion: In my use case, the time-zone and DST issues did not apply, and I knew that none of the filenames had unusual characters and that all files had valid dates, so this solution proved adequate. That may not be good enough for everyone.

sgmoore, Aug 31 '20 15:08