
Intent to (try to) implement: Deflate64 read support

Open dunhor opened this issue 1 year ago • 10 comments

We noticed that Deflate64 support is missing from the Zip reader code. However, Deflate64 is one of the compression methods offered by many compression tools on Windows, so it's not uncommon to come across such archives.

This issue is intended partly to serve as a notice to others who might want to add support at the same time, so that there's no accidental duplication of effort, and partly to solicit information from anyone who has attempted to add support in the past, along with any other suggestions or recommendations. I have not started this work, nor have I begun any research into what might be required, so I may be stepping into something over my head, but I'll try my best and use this issue for any updates.


Side note: AFAICT libarchive also currently lacks write support for Deflate64. Unless read support turns out to be easy and write support looks to be just as easy, I have no intention at this time to try and add write support as well.

dunhor • Sep 10 '24 17:09

Deflate64 is pretty pointless as algorithm and that's why libraries like zlib (which we depend on for Deflate) don't implement it. It would be nice if Microsoft stopped with the nonsense of defaulting to it for larger files and just went with a modern compression algorithm.

jsonn • Sep 10 '24 20:09

> Deflate64 is pretty pointless as algorithm

I agree; however, whether or not it's pointless doesn't change the fact that such archives exist and that the ability to read them is valuable. I see now that I may be opening a can of worms, but I also see that there appears to be an (unsupported?) implementation in zlib via contrib/infback9. I'm not yet sure how this code is intended to be consumed, as it's not distributed as part of zlib (at least it doesn't appear to be part of its vcpkg distribution).

> It would be nice if Microsoft stopped with the nonsense of defaulting to it for larger files

Technically we have now ;). Though it just always uses Deflate by default.

dunhor • Sep 10 '24 20:09

Investigating more, the two most likely approaches we can take here are:

  1. Check the infback9 code into libarchive and compile it only if the zlib package is found. Idk if there's precedent for including 3rd party code in libarchive, however its license seems rather permissive.
  2. Alternatively, update the CMakeLists.txt to try and detect the presence of the infback9 code. I'm not aware of any package that distributes the infback9 code, so this may need to be a "does this compile" check.

dunhor • Sep 11 '24 20:09

Write support for Deflate64 is probably not worth the effort. (It's certainly not worth implementing Deflate64 from scratch in libarchive just for that.)

There is a valid argument for trying to get read support, but there are also concerns: I do not remember all the reasons that zlib has refused to include it, and it's possible some of those reasons apply to us as well. I'm not thrilled with having a full implementation in libarchive. We currently do have a couple of compression algorithms implemented directly in libarchive and it's frankly been rather painful; we should not expand that any further.

kientzle • Sep 12 '24 04:09

FYI: The official position from the zlib authors is still:

> Does zlib support the new "Deflate64" format introduced by PKWare?
>
> No. PKWare has apparently decided to keep that format proprietary, since they have not documented it as they have previous compression formats. In any case, the compression improvements are so modest compared to other more modern approaches, that it's not worth the effort to implement.

So basically, in order to support Deflate64, we would not be able to use (standard) zlib. That's a problem for me.

kientzle • Sep 12 '24 04:09

After more research, it appears as though the infback9 code needs to be compiled into zlib, so my option (1) from above isn't feasible, even if we wanted to take that route.

> in order to support Deflate64, we would not be able to use (standard) zlib. That's a problem for me.

From what I've gathered so far, libarchive doesn't control which libraries are used, nor their versions. So in theory, someone could use a patched or forked version of zlib, which we could detect with something like:

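  # Probe for a zlib build that has contrib/infback9 compiled in.
  INCLUDE(CMakePushCheckState)
  INCLUDE(CheckCSourceCompiles)
  CMAKE_PUSH_CHECK_STATE()
  SET(CMAKE_REQUIRED_INCLUDES ${ZLIB_INCLUDE_DIR})
  SET(CMAKE_REQUIRED_LIBRARIES ${ZLIB_LIBRARIES})
  CHECK_C_SOURCE_COMPILES("#include <zlib.h>
    #include <infback9.h>
    unsigned in(void* desc, unsigned char** buf) { (void)desc; (void)buf; return 0; }
    int out(void* desc, unsigned char* buf, unsigned len) { (void)desc; (void)buf; (void)len; return 0; }
    int main(void) {
      z_stream zs;
      unsigned char window[65536]; /* inflateBack9 expects a 64K window */
      void* desc = 0;
      inflateBack9Init(&zs, window);
      inflateBack9(&zs, in, desc, out, desc);
      inflateBack9End(&zs);
      return 0;
    }" HAVE_INFBACK9)
  CMAKE_POP_CHECK_STATE()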

and then conditionally enable support. That would likely be a pain for someone to set up, and probably nearly impossible for libarchive to test, but it does seem doable. This would be completely opt-in, but if that still makes you uneasy, that's understandable. Otherwise, apart from adding decoding support directly into libarchive, I believe I saw that the LZMA SDK has decode support for Deflate64, though I believe that's a C++ library and I'm unsure how feasible it might be to incorporate into libarchive.
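As a rough illustration of what "conditionally enable support" could mean on the C side, here is a minimal sketch. It assumes the `HAVE_INFBACK9` macro from the check above ends up in the build configuration; `zip_method_supported` and the `ZIP_METHOD_*` names are hypothetical and are not taken from libarchive's sources. The method numbers (0, 8, 9) are the ones assigned in the ZIP APPNOTE.

```c
/* Compression method IDs as assigned in the ZIP APPNOTE. */
#define ZIP_METHOD_STORED     0
#define ZIP_METHOD_DEFLATE    8
#define ZIP_METHOD_DEFLATE64  9

/*
 * Hypothetical helper (not a libarchive function): returns 1 if an entry
 * using this compression method could be decoded by this build, else 0.
 * Simplification: a real reader would also gate Deflate on zlib itself.
 */
static int zip_method_supported(unsigned method)
{
    switch (method) {
    case ZIP_METHOD_STORED:
    case ZIP_METHOD_DEFLATE:
        return 1;
    case ZIP_METHOD_DEFLATE64:
#ifdef HAVE_INFBACK9
        return 1;   /* patched zlib with infback9 was detected at configure time */
#else
        return 0;   /* stock zlib: Deflate64 entries stay unreadable */
#endif
    default:
        return 0;
    }
}
```

With a stock zlib, `zip_method_supported(ZIP_METHOD_DEFLATE64)` stays 0, which matches the opt-in behavior described here.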

It's quite unfortunate that zlib's stance on no integrated Deflate64 support seems to be based on its "modest improvements" with no consideration for the utility of being able to decompress data encoded by other software. They've also seemingly already paid the cost for an implementation, so it not being "worth the effort" doesn't seem all that strong anymore (unless perhaps it's considering the effort of testing?).

dunhor • Sep 12 '24 18:09

Coming back around to this. Upon further inspection, it seems that inflateBack/inflateBack9 require decompressing an entire entry in a single call. This makes these functions pretty much incompatible with libarchive's model of incrementally decompressing "chunks" of data that are returned to the caller. So even if someone were to go through the trouble of patching zlib as I mentioned above, we couldn't make it usable from archive_read_data, at least not without potentially allocating a bunch of data up front so that the entire uncompressed entry can reside in memory, which seems excessive.
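To make the mismatch concrete, here is a small sketch in C. It does not use zlib: `toy_decode_all` is a hypothetical stand-in that only copies bytes, and every name in it is made up for illustration. The callback shapes are modeled on zlib's `in_func`/`out_func`. The decoder owns the loop and only returns once the input is exhausted, so a pull-style reader has to buffer the whole entry first.

```c
#include <stdlib.h>
#include <string.h>

/* Callback shapes modeled on zlib's in_func/out_func. */
typedef unsigned (*in_cb)(void *desc, unsigned char **buf);
typedef int (*out_cb)(void *desc, unsigned char *buf, unsigned len);

/* Hypothetical stand-in for an inflateBack-style call. It copies bytes
 * instead of inflating, and it runs until in() reports no more input. */
static int toy_decode_all(in_cb in, void *in_desc, out_cb out, void *out_desc)
{
    unsigned char *chunk;
    unsigned n;
    while ((n = in(in_desc, &chunk)) != 0) {
        if (out(out_desc, chunk, n) != 0)
            return -1;  /* a failing out() aborts the decode; it cannot pause it */
    }
    return 0;
}

/* Input side: hands out a memory buffer four bytes at a time. */
struct mem_source { const unsigned char *data; size_t len, pos; unsigned char piece[4]; };

static unsigned source_next(void *desc, unsigned char **buf)
{
    struct mem_source *s = desc;
    size_t n = s->len - s->pos;
    if (n > sizeof s->piece)
        n = sizeof s->piece;
    memcpy(s->piece, s->data + s->pos, n);
    s->pos += n;
    *buf = s->piece;
    return (unsigned)n;  /* 0 signals end of input */
}

/* Output side: the whole entry accumulates in one growing buffer. */
struct whole_entry { unsigned char *data; size_t len, cap, read_pos; };

static int sink_append(void *desc, unsigned char *buf, unsigned len)
{
    struct whole_entry *e = desc;
    if (e->len + len > e->cap) {
        size_t cap = e->cap ? e->cap * 2 : 16;
        while (cap < e->len + len)
            cap *= 2;
        unsigned char *p = realloc(e->data, cap);
        if (p == NULL)
            return 1;
        e->data = p;
        e->cap = cap;
    }
    memcpy(e->data + e->len, buf, len);
    e->len += len;
    return 0;
}

/* Pull-style read, loosely in the spirit of archive_read_data: it can only
 * serve chunks after the entire entry has already been decoded into memory. */
static size_t entry_read(struct whole_entry *e, void *dst, size_t want)
{
    size_t left = e->len - e->read_pos;
    if (want > left)
        want = left;
    if (want == 0)
        return 0;
    memcpy(dst, e->data + e->read_pos, want);
    e->read_pos += want;
    return want;
}
```

A caller would run `toy_decode_all(source_next, &src, sink_append, &entry)` once and then call `entry_read` repeatedly. The up-front, entry-sized buffer in `struct whole_entry` is the cost that seems excessive.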

Additionally, I've been unable to find evidence of support for Deflate64 in the LZMA SDK. I don't recall where I read that it supposedly supports it, but I can find almost no mention of it in the source code, especially not as a general purpose decompression algorithm, and especially not as something that's easily consumable from C.

Beyond those two, I've thus far been unable to find any mention of a good library with Deflate64 support from a reliable and trustworthy source. That said, when trying to find one, I did come across a change to the dotnet repo adding Deflate64 support to the System.IO.Compression namespace. The "unfortunate" news is that it looks like this support code is authored in C#, making it not easily consumable by non-managed code. Still, it's probably fair to say that this implementation is decently well tested. In fact, it looks like there's a Rust crate for Deflate64 support that's based off the dotnet version. I'm unsure how to accurately gauge the popularity of Rust crates; however, at 3M+ downloads, it's probably safe to say that it has pretty healthy adoption and is also likely battle tested with real world data.

Of course, neither of these is a realistic option for libarchive to take a dependency on; however, it might be worth considering porting the dotnet code to C, similar to what was done for Rust. We could either:

  1. Port the dotnet code to C and add it to libarchive, OR
  2. Port the dotnet code to C as its own standalone library and then add support to libarchive to consume this new library

From @kientzle's previous comment, it sounds like option (1) is not preferred, leaving option (2). I spoke with @DHowett last week about the possibility of adding such a library under the Microsoft GitHub organization and he was on board with this idea. What are y'all's thoughts on this as a possible path forward?

dunhor • Oct 10 '24 23:10

That does sound like a promising approach, especially if it was generously licensed to simplify adoption by our many open-source clients.

kientzle • Oct 11 '24 04:10

We'll get the ball rolling on that then. I'll update this thread when we have more info to share.

dunhor • Oct 11 '24 21:10

It's been a while, but we've finally gotten around to publishing this work. It'll probably be a little while before I'm able to update libarchive to consume it. In the meantime, we'd love feedback on the implementation, API, etc.

https://github.com/microsoft/inflatelib

dunhor • Sep 24 '25 20:09