
Add reusable Huffman decoder

Open ppkarwasz opened this issue 4 months ago • 6 comments

This PR adds a reusable Huffman decoder for canonical Huffman codes. The decoder can be shared across multiple algorithms, including DEFLATE, DEFLATE64, and BZip2.

By centralizing Huffman code handling, the implementation reduces duplication and makes compressor code easier to maintain and extend.

Work in progress:

  • [x] Refactor BZip2 implementation
  • [x] Refactor DEFLATE64 implementation
  • [ ] (optional) Refactor DEFLATE implementation: currently relies on InflaterInputStream from the JDK

ppkarwasz avatar Aug 29 '25 10:08 ppkarwasz

Hello @ppkarwasz

Are the implementations in DEFLATE, DEFLATE64, and BZip2 identical? Do we know the specs for these formats well enough to have confidence that we won't end up back where we started, accommodating tweaks, features, or oddities that are format-specific? We'll only know some of this as this refactoring proceeds.

I don't think we need a new package called "alg", especially with a single class in it. If this new class is designed for reuse among compressors, then it can go in "compressors". Or "huffman" if we end up with more classes.

garydgregory avatar Aug 29 '25 12:08 garydgregory

Are the implementations in DEFLATE, DEFLATE64, and BZip2 identical? Do we know the specs for these formats well enough to have confidence that we won't end up back where we started, accommodating tweaks, features, or oddities that are format-specific? We'll only know some of this as this refactoring proceeds.

They aren’t identical end-to-end, but they share the same canonical Huffman decoding primitive: given an array of code lengths, map an LSB/MSB bitstream to symbols. This PR extracts just that primitive into a reusable HuffmanDecoder. Format-specific parts remain outside (block structure, alphabets, extra bits, selectors, etc.).
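For context, this is roughly the shape of that primitive. The following is a minimal sketch with made-up names (`CanonicalDecoder`, `BitSource`, `decodeSymbol`), not the PR's actual API: build a per-length code count and a symbol table sorted by code length, then walk them one bit at a time (the classic approach from RFC 1951, section 3.2.2):

```java
import java.io.IOException;
import java.util.Arrays;

/**
 * Minimal canonical Huffman decoder (illustrative names, not the PR's API).
 * Built from a per-symbol code-length table; decodes one symbol at a time
 * from a source that delivers the bits of each code most-significant first.
 */
final class CanonicalDecoder {

    /** Supplies the next single bit of a code, MSB of the code first. */
    interface BitSource {
        int nextBit() throws IOException;
    }

    private final int maxLength;
    private final int[] count;   // count[len] = number of codes of length len
    private final int[] symbols; // symbols ordered by (code length, symbol value)

    CanonicalDecoder(final int[] codeLengths) {
        maxLength = Arrays.stream(codeLengths).max().orElse(0);
        count = new int[maxLength + 1];
        for (final int len : codeLengths) {
            if (len > 0) {
                count[len]++;
            }
        }
        // Offset of the first symbol of each length within 'symbols'.
        final int[] offsets = new int[maxLength + 2];
        for (int len = 1; len <= maxLength; len++) {
            offsets[len + 1] = offsets[len] + count[len];
        }
        symbols = new int[offsets[maxLength + 1]];
        final int[] next = Arrays.copyOf(offsets, offsets.length);
        for (int symbol = 0; symbol < codeLengths.length; symbol++) {
            final int len = codeLengths[symbol];
            if (len > 0) {
                symbols[next[len]++] = symbol;
            }
        }
    }

    /** Decodes the next symbol; one table lookup per code length. */
    int decodeSymbol(final BitSource in) throws IOException {
        int code = 0;   // code accumulated so far
        int first = 0;  // first canonical code of the current length
        int index = 0;  // index of the first symbol of the current length
        for (int len = 1; len <= maxLength; len++) {
            code |= in.nextBit();
            if (code - first < count[len]) {
                return symbols[index + (code - first)];
            }
            index += count[len];
            first = (first + count[len]) << 1;
            code <<= 1;
        }
        throw new IOException("invalid Huffman code");
    }
}
```

Everything format-specific (what the symbols mean, how the code lengths themselves are transmitted) stays in the caller.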

I have also refactored the DEFLATE64 implementation, so this PR is ready for a review.

I don't think we need a new package called "alg", especially with a single class in it. If this new class is designed for reuse among compressors, then is can go in "compressors". Or "hufman" if we end up with more classes.

Agreed. I moved the class to compressors.support for now. If we don’t want it as public API, I can relocate it to an internal package (and mark it as non-API in the module/OSGi metadata).

ppkarwasz avatar Aug 29 '25 21:08 ppkarwasz

@garydgregory,

I had completely forgotten about this one, but it’s now in good shape. The PR is complete: both BZip2 and DEFLATE64 now use the shared HuffmanDecoder. The only adjustment needed was handling bit ordering within a byte:

  • BZip2 reads bits starting from the most significant bit.
  • DEFLATE64 reads bits starting from the least significant bit.

The tricky part shows up when BitInputStream retrieves multiple bits at once: regardless of whether it’s constructed with ByteOrder.LITTLE_ENDIAN or ByteOrder.BIG_ENDIAN, it always delivers the bits in the order they appear in the byte. For DEFLATE64, that means we need to swap the retrieved bits.
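For illustration, the swap boils down to reversing the low bits of the value read from the stream. Here is a minimal sketch (`reverseBits` is a hypothetical helper, not an existing Commons Compress method):

```java
/**
 * Reverses the lowest {@code numBits} bits of {@code value}
 * (numBits must be between 1 and 32). Shows the kind of swap needed
 * when a decoder expects the bits of a code MSB-first but the stream
 * stores them LSB-first within each byte, as DEFLATE/DEFLATE64 do.
 */
static int reverseBits(final int value, final int numBits) {
    int reversed = 0;
    for (int i = 0; i < numBits; i++) {
        reversed = (reversed << 1) | ((value >>> i) & 1);
    }
    return reversed;
}
```

Equivalently, `Integer.reverse(value) >>> (32 - numBits)` produces the same result in a single intrinsic call for `numBits` between 1 and 32.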

Looking more broadly, Commons Compress would probably benefit from having two families of prefix-code decoders:

  1. Canonical Huffman decoder (this PR)

    • Uses a very compact representation of canonical Huffman codes.

    • Needs only:

      • One table for the alphabet, ordered by increasing code length.

      • Two small tables (up to “max code length” in size) to store:

        • The last code for each length.
        • The bias between code values and their table index.
    • Supports large code lengths (up to 20 bits, required by BZip2).

  2. General binary tree decoder (o.a.c.compress.archivers.zip.BinaryTree)

  • Stores a full binary tree in an array (see the sketch after this list).
    • Can decode not only canonical Huffman codes but any prefix code, including the Shannon–Fano codes used by the IMPLODE ZIP method.
    • If we want to make this public, I’d prefer the refreshed implementation suggested by @fkjellberg in #690.
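For comparison with the canonical decoder sketched above, here is a minimal sketch of the array-backed tree approach from item 2. It is simplified and uses made-up names; it is not the actual internals of o.a.c.compress.archivers.zip.BinaryTree:

```java
import java.io.IOException;
import java.util.Arrays;

/**
 * Sketch of a prefix-code decoder backed by a binary tree flattened into
 * an array: node i has its children at 2*i+1 (bit 0) and 2*i+2 (bit 1),
 * and leaves hold the decoded symbol. Works for any prefix code, not
 * just canonical Huffman codes, at the cost of a larger table.
 */
final class ArrayTreeDecoder {

    /** Supplies one bit of the input stream at a time. */
    interface BitSource {
        int nextBit() throws IOException;
    }

    private static final int UNDEFINED = -1; // no code passes through here
    private static final int INTERNAL = -2;  // inner node, keep walking

    private final int[] tree;

    ArrayTreeDecoder(final int maxCodeLength) {
        // A complete binary tree of the maximum depth, stored level by level.
        tree = new int[(1 << (maxCodeLength + 1)) - 1];
        Arrays.fill(tree, UNDEFINED);
    }

    /** Registers a code (bits given MSB-first) for a non-negative symbol. */
    void addCode(final int code, final int length, final int symbol) {
        int node = 0;
        for (int bit = length - 1; bit >= 0; bit--) {
            tree[node] = INTERNAL;
            node = 2 * node + 1 + ((code >>> bit) & 1);
        }
        tree[node] = symbol;
    }

    /** Follows one bit per level until a leaf (or a hole) is reached. */
    int decodeSymbol(final BitSource in) throws IOException {
        int node = 0;
        while (tree[node] == INTERNAL) {
            node = 2 * node + 1 + in.nextBit();
        }
        if (tree[node] == UNDEFINED) {
            throw new IOException("invalid prefix code");
        }
        return tree[node];
    }
}
```

The trade-off versus the canonical decoder: the tree array grows exponentially with the maximum code length, but it can represent arbitrary prefix codes such as the Shannon–Fano codes used by IMPLODE.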

ppkarwasz avatar Sep 25 '25 17:09 ppkarwasz

Hi @ppkarwasz, I'm OK with waiting on this one. Are you thinking that #690 should come in first? For 1.29.0 or 1.30.0?

garydgregory avatar Sep 26 '25 21:09 garydgregory

Hi @garydgregory,

Yes, I believe #690 is more valuable and should come in first. It gives us a good opportunity to restructure the compressor code into more reusable and maintainable components. Right now much of the compressor logic is still a direct Java translation of defensive C code, which I don’t fully understand myself. That has led to duplication across classes implementing the same algorithms with C-style optimizations that don’t fit well in Java.

LHA introduces four additional compression methods, some of which may already overlap with what we have (e.g. in SevenZ). But I don’t currently have the time to review it as thoroughly as it deserves.

So my proposal would be:

  • For 1.29.0: complete the cycle of archiver improvements we’ve been working on in recent months.
  • For 1.30.0: revisit the compression algorithms to identify reusable patterns, reduce duplication, and add LHA support.

@fkjellberg: does this plan sound reasonable to you? You have much deeper expertise in compression than we do, so your perspective would be really valuable here. And thank you for your patience. I know your PR has been waiting for review longer than it should, and we’ve been tied up finishing the work from our previous roadmap. I really appreciate the time and effort you’ve already put into it.

ppkarwasz avatar Sep 27 '25 08:09 ppkarwasz

Hi @ppkarwasz,

The plan to postpone this and my LHA PR until 1.30 sounds good to me.

Like you, I also noticed that there is a lot of overlapping code in different compressors that can be refactored for better reuse and to make the code easier to understand. My plan was to create some refactoring PRs once my LHA implementation was merged, because I wanted to avoid doing too much refactoring of existing code as part of my PR.

The LHA compressor is very similar to the ExplodingInputStream in the zip package, and I reused the BinaryTree and CircularBuffer from that package; more refactoring could possibly be done to align the two implementations even further.

The deflate64 code that you started refactoring in this PR is also a good candidate for further refactoring; for example, HuffmanDecoder.DecodingMemory could possibly be replaced with CircularBuffer.
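For readers unfamiliar with the idea: both names refer to the LZ77-style sliding window that back-references copy from. Below is a minimal sketch of such a window (`SlidingWindow` and its methods are hypothetical names, not the actual CircularBuffer or DecodingMemory API; input validation and delivering decoded bytes to the caller are omitted for brevity):

```java
/**
 * Sketch of an LZ77-style sliding window backed by a circular buffer:
 * literals are appended, and back-references copy 'length' bytes that
 * start 'distance' bytes behind the current write position.
 */
final class SlidingWindow {
    private final byte[] buffer;
    private int writePos;

    SlidingWindow(final int windowSize) {
        buffer = new byte[windowSize];
    }

    /** Appends a decoded literal byte to the window. */
    void putLiteral(final byte b) {
        buffer[writePos] = b;
        writePos = (writePos + 1) % buffer.length;
    }

    /**
     * Copies a back-reference byte by byte, so overlapping copies
     * (distance < length) repeat earlier output, as LZ77 requires.
     * Assumes 0 < distance <= windowSize and that 'distance' bytes
     * have already been written.
     */
    void copyBackReference(final int distance, final int length) {
        int readPos = (writePos - distance + buffer.length) % buffer.length;
        for (int i = 0; i < length; i++) {
            putLiteral(buffer[readPos]);
            readPos = (readPos + 1) % buffer.length;
        }
    }
}
```

Sharing one such class across DEFLATE64, LHA, and IMPLODE-style decoders would remove another chunk of the duplication discussed above.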

I think this PR is a good first step in the refactoring process. Let's continue working on refactoring after the 1.29 release.

fkjellberg avatar Sep 28 '25 17:09 fkjellberg