[COMPRESS-706] Add support for reading LHA archive format

Open fkjellberg opened this issue 5 months ago • 13 comments

Thanks for your contribution to Apache Commons! Your help is appreciated!

Before you push a pull request, review this list:

[X] Read the contribution guidelines for this project.
[ ] Read the ASF Generative Tooling Guidance if you use Artificial Intelligence (AI).
[ ] I used AI to create any part of, or all of, this pull request.
[X] Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.
[X] Write unit tests that match behavioral changes, where the tests fail if the changes to the runtime are not applied. This may not always be possible, but it is a best-practice.
[X] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
[X] Each commit in the pull request should have a meaningful subject line and body. Note that a maintainer may squash commits during the merge process.

This is a read-only implementation of the LHA/LZH archive format. It supports a subset of the compression algorithms, covering the most commonly used ones: lh0, lh4, lh5, lh6 and lh7.

LHA header levels 0, 1 and 2 are currently supported. I also implemented header level 3, but since I have not been able to create such archives myself, I removed it for now. I may add it back in a future PR. It seems only the lha tool on OS/2 ever supported level 3 headers and I don't have access to that platform.

A large number of archives, mainly for the Amiga and Atari platforms, were downloaded from FUNET and AMINET, and most of them decompress successfully. The few that don't work are corrupt in one way or another, and the lha tools I tried cannot decompress them either. Some very old archives use lh1 compression, which this implementation does not currently support.

I tried to align the implementation with similar decompression algorithms already in the commons-compress repo and to reuse code wherever possible. The ExplodingInputStream from the zip implementation and the HuffmanDecoder from deflate64 use similar compression algorithms.

I ended up reusing the CircularBuffer from the zip implementation and refactored it into the utils package to make it available for reuse. I added checks to make sure that parts of the buffer that have not been read yet are never overwritten and that the copy distance never reaches farther back than the size of the buffer. This is a separate commit in the PR and could be reviewed and merged separately. The HuffmanDecoder from deflate64 uses DecodingMemory, which is similar to CircularBuffer, but switching it to CircularBuffer would require some refactoring of HuffmanDecoder.
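
To illustrate the two checks described above, here is a minimal sketch of a ring buffer that refuses to overwrite unread bytes and rejects copy distances larger than the buffer. The class name `CheckedRingBuffer`, the choice of exceptions and the `unread` counter are assumptions made for this example; the actual CircularBuffer in the PR may be structured differently.

```java
// Hypothetical sketch, not the CircularBuffer class from the PR.
final class CheckedRingBuffer {
    private final byte[] buffer;
    private int readIndex;
    private int writeIndex;
    private int unread; // bytes written but not yet read

    CheckedRingBuffer(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        buffer = new byte[size];
    }

    boolean available() {
        return unread > 0;
    }

    // Appends one byte; fails instead of overwriting data that was never read.
    void put(int value) {
        if (unread == buffer.length) {
            throw new IllegalStateException("buffer full: would overwrite unread data");
        }
        buffer[writeIndex] = (byte) value;
        writeIndex = (writeIndex + 1) % buffer.length;
        unread++;
    }

    // Returns the next unread byte as 0..255, or -1 if nothing is buffered.
    int get() {
        if (unread == 0) {
            return -1;
        }
        int value = buffer[readIndex] & 0xFF;
        readIndex = (readIndex + 1) % buffer.length;
        unread--;
        return value;
    }

    // LZ77-style back-reference: re-emits 'length' bytes starting 'distance' bytes back.
    void copy(int distance, int length) {
        if (distance < 1 || distance > buffer.length) {
            throw new IllegalArgumentException("distance outside buffer: " + distance);
        }
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        for (int i = 0; i < length; i++) {
            // writeIndex advances in put(), so the source position advances with it.
            int src = (writeIndex - distance + buffer.length) % buffer.length;
            put(buffer[src]);
        }
    }
}
```

With an 8-byte buffer, writing `a`, `b`, `c` and then calling `copy(3, 3)` leaves `abcabc` available to read, while `copy(9, 1)` is rejected because the distance exceeds the buffer size.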

The ExplodingInputStream and deflate64's HuffmanDecoder both use binary trees for Huffman decoding. I ended up basing my code on the BinaryTree from ExplodingInputStream, but since the storage and construction of the actual tree differ, I copied the code into the LHA implementation and kept it package-private for now. In a future refactoring, the common code of this class could move to the utils package, with the tree-building code kept in each implementation package. The HuffmanDecoder could possibly use such a refactored BinaryTree implementation as well.
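
For readers unfamiliar with the approach, the following is a generic sketch of binary-tree Huffman decoding using an array-backed tree, where the children of node `i` sit at `2i + 1` and `2i + 2`. It is not the BinaryTree class from the PR or from ExplodingInputStream: the class name `HuffmanTreeSketch`, the `addLeaf` and `decode` signatures and the bit source are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.function.IntSupplier;

// Hypothetical sketch of array-backed Huffman decoding, not code from the PR.
final class HuffmanTreeSketch {
    private static final int UNDEFINED = -1; // slot not used by any code
    private static final int NODE = -2;      // inner node; leaves hold the symbol (>= 0)

    private final int maxDepth;
    private final int[] tree;

    HuffmanTreeSketch(int maxDepth) {
        if (maxDepth < 1 || maxDepth > 16) {
            throw new IllegalArgumentException("maxDepth must be in 1..16");
        }
        this.maxDepth = maxDepth;
        this.tree = new int[(1 << (maxDepth + 1)) - 1];
        Arrays.fill(tree, UNDEFINED);
    }

    // Registers 'symbol' under the given code, read from its most significant bit.
    void addLeaf(int code, int length, int symbol) {
        if (length < 1 || length > maxDepth || symbol < 0) {
            throw new IllegalArgumentException("invalid code length or symbol");
        }
        int node = 0;
        for (int bit = length - 1; bit >= 0; bit--) {
            if (tree[node] >= 0) {
                throw new IllegalArgumentException("code passes through an existing leaf");
            }
            tree[node] = NODE;
            node = 2 * node + 1 + ((code >>> bit) & 1);
        }
        if (tree[node] != UNDEFINED) {
            throw new IllegalArgumentException("code collides with an existing entry");
        }
        tree[node] = symbol;
    }

    // Walks the tree one bit at a time until a leaf is reached.
    int decode(IntSupplier nextBit) {
        int node = 0;
        while (true) {
            int value = tree[node];
            if (value >= 0) {
                return value;
            }
            if (value == UNDEFINED) {
                throw new IllegalStateException("bit sequence does not match any code");
            }
            node = 2 * node + 1 + (nextBit.getAsInt() & 1);
        }
    }
}
```

With the codes `0` for A, `10` for B and `11` for C, the bit sequence `0 1 0 1 1` decodes to A, B, C. The array layout keeps the lookup allocation-free during decoding, at the cost of reserving space for a full tree of the maximum depth.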

I also added a CRC-16 implementation. It is needed in both the archiver and the compressor packages so I put it in the utils package as a public class.
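
As an illustration of what such a class can look like, here is a table-driven sketch that implements `java.util.zip.Checksum`. It assumes the CRC-16/ARC variant (reflected polynomial 0xA001, initial value 0), which is the variant conventionally associated with LHA; the description above does not state the parameters, and the class name `Crc16ArcSketch` is hypothetical.

```java
import java.util.zip.Checksum;

// Hypothetical sketch assuming CRC-16/ARC; not the class added in the PR.
final class Crc16ArcSketch implements Checksum {
    private static final int[] TABLE = new int[256];

    static {
        // Precompute the effect of each possible input byte (LSB-first processing).
        for (int i = 0; i < 256; i++) {
            int crc = i;
            for (int k = 0; k < 8; k++) {
                crc = (crc & 1) != 0 ? (crc >>> 1) ^ 0xA001 : crc >>> 1;
            }
            TABLE[i] = crc;
        }
    }

    private int crc; // starts at 0 for this variant

    @Override
    public void update(int b) {
        crc = (crc >>> 8) ^ TABLE[(crc ^ b) & 0xFF];
    }

    @Override
    public void update(byte[] b, int off, int len) {
        for (int i = off; i < off + len; i++) {
            update(b[i]);
        }
    }

    @Override
    public long getValue() {
        return crc;
    }

    @Override
    public void reset() {
        crc = 0;
    }
}
```

Feeding the ASCII bytes of `123456789` into this sketch yields 0xBB3D, the standard check value for CRC-16/ARC. Implementing `Checksum` lets the class be used with existing JDK helpers such as `java.util.zip.CheckedInputStream`.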

Aug 17 '25 15:08 fkjellberg