K4os.Compression.LZ4

Calculate decompress buffer size

Open UkooLabs opened this issue 4 years ago • 6 comments

Is it possible to add a method that pretends to decompress data and thus works out buffer size needed for decompression?

UkooLabs avatar May 23 '20 05:05 UkooLabs

It is an interesting idea. It could work by decompressing the data without actually writing anything, but no such method exists in the original implementation, so it would need to be written from scratch. It would also double the time needed for decompression, going slightly against the rationale behind a "fast compression algorithm". What actual scenario prevents you from using LZ4Stream or LZ4Pickle, or just storing an extra 4 bytes?

I would also be interested in what @Cyan4973 thinks about it.

MiloszKrajewski avatar May 23 '20 16:05 MiloszKrajewski

@MiloszKrajewski

This came about while writing a library to parse Pixar's Usdc file format. Certain parts of the data are compressed, and they give you the compressed buffer size but not the uncompressed size. In their C++ source code they have a calculation that generates the size of the workspace needed for decompression. However, I was thinking it would be much cleaner to know how much is really needed.

This is my C# library, which is a port of tinyusdz:

https://github.com/UkooLabs/UsdzSharpie

The calculation they use is in:

https://github.com/UkooLabs/UsdzSharpie/blob/master/UsdzSharpie/UsdcReader.cs

    public ulong GetEncodedBufferSize(ulong count)
    {
        return count > 0 ? (sizeof(int)) + ((count * 2 + 7) / 8) + (count * sizeof(int)) : 0;
    }

But to me it seems a bit like black magic...

I will have a look into LZ4Stream. I was not sure whether it would automatically grow the buffer as needed; if so, that would be an ideal candidate.
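One possible reading of the formula above, sketched with each term named. The interpretation of the terms (a shared 4-byte value, 2 bits of code per integer, and at most 4 bytes per integer) is an assumption about the USD integer encoding, not something confirmed in this thread, and the class name `EncodedSizeExample` is hypothetical. Under that reading the result is a worst-case size for `count` encoded integers, which is why it can serve as the workspace size.

```csharp
using System;

public static class EncodedSizeExample
{
    // Same arithmetic as GetEncodedBufferSize above, split into named terms.
    public static ulong GetEncodedBufferSize(ulong count)
    {
        if (count == 0) return 0;

        ulong commonValueBytes = sizeof(int);     // assumed: one shared 4-byte value
        ulong codeBytes = (count * 2 + 7) / 8;    // assumed: 2 bits per integer, rounded up to whole bytes
        ulong maxIntBytes = count * sizeof(int);  // assumed: at most 4 bytes stored per integer

        return commonValueBytes + codeBytes + maxIntBytes;
    }
}
```

For example, `count = 10` gives 4 + 3 + 40 = 47 bytes, and `count = 1` gives 4 + 1 + 4 = 9 bytes.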

UkooLabs avatar May 23 '20 19:05 UkooLabs

@MiloszKrajewski it looks like I can't use LZ4Stream.Decode as a substitute for LZ4Codec.Decode, as it complains 'LZ4 frame magic number expected'.

UkooLabs avatar May 23 '20 20:05 UkooLabs

The LZ4 block format expects the host system to handle metadata, such as compressed size or decompressed size. It doesn't provide any facility for it, but depends on it.

Note that the LZ4 block format doesn't require the exact decompressed size, only an upper bound on it. In many applications, this is implied by the application context. For example, maybe it operates on blocks of at most 64 KB, so the decompressed size can't be larger than 64 KB. This is enough information for the LZ4 block decoder.

On the other hand, the LZ4 frame format does contain metadata, and can figure out how to decompress any compressed blob given a stream to flush data into. But, this is a different format. They are not interchangeable.

Cyan4973 avatar May 23 '20 20:05 Cyan4973

Hi @Cyan4973. Sorry, two questions were mixed together. The question I was trying to get you involved in was: a method to find out the uncompressed block length by decompressing it without writing anything (so no memory needs to be allocated). I guess it can be done by adapting "LZ4_decompress_generic" somehow, but maybe there is a faster way to do it.

MiloszKrajewski avatar May 25 '20 13:05 MiloszKrajewski

Ah yes, one could create a custom "decoder" for it, which just reads the tokens, extracting the length fields and adding them together to return the total length, without ever generating anything.

Such a function currently doesn't exist in the LZ4 reference library.
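A minimal sketch of such a length-only "decoder" in C#, assuming a raw LZ4 block with no external dictionary. The class `Lz4BlockLength` and its methods are hypothetical; they are not part of K4os.Compression.LZ4 or the LZ4 reference library. It walks each sequence (token, optional length extension bytes, literals, 2-byte offset, optional match length extension) and sums the literal and match lengths without writing any output.

```csharp
using System;

public static class Lz4BlockLength
{
    // Hypothetical helper: returns the decoded length of a raw LZ4 block,
    // or -1 if the input is empty, truncated, or malformed.
    public static long GetDecodedLength(ReadOnlySpan<byte> block)
    {
        long total = 0;
        int pos = 0;

        while (pos < block.Length)
        {
            int token = block[pos++];

            // Literal length: high nibble, extended by extra bytes when it is 15.
            long literals = token >> 4;
            if (literals == 15 && !ReadExtension(block, ref pos, ref literals)) return -1;
            if (literals > block.Length - pos) return -1;
            pos += (int)literals; // skip the literal bytes, only count them
            total += literals;

            // The last sequence ends right after its literals.
            if (pos == block.Length) return total;

            // Two-byte little-endian offset. Zero is invalid, and without a
            // dictionary it cannot reach back before the start of the output.
            if (block.Length - pos < 2) return -1;
            int offset = block[pos] | (block[pos + 1] << 8);
            pos += 2;
            if (offset == 0 || offset > total) return -1;

            // Match length: low nibble plus optional extension, plus the minimum match of 4.
            long match = token & 0x0F;
            if (match == 15 && !ReadExtension(block, ref pos, ref match)) return -1;
            total += match + 4;
        }

        // Empty input, or a block that ends on a match instead of literals.
        return -1;
    }

    // Adds extension bytes to length; a byte below 255 terminates the run.
    private static bool ReadExtension(ReadOnlySpan<byte> block, ref int pos, ref long length)
    {
        byte b;
        do
        {
            if (pos >= block.Length) return false;
            b = block[pos++];
            length += b;
        } while (b == 255);
        return true;
    }
}
```

A caller could then size the target buffer from the result, for instance `var size = Lz4BlockLength.GetDecodedLength(compressed);`, and pass a buffer of that size to the block decoder. This still reads every token in the input, so it is a second pass over the compressed data, although it copies nothing.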

Cyan4973 avatar May 25 '20 13:05 Cyan4973