K4os.Compression.LZ4
Chained encoder does not produce the same result as the lz4 cli when chaining is enabled
Description During a test of the lz4 encoder, I saw differences in the encoded output compared to the lz4 cli when block chaining is enabled.
To reproduce Consider the attached binary input file testfile.zip. The file is zip-compressed and must be decompressed first. Processing the file with the lz4 cli produces the following checksums:
lz4 -v -B4 -BI -1 --no-frame-crc testfile.bin testfile.lz4-1-independent-expected
lz4 -v -B4 -BD -1 --no-frame-crc testfile.bin testfile.lz4-1-chained-expected
lz4 -v -B4 -BI -3 --no-frame-crc testfile.bin testfile.lz4-3-independent-expected
lz4 -v -B4 -BD -3 --no-frame-crc testfile.bin testfile.lz4-3-chained-expected
sha1sum testfile.*
a1938254adb8d00835b5bd7a63d51499ddb9c3af testfile.bin
58e6b6fe0f76de620d78d8afd8e19539b4fa0289 testfile.lz4-1-chained-expected
6bf34dc13fd8f7102b506c810c6f975751f1e236 testfile.lz4-1-independent-expected
d12c77f9ce1a44996e5679ef2f873928369dee7e testfile.lz4-3-chained-expected
4522410e07a080d555c90680c3e8d00a39b1e002 testfile.lz4-3-independent-expected
Now consider this reproducing example program:
using K4os.Compression.LZ4;
using K4os.Compression.LZ4.Internal;
using K4os.Compression.LZ4.Streams;

Console.WriteLine("Hello, World!");

using (var source = File.OpenRead("testfile.bin"))
{
    // lz4 -v -B4 -BI -1 --no-frame-crc testfile.bin testfile.lz4-1-independent-expected
    using (var actual = LZ4Stream.Encode(File.Create("testfile.lz4-1-independent-actual"), new LZ4EncoderSettings()
    {
        ChainBlocks = false,
        BlockSize = Mem.K64,
        CompressionLevel = LZ4Level.L00_FAST,
    }))
    {
        source.Position = 0;
        source.CopyTo(actual);
    }
    PrintComparison("testfile.lz4-1-independent-expected", "testfile.lz4-1-independent-actual");

    // lz4 -v -B4 -BD -1 --no-frame-crc testfile.bin testfile.lz4-1-chained-expected
    using (var actual = LZ4Stream.Encode(File.Create("testfile.lz4-1-chained-actual"), new LZ4EncoderSettings()
    {
        ChainBlocks = true,
        BlockSize = Mem.K64,
        CompressionLevel = LZ4Level.L00_FAST,
    }))
    {
        source.Position = 0;
        source.CopyTo(actual);
    }
    PrintComparison("testfile.lz4-1-chained-expected", "testfile.lz4-1-chained-actual");

    // lz4 -v -B4 -BI -3 --no-frame-crc testfile.bin testfile.lz4-3-independent-expected
    using (var actual = LZ4Stream.Encode(File.Create("testfile.lz4-3-independent-actual"), new LZ4EncoderSettings()
    {
        ChainBlocks = false,
        BlockSize = Mem.K64,
        CompressionLevel = LZ4Level.L03_HC,
    }))
    {
        source.Position = 0;
        source.CopyTo(actual);
    }
    PrintComparison("testfile.lz4-3-independent-expected", "testfile.lz4-3-independent-actual");

    // lz4 -v -B4 -BD -3 --no-frame-crc testfile.bin testfile.lz4-3-chained-expected
    using (var actual = LZ4Stream.Encode(File.Create("testfile.lz4-3-chained-actual"), new LZ4EncoderSettings()
    {
        ChainBlocks = true,
        BlockSize = Mem.K64,
        CompressionLevel = LZ4Level.L03_HC,
    }))
    {
        source.Position = 0;
        source.CopyTo(actual);
    }
    PrintComparison("testfile.lz4-3-chained-expected", "testfile.lz4-3-chained-actual");
}

static void PrintComparison(string expectedFile, string actualFile)
{
    var expected = File.ReadAllBytes(expectedFile);
    var actual = File.ReadAllBytes(actualFile);
    if (expected.SequenceEqual(actual))
    {
        Console.WriteLine($"The files {expectedFile} and {actualFile} are the same.");
    }
    else
    {
        Console.Error.WriteLine($"The files {expectedFile} and {actualFile} are NOT the same!");
    }
}
It will produce this output:
Hello, World!
The files testfile.lz4-1-independent-expected and testfile.lz4-1-independent-actual are the same.
The files testfile.lz4-1-chained-expected and testfile.lz4-1-chained-actual are NOT the same!
The files testfile.lz4-3-independent-expected and testfile.lz4-3-independent-actual are the same.
The files testfile.lz4-3-chained-expected and testfile.lz4-3-chained-actual are NOT the same!
This can be verified from the file checksums:
sha1sum bin/Debug/net6.0/testfile.*
a1938254adb8d00835b5bd7a63d51499ddb9c3af bin/Debug/net6.0/testfile.bin
af7d8eee3d20a43a6553f4cb7bf960cf9920791b bin/Debug/net6.0/testfile.lz4-1-chained-actual
58e6b6fe0f76de620d78d8afd8e19539b4fa0289 bin/Debug/net6.0/testfile.lz4-1-chained-expected
6bf34dc13fd8f7102b506c810c6f975751f1e236 bin/Debug/net6.0/testfile.lz4-1-independent-actual
6bf34dc13fd8f7102b506c810c6f975751f1e236 bin/Debug/net6.0/testfile.lz4-1-independent-expected
85f5c92294dfbc6df1f0247410c3888aaa55caf9 bin/Debug/net6.0/testfile.lz4-3-chained-actual
d12c77f9ce1a44996e5679ef2f873928369dee7e bin/Debug/net6.0/testfile.lz4-3-chained-expected
4522410e07a080d555c90680c3e8d00a39b1e002 bin/Debug/net6.0/testfile.lz4-3-independent-actual
4522410e07a080d555c90680c3e8d00a39b1e002 bin/Debug/net6.0/testfile.lz4-3-independent-expected
Expected behavior The expected behavior is that the encoder produces the same result as the lz4 cli when chaining is enabled.
Actual behavior The encoder results are not the same.
Environment
- CPU: AMD Ryzen 7
- OS: Windows and Ubuntu - tested on both
- .NET: net6
- LZ4: 1.2.16
Additional context I have tried the lz4 cli on both Windows and Linux; it produces the same result on both platforms.
Is the output compatible? Can it be compressed with one and decompressed with the other? If yes, then this might be interesting but low priority.
I wrote the chaining code myself (just from the spec), so it might behave a little bit differently. Also, it can use the x86 or x64 encoder (which do produce different results), so this is another thing to keep in mind.
So, my question is: regardless of the output being different, is it compatible?
Yes, they seem compatible, at least for the set of files that I have tested. The lz4 cli seems to be able to decompress the encoded files.
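For the other direction (a file written by the lz4 cli and decoded by this library), a check could look roughly like the sketch below. It only uses `LZ4Stream.Decode` from K4os.Compression.LZ4.Streams; the class and method names are made up for illustration, and it assumes the files from the reproduction above are in the working directory.

```csharp
using System;
using System.IO;
using System.Linq;
using K4os.Compression.LZ4.Streams;

// Hypothetical helper, not part of the library.
public static class Lz4CompatCheck
{
    // Decodes an LZ4 frame file with the library and compares the result
    // byte for byte against the original, uncompressed file.
    public static bool DecodesTo(string lz4Path, string originalPath)
    {
        using var decoder = LZ4Stream.Decode(File.OpenRead(lz4Path));
        using var decoded = new MemoryStream();
        decoder.CopyTo(decoded);
        return decoded.ToArray().SequenceEqual(File.ReadAllBytes(originalPath));
    }
}
```

Usage would be something like `Lz4CompatCheck.DecodesTo("testfile.lz4-1-chained-expected", "testfile.bin")`. A `true` result for both the cli-produced and the library-produced chained files would suggest the difference is only in encoder choices, not in the format.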
I found this when I was implementing #14. The implementation is here: https://gist.github.com/rmja/98dc7e0576c933faa0a75629b46af71c
For this I created a bunch of different random files for testing and found the issue that way.
What about sizes?
They are not the same size. It seems as if this library produces a 1 byte smaller file in the tests that I have made. This is a diff from the last block:
The highlighted bytes are the block header. This library produces a length of 0x00001F65 bytes and the cli produces 0x00001F66 bytes.
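To compare the block lengths of the two files without a hex editor, the block headers can be listed with a small reader like the sketch below. It assumes the layout from the LZ4 frame format description (4-byte magic, FLG and BD bytes, optional content size and dictionary ID, one header checksum byte, then blocks with a 4-byte little-endian size whose top bit marks an uncompressed block). It does not validate any checksums, and the names are hypothetical.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for inspecting an LZ4 frame; not part of the library.
public static class Lz4FrameBlocks
{
    public static List<(long Offset, int Length, bool Uncompressed)> ReadBlockHeaders(byte[] frame)
    {
        var span = frame.AsSpan();
        if (BinaryPrimitives.ReadUInt32LittleEndian(span) != 0x184D2204u)
            throw new InvalidDataException("Not an LZ4 frame.");

        byte flg = frame[4];
        bool blockChecksum = (flg & 0x10) != 0;  // 4 extra bytes after each block
        bool hasContentSize = (flg & 0x08) != 0; // 8 extra bytes in the frame header
        bool hasDictId = (flg & 0x01) != 0;      // 4 extra bytes in the frame header

        // magic + FLG/BD + optional fields + header checksum byte
        int pos = 4 + 2 + (hasContentSize ? 8 : 0) + (hasDictId ? 4 : 0) + 1;

        var blocks = new List<(long Offset, int Length, bool Uncompressed)>();
        while (pos + 4 <= frame.Length)
        {
            uint raw = BinaryPrimitives.ReadUInt32LittleEndian(span.Slice(pos));
            if (raw == 0) break;                 // EndMark
            int length = (int)(raw & 0x7FFFFFFFu);
            blocks.Add((pos, length, (raw & 0x80000000u) != 0));
            pos += 4 + length + (blockChecksum ? 4 : 0);
        }
        return blocks;
    }
}
```

Running it over both `testfile.lz4-1-chained-expected` and `testfile.lz4-1-chained-actual` and comparing the two lists should show which blocks differ in length; a header of 0x00001F65 is stored as the bytes `65 1F 00 00`.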
The following is the diff within the block:
It is somewhere in the middle of the block.
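To pin down where the two outputs first diverge, a plain byte comparison is enough. The following is a minimal sketch with hypothetical names; it uses a simple loop so that it also builds on net6.

```csharp
using System;

// Hypothetical helper, not part of the library.
public static class ByteDiff
{
    // Returns the index of the first differing byte, the length of the
    // shorter input if one is a prefix of the other, or -1 if both are equal.
    public static long FirstDifference(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        int common = Math.Min(a.Length, b.Length);
        for (int i = 0; i < common; i++)
        {
            if (a[i] != b[i]) return i;
        }
        return a.Length == b.Length ? -1 : common;
    }
}
```

Calling `ByteDiff.FirstDifference(File.ReadAllBytes(expectedFile), File.ReadAllBytes(actualFile))` gives a file offset that can then be matched against the block boundaries to see how far into the block the first difference sits.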