managed-lzma
Data decompresses with C++ but throws an exception when decompressed with managed lzma
As the subject states - I have a data block 0x100043 bytes in length that decompresses fine to 0x100000 bytes in C++ (code below), but throws a bad-data exception in Lzma2Dec_DecodeToDic() at line 323 when using managed lzma.
It's possible I'm using the API incorrectly; that would be the first thing to verify.
The Lzma2Dec_DecodeToDic functions in managed lzma and the C++ LZMA 19.00 SDK are just different enough that there's no easy change to make things work. It looks like the C# is a direct port of a different version of the C++ code.
C++ decode (which works fine)
TEST_METHOD_CATEGORY( SimpleDecompressBlob, "SevenZip" )
{
Log( "7-zip decompress blob test" );
SetWorkingDirectoryForTests();
size_t source_length = 0;
uint8* compressed_memory = LoadFileToMemory( "TestData/Binary.dat", &source_length );
size_t dest_length = 1024 * 1024;
uint8* dest = static_cast< uint8* >( calloc( 1, dest_length ) );
ISzAlloc alloc = { SevenZipMalloc, SevenZipFree };
ELzmaStatus status = LZMA_STATUS_NOT_SPECIFIED;
// 12 = LZMA2 dictionary size property byte (matches DecoderSettings( 12 ) in the C# code below)
SRes result = Lzma2Decode( dest, &dest_length, compressed_memory, &source_length, 12, ELzmaFinishMode::LZMA_FINISH_END, &status, &alloc );
Assert::IsTrue( result == SZ_OK, L"Lzma2Decode returned error" );
free( compressed_memory ); // assuming LoadFileToMemory allocates with malloc
free( dest );
}
C# code, which throws an exception (for testing, I put this at the top of sandbox-7z Program.cs)
public class FLzma2Data
{
public byte[] SourceData;
public int SourceOffset = -1;
public int SourceLength = -1;
public byte[] DestinationData;
public int DestinationOffset = -1;
public int DestinationLength = -1;
public byte DictionarySize = 40;
}
[STAThread]
static void Main()
{
FLzma2Data data = new FLzma2Data();
FileInfo Info = new FileInfo( "d:/Temp/Binary.dat" );
BinaryReader reader = new BinaryReader( Info.OpenRead() );
data.SourceData = reader.ReadBytes( ( int )Info.Length );
data.SourceLength = ( int )Info.Length;
data.SourceOffset = 0;
reader.Close();
data.DestinationData = new byte[1024 * 1024];
data.DestinationLength = 1024 * 1024;
data.DestinationOffset = 0;
data.DictionarySize = 12;
ManagedLzma.LZMA2.Decoder decoder = new ManagedLzma.LZMA2.Decoder( new ManagedLzma.LZMA2.DecoderSettings( 12 ) );
decoder.Decode( data.SourceData, data.SourceOffset, data.SourceLength, data.DestinationLength, true );
decoder.ReadOutputData( data.DestinationData, data.DestinationOffset, data.DestinationLength );
}
I could send you the file, but I don't want to bog down the forum with a meg of incompressible data =)
What are your recommendations?
Cheers John
I'll have a look later today and see if I can reproduce this; I don't see any immediate mistake in the calling code you posted. If you want to speed this up, put your file somewhere you can take it back down again, send me an email, or tell me the compression settings and library you used if you compressed the data yourself. Otherwise I'll just try various settings to see if it's a general problem.
The SDK version used while porting was 9.22 (included in the repo), so yes, it's way older than 19.00, but I checked a year or two ago and there hadn't been any relevant changes to the code (which makes sense, since it needs to remain compatible). I'll check again, of course, to see if there are any recent bugfixes for edge cases.
Thanks for the quick reply!
Here’s a shareable link to the file on Google Drive - https://drive.google.com/drive/folders/1WlRGjE6C26YdhdNAKiK0bR_u9wcp8YQH?usp=sharing
I used the LZMA1900 SDK to compile a library and compress the block.
Compression settings are:
int8 level{ 2 };
int8 lc{ 3 };
int8 lp{ 0 };
int8 pb{ 3 };
int8 algo{ 1 };
int8 btMode{ 1 };
int8 numHashBytes{ 4 };
int8 numThreads{ 1 };
int8 numBlockThreads_Reduced{ 1 };
int8 numBlockThreads_Max{ 1 };
int8 numTotalThreads{ 1 };
bool writeEndMark{ false };
int16 fb{ 32 };
uint32 dictSize{ 0 };
uint32 mc{ 32 };
uint64 reduceSize{ 0x100000 };
uint64 blockSize{ 0 };
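For completeness, here's roughly how I push those values into the encoder props before compressing (a sketch from memory; I believe these match the CLzma2EncProps / CLzmaEncProps fields in the 19.00 Lzma2Enc.h and LzmaEnc.h headers, but double-check against your copy of the SDK):
CLzma2EncProps props;
Lzma2EncProps_Init( &props );
props.lzmaProps.level = level;
props.lzmaProps.lc = lc;
props.lzmaProps.lp = lp;
props.lzmaProps.pb = pb;
props.lzmaProps.algo = algo;
props.lzmaProps.btMode = btMode;
props.lzmaProps.numHashBytes = numHashBytes;
props.lzmaProps.numThreads = numThreads;
props.lzmaProps.writeEndMark = writeEndMark ? 1 : 0;
props.lzmaProps.fb = fb;
props.lzmaProps.mc = mc;
props.lzmaProps.dictSize = dictSize;     // 0 lets the SDK derive it from reduceSize
props.lzmaProps.reduceSize = reduceSize;
props.blockSize = blockSize;             // 0 = auto
props.numBlockThreads_Reduced = numBlockThreads_Reduced;
props.numBlockThreads_Max = numBlockThreads_Max;
props.numTotalThreads = numTotalThreads;
Lzma2EncProps_Normalize( &props );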
It seems that an end mark is written even though I don’t request it - that’s a problem for a different day.
Cheers John
Thanks, you can take down the file if you want.
I was able to reproduce the issue, and it is not a version difference: decoding also succeeds when using the 9.22 C++ source (i.e. the 9.22 C++ API is able to decode your file). The "problem" is that your C++ snippet is using the "One Call Interface" while ManagedLzma is using the incremental API. Replicating what the "one call interface" does allows the C# code to decode, so it's not a bug in the low-level decoding routines, but rather in how they are called.
Either LZMA2 itself has an edge-case bug in the incremental API or I'm using it incorrectly. I'll have to write C++ code that decodes your file incrementally and see if that succeeds. It might take a few days to figure everything out, sorry; I'll probably have something ready early next week.
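For reference, the incremental C++ path I have in mind looks roughly like this (a sketch against the SDK's Lzma2Dec.h; the 4 KB slice size is arbitrary, just small enough to force the incremental code paths, and dest / compressed_memory / alloc are as in your test):
#include "Lzma2Dec.h"
CLzma2Dec dec;
Lzma2Dec_Construct( &dec );
Lzma2Dec_Allocate( &dec, 12, &alloc ); // 12 = dictionary size property byte
Lzma2Dec_Init( &dec );
SizeT inPos = 0, outPos = 0;
ELzmaStatus status = LZMA_STATUS_NOT_SPECIFIED;
SRes res = SZ_OK;
while ( res == SZ_OK && status != LZMA_STATUS_FINISHED_WITH_MARK )
{
    // Feed at most 4 KB per call so the decoder has to cross chunk
    // boundaries incrementally instead of seeing the whole buffer.
    SizeT inSize = source_length - inPos < 4096 ? source_length - inPos : 4096;
    SizeT outSize = dest_length - outPos;
    res = Lzma2Dec_DecodeToBuf( &dec, dest + outPos, &outSize, compressed_memory + inPos, &inSize, LZMA_FINISH_ANY, &status );
    inPos += inSize;
    outPos += outSize;
    if ( inSize == 0 && outSize == 0 )
        break; // no progress: truncated or corrupt stream
}
Lzma2Dec_Free( &dec, &alloc );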
It seems that an end mark is written even though I don’t request it - that’s a problem for a different day.
It's been a while, but I think the "end mark" is a feature only used in LZMA, not in LZMA2. LZMA2 works by splitting the source into chunks so they can be compressed in parallel (or stored uncompressed if compression doesn't reduce the size). Either way, it stores the size of each chunk in a control code, so as far as LZMA2 is concerned it always knows where the stream ends and returns the "ended with mark" status; even if the last chunk was LZMA and ended with the "maybe ended" flag, it replaces that because it has better information.
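If you're curious, you can see those control codes directly in the compressed stream. Here's an untested sketch that walks the LZMA2 chunk headers (layout per the LZMA2 format; DumpLzma2Chunks is just a name I made up, uint8 is the same typedef as in your test, and there's no bounds checking):
#include <cstdio>
void DumpLzma2Chunks( const uint8* data, size_t size )
{
    size_t pos = 0;
    while ( pos < size )
    {
        uint8 control = data[pos++];
        if ( control == 0 )
        {
            printf( "end-of-stream control code at offset 0x%zx\n", pos - 1 );
            break;
        }
        if ( control < 0x80 )
        {
            // Control 1 or 2 (others are invalid): uncompressed chunk,
            // followed by 2-byte big-endian (size - 1).
            size_t unpacked = ( ( size_t )data[pos] << 8 | data[pos + 1] ) + 1;
            pos += 2;
            printf( "uncompressed chunk, %zu bytes\n", unpacked );
            pos += unpacked;
        }
        else
        {
            // LZMA chunk: low 5 bits of control hold bits 16..20 of (unpacked size - 1).
            size_t unpacked = ( ( size_t )( control & 0x1F ) << 16 | ( size_t )data[pos] << 8 | data[pos + 1] ) + 1;
            size_t packed = ( ( size_t )data[pos + 2] << 8 | data[pos + 3] ) + 1;
            pos += 4;
            if ( ( ( control >> 5 ) & 3 ) >= 2 )
                ++pos; // a props byte follows when the chunk resets state + props
            printf( "LZMA chunk, %zu -> %zu bytes\n", packed, unpacked );
            pos += packed;
        }
    }
}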
Found the problem: I have a bug in translating the eof argument of the Decode method when you simultaneously pass a huge input buffer. The C# decoders are built for streaming and incremental decoding, so passing a huge input buffer can exceed the capacities of the decoding algorithm; I need to detect and handle this case correctly.
Technical details: I need to translate eof because the C++ code demands knowing when the output stream ends, not when the input stream ends. This doesn't fit well with C# APIs, which are usually designed around signaling when the input stream ends, so I designed the API around that pattern.
There are two things to note:
- your snippet ignores the return value of Decode and just assumes it processed all data; this is not necessarily the case, so you'll have to write a loop to properly decode
- if you fix your code and write a loop, you can easily work around my bug by leaving eof false until only a little input remains
Sorry for not providing a "one call" decoding API but .NET is not good with large buffers. The decoder needs to allocate internal buffers and I don't like having an API which can become unstable and "sometimes" throw OOM depending on the memory fragmentation of the rest of the .NET program. Things are designed to keep buffers at reasonable sizes, but that means streaming is mandatory.
Here's an example of how the loop can look for your snippet, including a workaround for the eof detection. (I think 16 bytes are a good enough safety margin; if not, there are other ways to detect this, but really it should be done automatically, so I'm leaving it at that for now and will instead work on the proper fix.)
int pInput = 0;
int pOutput = 0;
using (var decoder = new Decoder(new DecoderSettings(12)))
{
    do
    {
        // Feed the remaining input; report eof only once fewer than 16
        // input bytes remain (the workaround described above).
        pInput += decoder.Decode(data.SourceData, data.SourceOffset + pInput, data.SourceLength - pInput, data.DestinationLength - pOutput, data.SourceLength - pInput < 16);
        // Drain whatever output the decoder has produced so far.
        pOutput += decoder.ReadOutputData(data.DestinationData, data.DestinationOffset + pOutput, data.DestinationLength - pOutput);
    }
    while (!decoder.IsOutputComplete);
}
Seems to work a treat (at least so far) – thanks!
Regarding the end mark - if I decompress that block with the C++ call:
SRes result = Lzma2Decode( dest, &dest_length, compressed_memory, &source_length, 12, ELzmaFinishMode::LZMA_FINISH_ANY, &status, &alloc );
It returns that it generated the correct amount of decompressed data (0x100000) and processed all but a single byte of the source data (0x100042 of 0x100043). If I change the finish mode to LZMA_FINISH_END then it reports that all source data has been consumed (0x100043 Bytes).
Not sure what to make of that, and it has nothing to do with managed-lzma, so just an FYI!
Cheers John
It returns that it generated the correct amount of decompressed data (0x100000) and processed all but a single byte of the source data (0x100042 of 0x100043). If I change the finish mode to LZMA_FINISH_END then it reports that all source data has been consumed (0x100043 Bytes).
Note that the returned status is LZMA_STATUS_NOT_FINISHED, meaning the decoder can't tell you whether the stream is really done or whether another chunk would begin if more data were present. That might be a design flaw in the decoder, because from looking at the code it might be able to differentiate the cases if it tried harder; but as it is, you need to use LZMA_FINISH_END to make sure you didn't observe a truncated stream.
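In other words, if you stay with the one-call interface, the safe pattern based on your own snippet is to request LZMA_FINISH_END and check the status as well (a sketch):
ELzmaStatus status = LZMA_STATUS_NOT_SPECIFIED;
SRes result = Lzma2Decode( dest, &dest_length, compressed_memory, &source_length, 12, ELzmaFinishMode::LZMA_FINISH_END, &status, &alloc );
// SZ_OK alone is not enough evidence of a complete stream; requiring the
// "finished with mark" status rules out having observed a truncated one.
bool complete = ( result == SZ_OK && status == LZMA_STATUS_FINISHED_WITH_MARK );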