Question about Random Access and Decoding blocks
Hi, i wanted to ask because i see no examples or documention of how to do this (maybe i didnt see it).
But i want to be able to seek and decompress individual blocks inside XZ files. I already have the initial code to parse the XZ files and get a list of where every block is on the stream as well as the code to get to the block me looking for when seeking into the compressed data. Now i have these blocks i need to decode.
That is the part i got stuck because is not clear to me exactly what i need to do.
Any help would be appreciated.
This is one of the features that were planned early but a proper API still hasn't been implemented. :-( One has to use the low level functions which is complicated especially because liblzma APIs split Block decoding into a few steps. It makes sense for liblzma's internal use but not so much in the public API.
I wrote a demo which hopefully is helpful:
https://tukaani.org/xz/decode_block_at_offset.c
The decode_block() function likely looks a little cryptic. I suppose I should finally extend the single-threaded .xz decoder to support seeking.
I don't know if you want to seek by uncompressed offset or by Block number. lzma_index_locate() helps you find an uncompressed offset. There's no function to seek by Block number. Looping with lzma_index_iter_next() sort of works to find a Block number, but it's silly and a bit inefficient. I should add an API for that.
For comparison, XZ for Java has had neat API (SeekableXZInputStream) for years. It supports seeking based on both uncompressed offset and Block number.
Yesterday i was able to "get it working" by trial and error before i read your message. Ill check you code as mine is probably very inneficient i was just trying to get it working to get a proof of concept.
for (int i = 0; i < blocks; i++) { xz_block d_blk = block_list[i];
lzma_block block; block.filters = (lzma_filter*)malloc(sizeof(lzma_filter) * 8); block.check = LZMA_CHECK_CRC32; fseek(myfile, d_blk.position, SEEK_SET); auto header_size = varInt(myfile); header_size = lzma_block_header_size_decode(header_size); std::cout << "Block Header Size: " << header_size << "\n"; block.header_size = header_size; uint8_t* header = (uint8_t*)malloc(header_size); fseek(myfile, d_blk.position, SEEK_SET); fread(header, 1, header_size, myfile); std::cout << "Block Header Decode: " << lzma_block_header_decode(&block, NULL, header) << "\n"; uint8_t* inbuf = (uint8_t*)malloc(d_blk.size); uint8_t* outbuf = (uint8_t*)malloc(d_blk.uncompressed_size); fread(inbuf, 1, d_blk.size, myfile); lzma_stream stream = LZMA_STREAM_INIT; stream.allocator = NULL; stream.next_in = inbuf; stream.next_out = outbuf; stream.avail_out = d_blk.uncompressed_size; stream.avail_in = d_blk.size; std::cout << "Block Block Decoder: " << lzma_block_decoder(&stream, &block) << "\n"; auto ret = lzma_code(&stream, LZMA_RUN); std::cout << "LZMA CODE: " << ret << "\n"; std::cout << "TOTAL IN: " << stream.total_in << "\n"; std::cout << "TOTAL OUT: " << stream.total_out << "\n"; std::cout << "\n\n\n\nBlock N° " << i << " Data: " << outbuf << "\n";}
There im decoding all blocks in order, thats not what i want i just wanted to test it could decode all blocks in the file (i use a long compressed json so i can see the decoded results in redeable format), i can just pick any block in the chain and it will work.
If i may i have some questions.
lzma_stream stream = LZMA_STREAM_INIT;
This i dont need to re-init or declare it for each block i want to decode right? i could do it once per file?
And for lzma_block
I really need to decode the header for every block? Header size, block flags, filter list is all the same for all of the blocks? I know the diference is the compressed and uncompressed size, but i can get and update that data from the index, CRC32 is going to be wrong, but using block check none and ignore_check=1 should be fine in this case? Seems to work. I dont know from where to get the block.check either, there i hardcoded to LZMA_CHECK_CRC32 because i knew it where the correct one.
And the filter list size block.filters = (lzma_filter*)malloc(sizeof(lzma_filter) * 8); This works but i dont really know what is the correct value here.
Im also a bit confused on when i need to use the padded block size and the real block size, i know i should use the padded size to read the blocks into memory, but when i pass "stream.avail_in" i should use the padded or real size here?
-updated code based on the ideas i wrote here, this seems to work-
bool header_decoded = false; lzma_block block; lzma_stream stream = LZMA_STREAM_INIT; stream.allocator = NULL; for (int i = 0; i < blocks; i++) { xz_block d_blk = block_list[i]; if (!header_decoded) { fseek(myfile, d_blk.position, SEEK_SET); auto header_size = varInt(myfile); header_size = lzma_block_header_size_decode(header_size); std::cout << "Block Header Size: " << header_size << "\n"; block.header_size = header_size; uint8_t* header = (uint8_t*)malloc(header_size); fseek(myfile, d_blk.position, SEEK_SET); fread(header, 1, header_size, myfile); block.filters = (lzma_filter*)malloc(sizeof(lzma_filter) * 8); block.check = LZMA_CHECK_NONE; std::cout << "Block Header Decode: " << lzma_block_header_decode(&block, NULL, header) << "\n"; free(header); header_decoded = true; } else { block.compressed_size = d_blk.compressed_size; block.uncompressed_size = d_blk.uncompressed_size; fseek(myfile, d_blk.position+block.header_size, SEEK_SET); } uint8_t* inbuf = (uint8_t*)malloc(d_blk.compressed_size_padded); uint8_t* outbuf = (uint8_t*)malloc(d_blk.uncompressed_size); fread(inbuf, 1, d_blk.compressed_size_padded, myfile); stream.next_in = inbuf; stream.next_out = outbuf; stream.avail_out = d_blk.uncompressed_size; stream.avail_in = d_blk.compressed_size_padded; std::cout << "Block Block Decoder: " << lzma_block_decoder(&stream, &block) << "\n"; auto ret = lzma_code(&stream, LZMA_RUN); std::cout << "LZMA CODE: " << ret << "\n"; std::cout << "TOTAL IN: " << stream.total_in << "\n"; std::cout << "TOTAL OUT: " << stream.total_out << "\n"; std::cout << "\n\n\n\nBlock N° " << i << " Data: " << outbuf << "\n"; free(inbuf); free(outbuf); } free(block.filters);
Thanks, ill check your example.
lzma_stream stream = LZMA_STREAM_INIT;This i dont need to re-init or declare it for each block i want to decode right? i could do it once per file?
Correct. It's enough to init lzma_stream once and then initialize or reinitialize it multiple times. Once it's not needed anymore, lzma_end frees the memory. That is, one doesn't need to even lzma_free between reinitializations.
If you always decode the entire Block at once, lzma_block_buffer_decode can save a few lines of code (not many). Using the method with lzma_stream and reusing the same stream might be slightly more efficient when decoding many tiny Blocks because memory allocations may be reused.
I really need to decode the header for every block?
Yes because in a generic situation you don't know that all blocks have the same header. Even if the settings were the same, the header size could differ.
Decoding the Block Header for one Block and then hoping that the result works for all Blocks might work quite often, but it's just begging for trouble to occur some day later. You aren't making the code faster or simpler by reusing the result.
Header size, block flags, filter list is all the same for all of the blocks?
Usually yes but it depends on how the file was created.
CRC32 is going to be wrong, but using block check none and ignore_check=1 should be fine in this case?
CRC32 in Block Header is verified by lzma_block_header_decode even if ignore_check is true.
Block decoder verifies the integrity check of the uncompressed data at the of the Block before returning LZMA_STREAM_END. This verification is skipped if ignore_check is true (and version >= 1).
If verifying the data integrity isn't needed, setting ignore_check to true can make decoding a tiny bit faster, but often it's not enough to be worth it.
The correct value for lzma_block.check comes from Stream Header. See my example program.
block.filters = (lzma_filter*)malloc(sizeof(lzma_filter) * 8); This works but i dont really know what is the correct value here.
It's LZMA_FILTERS_MAX + 1. It's documented in the note in block.h.
If you allocate lzma_block on stack (local variable), I suggest you allocate the filters array on stack too.
Don't forget lzma_filters_free to avoid memory leaks. lzma_block_header_decode allocates memory in filters[i].options. lzma_block_decoder needs that info. After lzma_block_decoder has returned, the memory can be freed with lzma_filters_free (it only frees the options, not filters[] array itself). My example calls lzma_filters_free only after decoding but it doesn't need to be delayed like that.
Im also a bit confused on when i need to use the padded block size and the real block size, i know i should use the padded size to read the blocks into memory, but when i pass "stream.avail_in" i should use the padded or real size here?
It's fine to be confused here, the meanings of the sizes in the format and API are kind of weird. Using lzma_index_iter iter:
-
iter->block.total_sizeis how many bytes of input are needed to decode the entire Block, including Block Header and verification of integrity check. Because Block Header is decoded separately, you want:stream.avail_in = iter->block.total_size - header_size;My example reads input in chunks, thus it sets the values slightly differently, but the end result is the same.
If you only read
unpadded_size, you can still decode the all data but cannot verify the integrity check. Also, it means that you might not getLZMA_STREAM_ENDeven if you setignore_check = true. WithoutLZMA_STREAM_ENDyou don't know if the entire Block was indeed decoded, but it only matters if you want to verify that the input is not corrupt. -
After successful decoding, one can verify that
iter->block.unpadded_sizematches the decoded Block. My example does it right before callinglzma_endindecode_block. Again, this is just to verify that the file isn't corrupt. -
iter->block.uncompressed_sizecan be verified by the Block decoder. My example setsblock.uncompressed_size = iter->block.uncompressed_size;right before initializing the Block decoder with
lzma_block_decoder. This way one doesn't need to verify it oneself like done withunpadded_size. Once again, this check is just to catch corrupt files where Blocks and Index don't match.
Thanks, ill check your example.
I hope it helps. It's confusing and messy for a few reasons. :-( liblzma was supposed to have a high level API for this over a decade ago. Maybe I could get it finally implemented this year... Would you be interested in testing it?
Feel free to ask if something is unclear.
Ok, checking your code helped understanding more about the whole process.
I arrived to this, still unfinished function to extract X bytes of offset Y of the file. This seems to be working so far.
size_t xz_stream_read_random_access(FILE* file_in, lzma_check block_check_id, xz_block* block_list, uint8_t* bytes_out, size_t offset, size_t length) { size_t written_bytes = 0;
/* Special cases */ if (length == 0) return 0; /* Its the real block size set in the compressor or lower in case that the file is smaller than the block size, works either way */ size_t block_size = block_list[0].uncompressed_size; /* The blocks [currentBlock, endBlock] contain the data we want */ uint32_t currentBlock = offset / block_size; uint32_t endBlock = ((offset + length - 1) / block_size) + 1; /*TODO: Check we are reading a valid block range, ex: we are not going over the max blocks and current block is >= 0*/ /* Seek to the first block to read */ fseek(file_in, block_list[currentBlock].position, SEEK_SET); offset = offset % block_size; /* Init stuff needed for block decoding */ lzma_filter block_filters[LZMA_FILTERS_MAX + 1]; block_filters[LZMA_FILTERS_MAX].id = LZMA_VLI_UNKNOWN; lzma_stream stream = LZMA_STREAM_INIT; stream.allocator = NULL; /* Start decoding */ for (; currentBlock < endBlock; ++currentBlock) { /* TODO: Implement optimization, keep the last decoded block bytes and block number in memory from the last time this function has been called on the current file, this way we can skip the file reading and block decoding if we need the requested block is the same as last time. IT WILL BE BAD for huge block sizes. */ /* Reserve memory for block decoding */ uint8_t* inbuf = (uint8_t*)malloc(block_list[currentBlock].compressed_size_padded); uint8_t* outbuf = (uint8_t*)malloc(block_list[currentBlock].uncompressed_size); /* Init block */ lzma_block block; block.filters = block_filters; block.check = block_check_id; // set from the file header /* Read the entire block into a buffer */ fread(inbuf, 1, block_list[currentBlock].compressed_size_padded, file_in); /*TODO: Check the if we read OK */ auto header_size = lzma_block_header_size_decode(inbuf[0]); /*TODO: Check header min/max size*/ block.header_size = header_size; auto reth = lzma_block_header_decode(&block, NULL, inbuf); /* TODO: Check result and if decoded block sizes matches the ones in the index.*/ stream.next_in = inbuf + header_size; stream.next_out = outbuf; stream.avail_out = block_list[currentBlock].uncompressed_size; stream.avail_in = block_list[currentBlock].compressed_size_padded - header_size; auto retb = lzma_block_decoder(&stream, &block); /*TODO: Check result*/ auto ret = lzma_code(&stream, LZMA_RUN); /* TODO: Check result and if the total_out is what is expected here*/ /* Write out the part of the data we care about */ size_t copyLength = (length < (stream.total_out - offset) ? length : stream.total_out - offset); /* TODO: check we are not overruning the bytes_out buffer*/ memcpy(bytes_out + written_bytes, outbuf + offset, copyLength); /*TODO: Check memcpy result*/ written_bytes += copyLength; offset = 0; length -= copyLength; free(inbuf); free(outbuf); stream = LZMA_STREAM_INIT; } lzma_end(&stream); lzma_filters_free(block_filters, NULL); return written_bytes;}
Note that im still using my own block index and stream parsing of the header, footer and index.
Im not sure ill be able to use lzma_file_info_decoder() in my case because im do not know if i can use the index interator in the way i want (like in my code). This is enoght to me to make a test implementation in the open source game i want to get this added, (im worried about loading times). I made this same work for LZ4 some years ago, thats the reason i need to share some stuff, and i might not be able to use a complete high level api call.
But im open to help testing if it is done.
One more question, block decoding is multithreaded? Or i need to set my own kind of MT, like decoding multiple blocks at once?
In the end i got another implementation from your example and based on my last code for random access using all api functions to parse the file with lzma_file_info_decoder() and the index iterator.
Now ill need to do MT.
if you ever do a high level api call for random access ill strongely recommend to implement a cache to avoid decoding the same block again in consecutive calls, It makes a HUGE difference if the calls demand data from the same block more than once.
So you can close this if you want.
One final question, after calling
lzma_index_end(index,null)
i need to call free(index) too? right now doing that breaks my implementation for some reason for next the time i re-open the file.
Nice to hear you got it working. I didn't review your code thoroughly but a few comments still:
-
Remember to check if
mallocreturnedNULL. -
LZMA_STREAM_INITsetsstream.allocator = NULLalready so doing it explicitly is redundant. -
There's a memory leak because at the end of the loop the code sets
stream = LZMA_STREAM_INITwithout freeing the memory withlzma_end. Simply remove that line from the end of the loop. It's correct to init once before the loop, reuse the stream by only callinglzma_block_decoderandlzma_codeon it, and then at the end calllzma_endonce. -
There is another memory leak with
block_filters. Movelzma_filters_free(block_filters, NULL);to the line afterauto retb = lzma_block_decoder(&stream, &block);so that the filters are always freed after decoder init (with both success and failure).
after calling
lzma_index_end(index,null)
i need to call free(index) too?
No. If you have something like this...
lzma_stream strm = LZMA_STREAM_INIT;
...
lzma_index *index; // Old value is ignored, *not* freed.
lzma_ret ret = lzma_file_info_decoder(&strm, &index);
if (ret != LZMA_OK) {
// No memory was left allocated.
// index is undefined so don't call lzma_index_end(index, NULL).
// If strm had been used by earlier decoder, its memory has
// been freed as if lzma_end(&strm) had been called.
return SOME_ERROR_CODE;
}
...
lzma_end(&strm);
...
lzma_index_end(index, NULL);
...then you have freed the memory.
If you already had your own code for blocks and seeking to use with LZ4, perhaps reusing that code with XZ (or raw LZMA or LZMA2) would be simplest unless, for example, using the xz tool in threaded mode makes it convenient to compress the data into seekable form.
Each XZ Block is decoded using a single thread. Threaded decoder decodes multiple Blocks in parallel. The Index field makes it possible to locate the Blocks and calculate the uncompressed size of the file without seeking into each Block first (fewer seeks helped with HDDs and thousands of Blocks).
If I finally get the seeking API done, repeated calls to lzma_code will decode sequentially just like normal decoder does. If one seeks forward, then it would be smart enough to not restart from the beginning of the current Block. If seeking to another Block (backward or forward), then it would obviously seek to the beginning of the new Block and start decoding from there. I don't currently plan any cache that would quickly return data that was already decoded and returned earlier.
In your use case, if you only need to be able to seek to certain offsets, it helps if you start a new Block at those offsets. If there are many such offsets, then it might not be practical because compression ratio will suffer if Blocks are too small.
I suggest you try also Zstandard. It's much faster to decode with a single thread and compression ratio is often pretty close to XZ. If you already had seeking code for LZ4 use, perhaps it's not much effort to try Zstd that way.
Zstd was actually by first choice but i decided to try XZ first because random access is not built-in in Zstd mainline, it needs to use the frame format that no compressor supports because it is not mainline. I really dont want to make my own compressors too like i had to do with LZ4. Also having ".zstd" files that work and do not work is very confusing.
The big advantage of XZ is that is already in the format, i can compress not only with xz-utils, but with the regular 7z compressor and it will just work as long you choose to use small solid block sizes. The disavantage is that when containers are needed the only real solution is to compress a tar and tar are a big nono in this scenario.
Everything has been petty solid so far, not sure how much more high level you will be able to do it, like random access only using api function right now is:
- Use lzma_file_info_decoder() to get the block index
- Use lzma_index_iter_locate() to get to the block you want
- Read block into buffer and get the header size
- Set up block and decode its header with lzma_block_header_decode()
- Finally use lzma_block_decoder()
i think an api call that does 3, 4 and 5 in one is good option here.
The only thing it bothers me with the index system is is that im not able to use it as an array. Like it is really easy to calculate the block number i need to extract, there is no way (that i can see) to get for example, block 50 directly. But ive adapted to it.
And even implemented MT by decoding in parallel when more than one block is requested.
I think this can be closed now, thanks for your help.
My current method for random access with MT is (im still looking to see if i would be able to speed this up, but i dont think so).
-
A standalone block decoder function, this gets the input block already read in memory, an output buffer where the data will be decoded to, and the block check. That does the entire work of decoding the block header size, block header and block data.
-
A random-access function that will keep decoding until all data requested is copied into memory. Mostly involves in using the index to find the first block needed, seek, read the block into a buffer and use a thread vector to emplace the call to the block decoder function. If we are not done, read the next block and repeat.
This way if im using 1MB blocks and the requested data is 10MB(uncompressed) in length, it will read one block, call the decoding function into a thread pool, read the next block, and repeat 10 times. Then wait for all threads to finish.
Always keeping the last decoded block in memory so if the next time the random-access function is called, if the starting block is the same as the last decoded block for that file it will just memcpy from buffer, then start decoding from the next one (if needed). This is specially usefull when you have a considerable number of small reads one after the other that are way below the block size.
The problem with this is that im petty much forced into 1MB blocks to maximize multithreading, but i dont see an alternative.
I created a branch seekdec. It adds a function to locate a .xz block by block number and a seekable single-threaded decoder that supports seeking by both uncompressed offset and block number. So now you can locate blocks like from an array.
Also a new decoder flag is added which makes lzma_code return LZMA_BLOCK_END at the end of each block. It can be useful if one is always decoding complete .xz blocks. One benefit is that when one sees LZMA_BLOCK_END, one knows that the integrity check of the block has been verified. (If one doesn't care about verifying the integrity check when doing random access, then using the LZMA_IGNORE_CHECK flag will provide a minor speed improvement.)
With this your code could be:
-
Use lzma_file_info_decoder() to get lzma_index.
-
Initialize a seekable decoder for each decoder thread. Use
LZMA_TELL_BLOCK_ENDflag if you wantlzma_codeto returnLZMA_BLOCK_ENDat the end of each block. -
Pick a thread from the pool and seek using
LZMA_SEEK_TO_BLOCK(orLZMA_SEEK_TO_OFFSET) and decode as much data as needed. If you need to know the uncompressed size of a block before starting to decode it, you can use an iterator to locate the block and get the info that way. The seekable decoder itself doesn't provide it.
Your threading method sounds reasonable, and the new functionality in the seekdec branch doesn't change the basic idea (you just need a little less code now). I'm not sure if liblzma should provide threaded seekable decoder. I suspect different use cases would have slightly different requirements, and thus it might be tricky to provide a method that is generic enough.
If you have time to test seekdec, I'm happy to hear your feedback. I will let the code stay on a branch for a while before merging it to master. I won't include it in 5.8.x stable releases; this feature is for 5.9.xalpha/beta and then in 5.10.0 stable some day.
The problem with this is that im petty much forced into 1MB blocks to maximize multithreading, but i dont see an alternative.
Yes, block size is a balance between number of threads and file size.
If you wish to minimize decoder memory usage (or at least address space usage), you could set the LZMA2 dictionary size to 1 MiB as well when compressing with 1 MiB block size. The default LZMA2 dictionary size is 8 MiB so the decoder allocates a 8 MiB buffer, but only 1 MiB of it is actually used when the uncompressed size is 1 MiB.
ill try to test it this week. As for optimisations im petty sure i already did all i can, its not too bad because there are other bottlenecks in other parts of the code, but yeah.
Just to confirm, to disable block integrity check, i just need to set version to 1 and disable check to true AFTER decoding the block header right? In this case is ok to disable it because i can enable it only on debug builds. I was not sure on how to set the flag in this case.
Yes. Decode block header, set block_options.version = 1, block_options.ignore_check = true, and then initialize the block decoder with lzma_block_decoder().
If you use lzma_seekable_decoder() from the seekdec branch, then the same thing is done by passing LZMA_IGNORE_CHECK as the "flags" argument (the third argument).
It improves speed only a few percent at best; you need to figure out if it is a good idea in your use case.
I've just come across this issue after looking into how I might add support for seek-able decoding to the Julia package CodecXz.jl (which uses a C FFI to liblzma). The seekdec branch looks great, is there a good chance of it making its way to a release any time soon @Larhzu?
@tecosaur: It would be in a new stable branch, so whenever xz 5.10.0 is released. A wild guess is the first half of 2026. Have you tested the code in practice?