Wish: extract parts of gzip file

Open: ole-tange opened this issue 8 years ago • 1 comment

My users often have huge .gz files that they would like to process in parallel.

Can gzrt be adapted so it can extract a valid gz-file in blocks?

Let us assume I have a 1 GB file.gz and I want to extract blocks of around 1 MB of compressed data. I want to do this in parallel. So first I want to identify positions where a valid gz-block starts:

$ gzrt --next-start-of-block 0
0
$ gzrt --next-start-of-block 1000000
1234888
$ gzrt --next-start-of-block 2000000
2123488
...
$ gzrt --next-start-of-block 999000000
999348877

The idea is to seek to the given byte position and then find the next valid gz-block. Once it is found, print its byte position and exit.

After identifying where blocks start I would then be able to extract from one block to another:

gzrt --from-byte 0 --to-byte 1234888 | my_program &
gzrt --from-byte 1234888 --to-byte 2123488 | my_program &
gzrt --from-byte 2123488 --to-byte 3212348 | my_program &
...
gzrt --from-byte 998374753 --to-byte 999348877 | my_program &
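To make the pairing of offsets concrete, here is a rough Python sketch of how the reported block starts could be turned into from/to ranges and command lines. Note that `--from-byte` and `--to-byte` are the options proposed in this issue, not existing gzrt options, and `plan_ranges` and `to_commands` are hypothetical helper names used only for illustration.

```python
def plan_ranges(block_starts, file_size):
    """Pair each block start with the next one to get (from, to) byte ranges.

    block_starts: offsets as they would be reported by the proposed
                  --next-start-of-block option (hypothetical).
    file_size:    size of the .gz file in bytes, used as the final end offset.
    """
    # Probes from different seek positions may report the same block start,
    # so remove duplicates before pairing.
    starts = sorted(set(block_starts))
    ends = starts[1:] + [file_size]
    return [(s, e) for s, e in zip(starts, ends) if s < e]


def to_commands(ranges, program="my_program"):
    """Render one shell command per range, using the proposed (hypothetical) flags."""
    return [
        f"gzrt --from-byte {start} --to-byte {end} | {program} &"
        for start, end in ranges
    ]


if __name__ == "__main__":
    # Example offsets taken from the probes above; the file size is made up.
    ranges = plan_ranges([0, 1234888, 2123488, 1234888], file_size=3000000)
    for line in to_commands(ranges):
        print(line)
```

Deduplicating the starts matters because two probes that land inside the same block would both report the same next block start.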

ole-tange, Jan 20 '17 15:01

Right now there's no option to start and end at a particular byte offset. The very nature of gzrecover is such that it can start extracting good data at any location, however.

The challenge is that gzip doesn't use fixed block sizes and thus you cannot predict where there will be a block boundary. So if you told it to start at byte 1234888, there's no guarantee this would not be in the middle of a block, and thus you'd lose any data in that block.
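As a small illustration of this point (not of how gzrecover works internally), the following Python sketch uses the standard `zlib` module to inflate a raw deflate stream from its start and then from an arbitrary byte offset in the middle. The helper name `try_inflate_from` and the sample data are made up for the example. Deflate blocks carry no sync marker and need not begin on a byte boundary, so the mid-stream attempt will typically be rejected by zlib or yield data that does not match the original.

```python
import zlib


def try_inflate_from(raw: bytes, offset: int):
    """Try to inflate a raw deflate stream starting at a byte offset.

    Returns the decoded bytes, or None if zlib rejects the data.
    """
    # wbits=-15 selects raw deflate: no gzip/zlib header or trailer expected.
    decoder = zlib.decompressobj(wbits=-15)
    try:
        return decoder.decompress(raw[offset:])
    except zlib.error:
        return None


# Sample input: repetitive text, compressed as a raw deflate stream.
data = b"".join(b"line %d of some sample text\n" % i for i in range(20000))
compressor = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
raw = compressor.compress(data) + compressor.flush()

whole = try_inflate_from(raw, 0)               # start of the stream
middle = try_inflate_from(raw, len(raw) // 2)  # arbitrary mid-stream offset

if __name__ == "__main__":
    print("from start matches original:", whole == data)
    print("mid-stream attempt rejected by zlib:", middle is None)
```

A tool that wants to resume mid-file therefore has to search for a position where decoding succeeds, and whatever lies between the requested offset and that position is lost.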

arenn, Jan 20 '17 15:01