python-zstandard Provide a file-like object to interface with TarFile

I've got a dataset of about 50,000 files. For easier and faster processing, these are stored in a single tar archive compressed with Zstandard. With a file-like object performing decompression, reading them would be really simple:

with zstd.open('data.tar.zstd', mode='r') as tar:
    with tarfile.open(fileobj=tar, mode='r') as archive:
        # do something with archive...

May 10 '17 19:05 bitzl

The next release has a stream_reader() API that returns an object conforming to the io.RawIOBase interface. It should be possible to use this object anywhere expecting a file-like object in Python, including in tarfile.

Sep 17 '17 00:09 indygreg

Thanks, that's great :-) Will there also be a stream_writer() API to create files?

Sep 17 '17 08:09 bitzl

ZstdCompressor.write_to() can be used to wrap a file object with compression. (It may be renamed to stream_writer() in a future release.)

That being said, supporting tarfile natively might be a bit... funky. The reason is that CPython's tarfile insists on doing a seek(-1, os.SEEK_CUR). This is immediately followed by a read(1). Why it does this, I'm not sure.

The latest commit on master does support seek(). But only if advancing: seeking to a previous offset is not supported.

I coded up a test for tarfile round-tripping, although it currently fails due to the need to seek in reverse on the read side of things (writing seems to work fine).

I agree that tar support is worth pursuing. Let me poke at things to see if there's a reasonable way to implement reading from zstd compressed tar files.

Mar 26 '18 03:03 indygreg

The master branch has a ZstdCompressor.stream_writer() API that implements the io.RawIOBase interface and can therefore be used anywhere Python would use a writable file object.

Regarding the seeking problem, tarfile supports a file mode with a | character denoting that the stream is non-seekable. While I haven't tried it, it should be possible to do tarfile.open(mode='w|', fileobj=cctx.stream_writer(...)) to write a zstd compressed tar file.

Feb 17 '19 03:02 indygreg

I've added a test to the test suite that confirms tar reading and writing works with modes r| and w|. The only thing left to do here would be to implement a higher-level API to obtain a tarfile.TarFile instance which is already configured for zstd compression.
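
For readers who want to see what the r| and w| modes require from a file object, here is a minimal sketch using only the standard library. ForwardOnly, write_tar_stream and read_tar_stream are hypothetical names made up for this illustration; ForwardOnly is just a stand-in for a compression stream that offers read() and write() but no seek(), and it is not part of python-zstandard.

```python
import io
import tarfile


class ForwardOnly(io.RawIOBase):
    """Hypothetical stand-in for a compression stream: read/write only, no seek()."""

    def __init__(self, raw):
        self._raw = raw

    def readable(self):
        return True

    def writable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        return self._raw.readinto(b)

    def write(self, b):
        return self._raw.write(b)


def write_tar_stream(fileobj, files):
    """Write a {name: bytes} mapping to fileobj as a tar stream (mode 'w|')."""
    with tarfile.open(fileobj=fileobj, mode="w|") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def read_tar_stream(fileobj):
    """Read all regular files from a tar stream (mode 'r|'), in archive order."""
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        # In stream mode each member must be read while it is the current one.
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


buffer = io.BytesIO()
write_tar_stream(ForwardOnly(buffer), {"a.txt": b"alpha", "b.txt": b"beta"})
buffer.seek(0)
restored = read_tar_stream(ForwardOnly(buffer))
```

Passing cctx.stream_writer(fh) or dctx.stream_reader(fh) in place of ForwardOnly(buffer) should be the zstd equivalent, as suggested earlier in this thread, but that combination is not exercised by this sketch.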

Feb 17 '19 19:02 indygreg

README says:

The stream returned by stream_reader() is neither writable nor seekable (even if the underlying source is seekable).

which I guess is out of date? Later on, the README says:

The stream returned by stream_reader() is partially seekable. Absolute and relative positions (SEEK_SET and SEEK_CUR) forward of the current position are allowed. Offsets behind the current read position and offsets relative to the end of stream are not allowed and will raise ValueError if attempted.

Jun 26 '19 05:06 anpc

@indygreg wrote:

That being said, supporting tarfile natively might be a bit... funky. The reason is that CPython's tarfile insists on doing a seek(-1, os.SEEK_CUR). This is immediately followed by a read(1). Why it does this, I'm not sure.

This CPython commit is the one that added the seek followed by a read(1). It fixes bpo issue 24259 by detecting a truncated tar file when the file is not compressed. In the next() method of the TarFile class, self.offset in the seek statement is the position of the next member in the tar file, and the current position of self.fileobj is at the first byte of the data segment of the current member. So this is not a backward seek.

The difference between the | mode and the : mode of TarFile.open() is that the former uses the _Stream class as a wrapper that does not accept seeking backward, while the latter assumes that seeking backward is possible. The lzma and bz2 pure Python modules achieve backward seeking by implementing a subclass of _compression.BaseStream that has a _rewind() method to start reading from scratch when the seek is backward.
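
To illustrate the rewind idea, here is a rough sketch under a few assumptions. RewindingReader is a hypothetical class written for this comment, not code from CPython or python-zstandard, and zlib is used only as a standard-library stand-in for a zstd decompressor. A backward seek is emulated by going back to the start of the compressed source and decompressing again up to the requested offset.

```python
import io
import zlib


class RewindingReader(io.RawIOBase):
    """Hypothetical reader that emulates backward seeks by starting over.

    zlib stands in for zstd here; the source must itself be seekable.
    """

    def __init__(self, source):
        self._source = source
        self._rewind()

    def _rewind(self):
        # Start from scratch: fresh decompressor, position 0.
        self._source.seek(0)
        self._decomp = zlib.decompressobj()
        self._pending = b""
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def readinto(self, b):
        # Feed compressed chunks until some decompressed data is available.
        while not self._pending:
            chunk = self._source.read(8192)
            if not chunk:
                break
            self._pending = self._decomp.decompress(chunk)
        n = min(len(b), len(self._pending))
        b[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        self._pos += n
        return n

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_CUR:
            offset += self._pos
        elif whence != io.SEEK_SET:
            raise ValueError("this sketch supports SEEK_SET and SEEK_CUR only")
        if offset < 0:
            raise ValueError("negative seek position")
        if offset < self._pos:
            self._rewind()  # backward seek: decompress again from the beginning
        while self._pos < offset:
            # Forward seek: read and discard until the target is reached.
            if not self.read(min(8192, offset - self._pos)):
                break
        return self._pos


payload = bytes(range(256)) * 100
reader = RewindingReader(io.BytesIO(zlib.compress(payload)))
reader.seek(5000)
reader.seek(-1000, io.SEEK_CUR)  # triggers a rewind
chunk_after_rewind = reader.read(3)
```

The cost of a backward seek is proportional to the target offset, since everything before it is decompressed again, so this only makes sense when backward seeks are rare.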

Jan 15 '20 10:01 xdegaye

+1 for this issue. Was there any progress on it?

Sep 25 '20 00:09 ftrofin

The documentation states that ZstdCompressionReader implements io.RawIOBase; however, it is not registered with the abstract base class. After registering it manually, it works (see code below).

import io

import aiohttp
from zstandard import ZstdCompressor

# `filename` is assumed to hold the path of the file to upload.
reader = open(filename, 'rb')
cctx = ZstdCompressor()
reader = cctx.stream_reader(reader)
# TODO: this should not be necessary, but is as of 0.15.1.
if not isinstance(reader, io.RawIOBase):
    io.RawIOBase.register(type(reader))
form = aiohttp.FormData()
form.add_field(
    'file',
    reader,
    content_type='application/zstd',
    filename=filename,
)

I think that all zstd classes should be registered with the appropriate abstract base classes. Do others share this opinion? Is this in the scope of this issue or should I create a new one?
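
For anyone unsure what the registration changes, here is a small standard-library sketch. DuckReader is a hypothetical class invented for this example. Registering it with io.RawIOBase makes isinstance() checks pass, but it only creates a virtual subclass: no methods are added, so the class still has to provide whatever the consumer actually calls.

```python
import io


class DuckReader:
    """Hypothetical reader with file-like methods but no io base class."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def read(self, size=-1):
        return self._buffer.read(size)

    def readable(self):
        return True


reader = DuckReader(b"payload")
before = isinstance(reader, io.RawIOBase)  # not a real or virtual subclass yet

# register() makes DuckReader a virtual subclass of io.RawIOBase.
io.RawIOBase.register(DuckReader)
after = isinstance(reader, io.RawIOBase)

# Registration does not inject RawIOBase methods such as readinto().
has_readinto = hasattr(reader, "readinto")
```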

Jul 07 '21 12:07 chaoflow
