python-zstandard
Provide a file-like object to interface with TarFile
I've got a dataset of about 50,000 files. For easier and faster processing, they are stored in a single tar archive compressed with Zstandard. With a file-like object performing decompression, this would be really simple:
with zstd.open('data.tar.zstd', mode='r') as tar:
    with tarfile.open(fileobj=tar, mode='r') as archive:
        # do something with archive...
The next release has a stream_reader() API that returns an object conforming to the io.RawIOBase interface. It should be possible to use this object anywhere expecting a file-like object in Python, including in tarfile.
Thanks, that's great :-) Will there also be a stream_writer() API to create files?
ZstdCompressor.write_to() can be used to wrap a file object with compression. (It may be renamed to stream_writer() in a future release.)
That being said, supporting tarfile natively might be a bit... funky. The reason is that CPython's tarfile insists on doing a seek(-1, os.SEEK_CUR). This is immediately followed by a read(1). Why it does this, I'm not sure.
The latest commit on master does support seek(), but only when advancing: seeking to a previous offset is not supported.
I coded up a test for tarfile round tripping. It currently fails because the read side needs to seek in reverse (writing seems to work fine).
I agree that tar support is worth pursuing. Let me poke at things to see if there's a reasonable way to implement reading from zstd compressed tar files.
The master branch has a ZstdCompressor.stream_writer() API that implements the io.RawIOBase interface and can therefore be used anywhere Python would use a writable file object.
Regarding the seeking problem, tarfile supports a file mode with a | character denoting that the stream is non-seekable. While I haven't tried, it should be possible to do tarfile.open(mode='w|', fileobj=cctx.stream_writer(...)) to write a zstd compressed tar file.
I've added a test to the test suite that confirms tar reading and writing works with modes r| and w|. The only thing left to do here would be to implement a higher-level API to obtain a tarfile.TarFile instance which is already configured for zstd compression.
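To illustrate why the r| and w| modes sidestep the seeking problem, here is a minimal sketch using only the standard library. ForwardOnlyReader, build_tar and read_tar are hypothetical names invented for this example; ForwardOnlyReader merely stands in for a decompressing reader that cannot seek, and no zstd compression is performed here.

```python
import io
import tarfile


class ForwardOnlyReader(io.RawIOBase):
    """Hypothetical stand-in for a decompressing reader: readable, never seekable."""

    def __init__(self, data):
        self._source = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        # Copy the next bytes into the caller's buffer; 0 signals end of stream.
        chunk = self._source.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)


def build_tar(members):
    """Write an uncompressed tar archive in stream mode ('w|')."""
    sink = io.BytesIO()
    with tarfile.open(fileobj=sink, mode="w|") as archive:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return sink.getvalue()


def read_tar(stream):
    """Read all regular files from a tar archive in stream mode ('r|')."""
    contents = {}
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        # In stream mode, each member must be read while iterating, in order.
        for member in archive:
            if member.isfile():
                contents[member.name] = archive.extractfile(member).read()
    return contents


members = {"a.txt": b"alpha", "b.txt": b"beta"}
restored = read_tar(ForwardOnlyReader(build_tar(members)))
```

Because the reader never has to seek, the same pattern should carry over to any forward-only file object, which is what makes stream mode a natural fit for the objects returned by stream_reader() and stream_writer().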
README says:
The stream returned by stream_reader() is neither writable nor seekable (even if the underlying source is seekable).
which I guess is out of date? Later README says:
The stream returned by stream_reader() is partially seekable. Absolute and relative positions (SEEK_SET and SEEK_CUR) forward of the current position are allowed. Offsets behind the current read position and offsets relative to the end of stream are not allowed and will raise ValueError if attempted.
@indygreg wrote:
That being said, supporting tarfile natively might be a bit... funky. The reason is that CPython's tarfile insists on doing a seek(-1, os.SEEK_CUR). This is immediately followed by a read(1). Why it does this, I'm not sure.
This CPython commit is the one that added the seek followed by a read(1). The commit fixes bpo issue 24259 to detect a truncated tar file when the file is not compressed. In the next() method of the TarFile class, self.offset in the seek statement is the position of the next member in the tar file, and the current position of self.fileobj is at the first byte of the data segments of the current member. So this is not a backward seek.
The difference between the | mode and the : mode of TarFile.open() is that the former uses the _Stream class as a wrapper that does not accept seeking backward, while the latter assumes that seeking backward is possible. Seeking backward is achieved by the lzma and bz2 pure Python modules by implementing a subclass of _compression.BaseStream that has a _rewind() method to start reading from scratch when the seek is backward.
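The rewind idea can be sketched as follows. This is not the code from lzma, bz2 or python-zstandard: RewindingReader is a hypothetical class, and zlib is swapped in for zstd only so the example runs with the standard library. A backward seek restarts decompression from the beginning and then skips forward to the target offset.

```python
import io
import zlib


class RewindingReader(io.RawIOBase):
    """Hypothetical reader that emulates backward seeks by decompressing again."""

    def __init__(self, raw):
        self._raw = raw  # seekable file object holding the compressed bytes
        self._rewind()

    def _rewind(self):
        # Start over: reset the source, the decompressor and the position.
        self._raw.seek(0)
        self._decompressor = zlib.decompressobj()
        self._pending = b""
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def readinto(self, buffer):
        # Refill the pending buffer until data is available or input is exhausted.
        while not self._pending:
            chunk = self._raw.read(8192)
            if not chunk:
                self._pending = self._decompressor.flush()
                break
            self._pending = self._decompressor.decompress(chunk)
        count = min(len(buffer), len(self._pending))
        buffer[:count] = self._pending[:count]
        self._pending = self._pending[count:]
        self._pos += count
        return count

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_CUR:
            offset += self._pos
        elif whence != io.SEEK_SET:
            raise ValueError("this sketch supports SEEK_SET and SEEK_CUR only")
        if offset < self._pos:
            self._rewind()  # backward seek: restart from the beginning
        while self._pos < offset:
            # Forward seek: read and discard until the target is reached.
            if not self.read(min(8192, offset - self._pos)):
                break
        return self._pos


payload = bytes(range(256)) * 64
reader = RewindingReader(io.BytesIO(zlib.compress(payload)))
reader.seek(1000)
head = reader.read(10)
reader.seek(-1, io.SEEK_CUR)  # the backward-relative seek discussed above
last_byte = reader.read(1)
```

The cost of this approach is that every backward seek decompresses the stream again from the start, so it suits occasional short rewinds rather than random access.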
+1 for this issue. Was there any progress on it?
The documentation states that ZstdCompressionReader implements io.RawIOBase; however, it's not registered with the abstract base class. After registering manually, it works (see code below).
import io

import aiohttp
from zstandard import ZstdCompressor

# filename is the path of the file to upload, defined elsewhere.
reader = open(filename, 'rb')
cctx = ZstdCompressor()
reader = cctx.stream_reader(reader)
# TODO: this should not be necessary, but is as of 0.15.1.
if not isinstance(reader, io.RawIOBase):
    io.RawIOBase.register(type(reader))
form = aiohttp.FormData()
form.add_field(
    'file',
    reader,
    content_type='application/zstd',
    filename=filename,
)
I think that all zstd classes should be registered with the appropriate abstract base classes. Do others share this opinion? Is this in the scope of this issue or should I create a new one?
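For anyone unfamiliar with ABC registration, here is a small standard-library sketch of what the workaround does. DuckReader is a hypothetical class made up for this example, not part of python-zstandard.

```python
import io


class DuckReader:
    """Hypothetical file-like class that does not inherit from io.RawIOBase."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._buffer.read(size)


# Before registration the isinstance() check fails, even though read() works.
before = isinstance(DuckReader(b"x"), io.RawIOBase)

# register() declares DuckReader a virtual subclass of the abstract base class.
io.RawIOBase.register(DuckReader)
after = isinstance(DuckReader(b"x"), io.RawIOBase)
```

Registration only changes the result of isinstance() and issubclass(); it adds no methods, so the registered class still has to implement whatever the caller actually invokes.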