Zstd compression does not encode content size in header

Open · mkitti opened this issue 1 year ago · 2 comments

The Zstd writer implemented here is based on the Zstd streaming API. When encoding a chunk, ZSTD_CCtx_setPledgedSrcSize() is not used to pledge the input size, which means the frame content size is not encoded in the Zstd frame header.

Some Zstd decoding implementations such as numcodecs.js and Zarr numcodecs rely upon ZSTD_getFrameContentSize() to allocate a decompression buffer.

https://github.com/zarr-developers/numcodecs/blob/main/numcodecs%2Fzstd.pyx#L182-L184

As written here, ZSTD_getFrameContentSize() will return ZSTD_CONTENTSIZE_UNKNOWN for chunks encoded with Zstd by TensorStore.
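
To make the distinction concrete, here is a minimal python-zstandard sketch (only an analogy for what tensorstore's C++ writer does): a one-shot compress records the content size in the frame header by default, while streaming compression with no pledged size does not.

import zstandard as zstd

data = b"\x05" * 4096

# One-shot compression writes the frame content size by default.
one_shot = zstd.ZstdCompressor(level=3).compress(data)
print(zstd.frame_content_size(one_shot))   # 4096

# Streaming compression with no size pledged up front leaves it unknown.
cobj = zstd.ZstdCompressor(level=3).compressobj()
streamed = cobj.compress(data) + cobj.flush()
print(zstd.frame_content_size(streamed))   # -1 (ZSTD_CONTENTSIZE_UNKNOWN)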

xref: https://github.com/google/neuroglancer/issues/625

mkitti · Jul 26 '24 01:07

Here's an illustration of saving a zarr array with tensorstore using zstd compression. python-zstandard is unable to decompress the chunk unless max_output_size is provided.

In [1]: import tensorstore as ts, zstandard as zstd

In [2]: ds = ts.open({
   ...:     'driver': 'zarr',
   ...:     'kvstore': {
   ...:         'driver': 'file',
   ...:         'path': 'tmp/zarr_zstd_dataset',
   ...:     },
   ...:     'metadata': {
   ...:         'compressor': {
   ...:             'id': 'zstd',
   ...:             'level': 3,
   ...:         },
   ...:         'shape': [1024, 1024],
   ...:         'chunks': [64, 64],
   ...:         'dtype': '|u1',
   ...:     }
   ...: }).result()

In [3]: ds[:] = 5

In [4]: with open("tmp/zarr_zstd_dataset/0/0", "rb") as f:
   ...:     src = f.read()
   ...: 

In [5]: zstd.backend_c.frame_content_size(src)
Out[5]: -1

In [6]: zstd.ZstdDecompressor().decompress(src)
---------------------------------------------------------------------------
ZstdError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 zstd.ZstdDecompressor().decompress(src)

ZstdError: could not determine content size in frame header

In [7]: zstd.ZstdDecompressor().decompress(src, max_output_size=1024*1024)
Out[7]: b'\x05\x05\x05\x05\x05\x05\x05\x05\x05 [...] \x05\x05\x05 '

For an example of being unable to open the dataset with zarr-python, see https://github.com/zarr-developers/zarr-python/issues/2056

mkitti · Jul 26 '24 04:07

Thanks for investigating this so thoroughly!

We can probably ensure that tensorstore includes the uncompressed size in the header in this case, but in general there could be multiple variable-output-size codecs chained, and it is desirable to be able to do streaming encoding.

Therefore, in addition to that, other implementations should still support decoding without the size in the header.
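
As a sketch of what that can look like in Python (using python-zstandard's streaming decompressor, which needs neither the content size in the header nor a max_output_size guess):

import io
import zstandard as zstd

def decode_unknown_size(src: bytes) -> bytes:
    # Streaming decompression grows the output incrementally, so the frame
    # header does not need to carry the decompressed size.
    out = io.BytesIO()
    with zstd.ZstdDecompressor().stream_reader(io.BytesIO(src)) as reader:
        while True:
            chunk = reader.read(1 << 20)
            if not chunk:
                break
            out.write(chunk)
    return out.getvalue()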

jbms · Jul 26 '24 05:07

For zarr v3 the pledged size is specified, but not for zarr v2. This might be fixed by a later refactor of the zarr v2 codec handling.

jbms · Jan 10 '25 00:01

Python's numcodecs can now decompress without a pledged size, since https://github.com/zarr-developers/numcodecs/pull/707 has been merged.
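
A quick sanity check of that fix might look like the following (assuming a numcodecs release that contains the merged PR):

import numcodecs
import zstandard as zstd

# Build a frame whose header lacks the content size (streaming path, no pledged size).
cobj = zstd.ZstdCompressor(level=3).compressobj()
frame = cobj.compress(b"\x05" * 4096) + cobj.flush()

# With zarr-developers/numcodecs#707 merged, decoding no longer requires
# the content size to be present in the frame header.
decoded = numcodecs.Zstd().decode(frame)
assert bytes(decoded) == b"\x05" * 4096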

mkitti · Jul 10 '25 22:07