Zstd compression does not encode content size in header
The Zstd writer implemented here is based on the Zstd streaming API. When encoding a chunk, the pledged source size is not set, so the frame content size is not encoded in the Zstd frame header.
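For reference, the distinction is easy to reproduce with python-zstandard (a minimal sketch, not TensorStore's encoder; it assumes the installed python-zstandard version provides the module-level frame_content_size() helper and the size= argument to compressobj()):

import zstandard as zstd

data = b"\x05" * 64 * 64  # one 64x64 uint8 chunk's worth of bytes

# One-shot compression: the input size is known up front, so the frame
# header records the content size.
one_shot = zstd.ZstdCompressor(level=3).compress(data)
print(zstd.frame_content_size(one_shot))   # 4096

# Streaming compression without a pledged size: the header cannot record
# the content size, so decoders see ZSTD_CONTENTSIZE_UNKNOWN (-1 here).
cobj = zstd.ZstdCompressor(level=3).compressobj()
streamed = cobj.compress(data) + cobj.flush()
print(zstd.frame_content_size(streamed))   # -1

# Streaming compression with a pledged size restores the header field.
cobj = zstd.ZstdCompressor(level=3).compressobj(size=len(data))
pledged = cobj.compress(data) + cobj.flush()
print(zstd.frame_content_size(pledged))    # 4096

The middle case is what decoders currently see for chunks written by TensorStore.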
Some Zstd decoding implementations such as numcodecs.js and Zarr numcodecs rely upon ZSTD_getFrameContentSize() to allocate a decompression buffer.
https://github.com/zarr-developers/numcodecs/blob/main/numcodecs%2Fzstd.pyx#L182-L184
As written, Zstd-encoded chunks produced by TensorStore will return ZSTD_CONTENTSIZE_UNKNOWN from ZSTD_getFrameContentSize().
xref: https://github.com/google/neuroglancer/issues/625
Here's an illustration of saving a zarr array with TensorStore using Zstd compression; python-zstandard cannot decompress the resulting chunk unless max_output_size is provided.
In [1]: import tensorstore as ts, zstandard as zstd

In [2]: ds = ts.open({
   ...:     'driver': 'zarr',
   ...:     'kvstore': {
   ...:         'driver': 'file',
   ...:         'path': 'tmp/zarr_zstd_dataset',
   ...:     },
   ...:     'metadata': {
   ...:         'compressor': {
   ...:             'id': 'zstd',
   ...:             'level': 3,
   ...:         },
   ...:         'shape': [1024, 1024],
   ...:         'chunks': [64, 64],
   ...:         'dtype': '|u1',
   ...:     }
   ...: }).result()

In [3]: ds[:] = 5

In [4]: with open("tmp/zarr_zstd_dataset/0/0", "rb") as f:
   ...:     src = f.read()
   ...:

In [5]: zstd.backend_c.frame_content_size(src)
Out[5]: -1

In [6]: zstd.ZstdDecompressor().decompress(src)
---------------------------------------------------------------------------
ZstdError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 zstd.ZstdDecompressor().decompress(src)

ZstdError: could not determine content size in frame header

In [7]: zstd.ZstdDecompressor().decompress(src, max_output_size=1024*1024)
Out[7]: b'\x05\x05\x05\x05\x05\x05\x05\x05\x05 [...] \x05\x05\x05 '
For an example of being unable to open the dataset with zarr-python see https://github.com/zarr-developers/zarr-python/issues/2056
Thanks for investigating this so thoroughly!
We can probably ensure that tensorstore includes the uncompressed size in the header in this case, but in general there could be multiple variable-output-size codecs chained, and it is desirable to be able to do streaming encoding.
Therefore, in addition to that, other implementations should still support decoding without the size in the header.
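To illustrate, a decoder can use the streaming API and grow its output as it goes rather than pre-allocating from the header (a sketch with python-zstandard; decode_chunk is a hypothetical helper, not part of any library):

import io
import zstandard as zstd

def decode_chunk(compressed: bytes) -> bytes:
    """Decompress a single Zstd frame without relying on the frame content size."""
    dctx = zstd.ZstdDecompressor()
    out = io.BytesIO()
    # copy_stream() decompresses incrementally and grows the output buffer as
    # needed, so no ZSTD_getFrameContentSize()-style lookup is required.
    dctx.copy_stream(io.BytesIO(compressed), out)
    return out.getvalue()

Applied to the src bytes from the transcript above, this returns the full 4096-byte chunk without needing max_output_size.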
For zarr v3 the pledged size is now specified, but not for zarr v2; that might be fixed by a later refactor of the zarr v2 codec handling.
Python's numcodecs can now decompress without a pledge size since https://github.com/zarr-developers/numcodecs/pull/707 has been merged.
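For example, with a numcodecs release that includes that change, the chunk written in the transcript above should decode without a preallocated buffer (a sketch; the path assumes that dataset still exists on disk):

import numcodecs

with open("tmp/zarr_zstd_dataset/0/0", "rb") as f:
    src = f.read()

# With the linked fix, Zstd.decode() handles frames whose header does not
# carry the content size.
decoded = numcodecs.Zstd().decode(src)
assert bytes(decoded) == b"\x05" * 64 * 64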