Zarr.jl

`ShuffleFilter` fails to round trip

Open nhz2 opened this issue 1 year ago • 3 comments

julia> using Zarr

julia> codec = Zarr.ShuffleFilter(elementsize=4)
Zarr.ShuffleFilter(0x0000000000000004)

julia> Zarr.zdecode(Zarr.zencode(UInt8[0x05], codec), codec)
1-element Vector{UInt8}:
 0xe0

From what I can tell, the shuffle filter is missing the "Add leftover to the end of data" step from the HDF5 implementation: https://github.com/HDFGroup/hdf5/blob/f2642985d8c23ff7e876c6228c7cc0cf20515923/src/H5Zshuffle.c#L279-L284

@mkitti am I reading that HDF5 code correctly, and do you know if appending leftover data at the end after the shuffle is a standard thing to do? I can't find a place where this is documented.

nhz2, Dec 21 '24 17:12

Shuffling under Zarr should error if the input array byte count is not a multiple of the element size.

https://github.com/zarr-developers/numcodecs/blob/main/numcodecs%2Fshuffle.py

HDF5 filter implementations should not be assumed to be compatible with their Zarr counterparts.

Additionally, Zarr v2 codecs and Zarr v3 codecs may have subtly distinct behavior and defaults.
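The strict behavior described above (reject input whose byte count is not a multiple of the element size) can be sketched as follows. This is a minimal illustration in plain Python, not the numcodecs or Zarr.jl implementation; the function name `strict_shuffle` and the choice of `ValueError` are assumptions made for the example.

```python
def strict_shuffle(data: bytes, elementsize: int) -> bytes:
    """Byte-shuffle `data`, refusing input that does not divide evenly.

    Hypothetical helper for illustration only.
    """
    if elementsize <= 0:
        raise ValueError("elementsize must be positive")
    if len(data) % elementsize != 0:
        raise ValueError(
            f"byte count {len(data)} is not a multiple of element size {elementsize}"
        )
    # data[i::elementsize] collects byte i of every element, so joining the
    # slices groups all first bytes, then all second bytes, and so on.
    return b"".join(data[i::elementsize] for i in range(elementsize))
```

With this sketch, `strict_shuffle(b"123123", 3)` returns `b"112233"`, while the one-byte input from the report (`b"\x05"` with element size 4) raises an error instead of producing data that cannot be decoded.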

mkitti, Dec 21 '24 19:12

Interesting. I think the shuffle filter was originally supposed to be compatible with HDF5 (ref: https://github.com/fsspec/kerchunk/issues/11), but they took the implementation from https://github.com/HDFGroup/hsds/blob/03890edfa735cc77da3bc06f6cf5de5bd40d1e23/hsds/util/storUtil.py#L43

nhz2, Dec 22 '24 21:12

I've tested in https://github.com/nhz2/ChunkCodecs.jl/pull/6 that HDF5 copies the remaining data to the end if the data length is not evenly divisible by the element size. For example, "12312312312345" with element size 3 gets byte shuffled to "11112222333345".
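The leftover handling described here can be sketched in a few lines. This is an illustrative Python version under the assumption that only the whole elements are shuffled and the trailing bytes are copied through unchanged; the names `shuffle_with_leftover` and `unshuffle_with_leftover` are hypothetical, and the code is not taken from HDF5, hsds, or Zarr.jl.

```python
def shuffle_with_leftover(data: bytes, elementsize: int) -> bytes:
    """Shuffle the whole elements, then append any trailing bytes as-is."""
    count = len(data) // elementsize      # number of complete elements
    body_len = count * elementsize        # bytes covered by complete elements
    body = data[:body_len]
    # body[i::elementsize] is byte i of every complete element.
    shuffled = b"".join(body[i::elementsize] for i in range(elementsize))
    return shuffled + data[body_len:]     # leftover goes at the end


def unshuffle_with_leftover(data: bytes, elementsize: int) -> bytes:
    """Invert shuffle_with_leftover."""
    count = len(data) // elementsize
    body_len = count * elementsize
    body = data[:body_len]
    out = bytearray(body_len)
    for i in range(elementsize):
        # Block i of the shuffled body holds byte i of each element.
        out[i::elementsize] = body[i * count:(i + 1) * count]
    return bytes(out) + data[body_len:]
```

With this sketch, `shuffle_with_leftover(b"12312312312345", 3)` gives `b"11112222333345"`, matching the example above, and the one-byte input from the original report round-trips unchanged with element size 4.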

nhz2, Dec 30 '24 04:12