Debugging corrupted bitshuffle data
Hi @kiyo-masui, we have some SETI data stored with bitshuffle compression, and a small number of files appear to have become corrupted. (Here is one, FYI: https://bldata.berkeley.edu/blpd30_datax2/blc03_guppi_59132_36704_HIP111595_0078.rawspec.0002.h5)
h5py is happy to open the file, but barfs when you try to read the bitshuffled dataset:
In [3]: a = h5py.File('blc03_guppi_59132_36704_HIP111595_0078.rawspec.0002.h5', 'r')
In [4]: a['data']
Out[4]: <HDF5 dataset "data": shape (279, 1, 65536), type "<f4">
In [5]: d = a['data'][:]
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-5-fee15ce54759> in <module>
----> 1 d = a['data'][:]
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
~/opt/anaconda3/lib/python3.8/site-packages/h5py/_hl/dataset.py in __getitem__(self, args)
571 mspace = h5s.create_simple(mshape)
572 fspace = selection.id
--> 573 self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
574
575 # Patch up the output for NumPy
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.DatasetID.read()
h5py/_proxy.pyx in h5py._proxy.dset_rw()
h5py/_proxy.pyx in h5py._proxy.H5PY_H5Dread()
OSError: Can't read data (filter returned failure during read)
Do you think this file is recoverable (or partly recoverable)? Is there any way to turn on extra debug info in bitshuffle to help diagnose why it fails, and/or can bitshuffle skip over 'bad' chunks?
With a bit of hacking, I think you should be able to recover most of the data. First, I would just add print statements in bshuf_h5filter.c to figure out exactly which function is returning an error code and what the value of that code is (the core bitshuffle functions return specific error codes with defined meanings).
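On the Python side, a rough way to find out how much of the dataset is still readable is to read it one slice at a time along the first axis and catch the `OSError` for each slice that fails. This is only a sketch: the helper names `read_readable_slices` and `recover` are made up for illustration, and it assumes the chunks do not span the whole first axis (check `dset.chunks`; a single bad chunk fails every slice that overlaps it).

```python
def read_readable_slices(dset):
    """Try to read each slice along axis 0 of a dataset-like object.

    Returns (good, bad): good maps index -> data, bad lists (index, error text).
    """
    good, bad = {}, []
    for i in range(dset.shape[0]):
        try:
            # Indexing forces HDF5 to run the decompression filter
            # on whichever chunk(s) hold slice i.
            good[i] = dset[i]
        except OSError as err:
            bad.append((i, str(err)))
    return good, bad


def recover(path, name="data"):
    """Hypothetical helper: load what can be read, NaN-fill the rest.

    Assumes a floating-point dataset (here "<f4"), so NaN is representable.
    """
    import h5py
    import numpy as np

    with h5py.File(path, "r") as f:
        dset = f[name]
        good, bad = read_readable_slices(dset)
        out = np.full(dset.shape, np.nan, dtype=dset.dtype)
        for i, block in good.items():
            out[i] = block
    return out, bad
```

Calling `recover('blc03_guppi_59132_36704_HIP111595_0078.rawspec.0002.h5')` would then give an array with the unreadable slices left as NaN, plus the list of slice indices that failed, which also tells you which chunks to look at when adding the print statements.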
Thanks @kiyo-masui, I'll take a look following that strategy.
As it's an issue with decompression, looks like here is a good place to start: https://github.com/kiyo-masui/bitshuffle/blob/fdfcd404ac8dcb828857a90c559d36d8ac4c2968/src/bshuf_h5filter.c#L183
Which calls: https://github.com/kiyo-masui/bitshuffle/blob/ac791b73d164068661566bbe4335fc7158372c49/src/bitshuffle.c#L238
And then each block is done with: https://github.com/kiyo-masui/bitshuffle/blob/fdfcd404ac8dcb828857a90c559d36d8ac4c2968/src/bitshuffle.c#L78
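Before patching the C code, it may also help to look at the raw chunk bytes from Python. The sketch below rests on my reading of the linked sources, so treat the layout as an assumption to verify: with the LZ4 option, each HDF5 chunk appears to start with a 12-byte header (big-endian uint64 uncompressed size, then big-endian uint32 block size in bytes), followed by blocks that each carry a 4-byte big-endian compressed length. The names `parse_bshuf_chunk` and `inspect_chunks` are hypothetical. Any uncompressed leftover bytes at the end of a chunk would be misread as a block header here.

```python
import struct


def parse_bshuf_chunk(raw):
    """Walk the assumed bitshuffle+LZ4 layout of one raw HDF5 chunk.

    Returns (nbytes_uncomp, block_bytes, blocks), where blocks is a list of
    (offset, compressed_length, fits_in_chunk) tuples.
    """
    if len(raw) < 12:
        raise ValueError("chunk shorter than the 12-byte header")
    nbytes_uncomp, block_bytes = struct.unpack(">QI", raw[:12])
    pos, blocks = 12, []
    while pos + 4 <= len(raw):
        (nbytes_block,) = struct.unpack(">I", raw[pos:pos + 4])
        end = pos + 4 + nbytes_block
        if end > len(raw):
            # Length field points past the end of the chunk: likely corruption.
            blocks.append((pos, nbytes_block, False))
            break
        blocks.append((pos, nbytes_block, True))
        pos = end
    return nbytes_uncomp, block_bytes, blocks


def inspect_chunks(path, name="data"):
    """Print header fields for every stored chunk, bypassing the filter."""
    import h5py

    with h5py.File(path, "r") as f:
        dsid = f[name].id
        for i in range(dsid.get_num_chunks()):
            info = dsid.get_chunk_info(i)
            _, raw = dsid.read_direct_chunk(info.chunk_offset)
            nbytes_uncomp, block_bytes, blocks = parse_bshuf_chunk(raw)
            print(info.chunk_offset, info.size, nbytes_uncomp, block_bytes,
                  all(ok for _, _, ok in blocks))
```

A chunk whose header values differ from its neighbours, or whose block lengths run past the end of the chunk, is a good candidate for the one that makes the filter return a failure.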