
Debugging corrupted bitshuffle data

Open telegraphic opened this issue 3 years ago • 2 comments

Hi @kiyo-masui, we have some SETI data stored with bitshuffle compression, and a small number of files appear to have become corrupted. (Here is one, FYI: https://bldata.berkeley.edu/blpd30_datax2/blc03_guppi_59132_36704_HIP111595_0078.rawspec.0002.h5)

h5py is happy to open the file, but barfs if you try and access the bitshuffled dataset:

In [3]: a = h5py.File('blc03_guppi_59132_36704_HIP111595_0078.rawspec.0002.h5', 'r')
In [4]: a['data']
Out[4]: <HDF5 dataset "data": shape (279, 1, 65536), type "<f4">

In [5]: d = a['data'][:]
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-5-fee15ce54759> in <module>
----> 1 d = a['data'][:]

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/opt/anaconda3/lib/python3.8/site-packages/h5py/_hl/dataset.py in __getitem__(self, args)
    571         mspace = h5s.create_simple(mshape)
    572         fspace = selection.id
--> 573         self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
    574
    575         # Patch up the output for NumPy

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.read()

h5py/_proxy.pyx in h5py._proxy.dset_rw()

h5py/_proxy.pyx in h5py._proxy.H5PY_H5Dread()

OSError: Can't read data (filter returned failure during read)

Do you think this file is recoverable (or partly recoverable)? Is there any way to turn on extra debug info in bitshuffle to help diagnose why it fails, and/or can bitshuffle skip over 'bad' chunks?
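
As a first step towards answering the "partly recoverable" question, the dataset can be read one chunk at a time so that a single failing chunk does not abort the whole read. The following is a minimal sketch, not something bitshuffle provides: `find_bad_chunks` and `scan_file` are hypothetical helper names, `Dataset.iter_chunks()` needs h5py 3.0 or later, and importing `bitshuffle.h5` is assumed to be how the filter gets registered (the `hdf5plugin` package is an alternative).

```python
def find_bad_chunks(dset):
    """Return (good, bad) lists of chunk selections for a chunked dataset.

    `dset` is anything with h5py's Dataset interface: `iter_chunks()`
    yielding tuples of slices, and slicing that raises OSError when the
    filter pipeline fails on a chunk.
    """
    good, bad = [], []
    for sel in dset.iter_chunks():
        try:
            dset[sel]  # forces decompression of just this chunk
        except OSError:
            bad.append(sel)
        else:
            good.append(sel)
    return good, bad


def scan_file(path, name="data"):
    """Hypothetical helper: open `path` and report unreadable chunks."""
    import h5py
    # Assumption: importing this module registers the bitshuffle HDF5 filter.
    import bitshuffle.h5  # noqa: F401

    with h5py.File(path, "r") as f:
        good, bad = find_bad_chunks(f[name])
    print(f"{len(good)} readable chunks, {len(bad)} unreadable")
    for sel in bad:
        print("unreadable:", sel)
    return good, bad
```

Calling `scan_file('blc03_guppi_59132_36704_HIP111595_0078.rawspec.0002.h5')` would then list which chunk selections raise the "filter returned failure" error; the readable selections can be copied into a fresh array, leaving the failed ones as NaN.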

telegraphic avatar Oct 12 '22 05:10 telegraphic

With a bit of hacking, I think you should be able to recover most of the data. First, I would just add print statements in bshuf_h5filter.c to figure out exactly which function is returning an error code and the value of that code (the core functions of bitshuffle return some specific error codes with meanings).

kiyo-masui avatar Oct 13 '22 13:10 kiyo-masui

Thanks @kiyo-masui, I'll take a look following that strategy.

As it's an issue with decompression, looks like here is a good place to start: https://github.com/kiyo-masui/bitshuffle/blob/fdfcd404ac8dcb828857a90c559d36d8ac4c2968/src/bshuf_h5filter.c#L183

Which calls: https://github.com/kiyo-masui/bitshuffle/blob/ac791b73d164068661566bbe4335fc7158372c49/src/bitshuffle.c#L238

And then each block is done with: https://github.com/kiyo-masui/bitshuffle/blob/fdfcd404ac8dcb828857a90c559d36d8ac4c2968/src/bitshuffle.c#L78
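
Before patching the C code, the per-block structure can also be sanity-checked from Python on the raw (still compressed) chunk bytes. This is only a sketch under stated assumptions about the on-disk layout, taken from my reading of the linked sources and not from any documented format: a 12-byte chunk header (8-byte big-endian uncompressed size, then 4-byte big-endian block size in bytes), followed by blocks that each start with a 4-byte big-endian compressed length, with any trailing elements (fewer than 8) stored uncompressed. It assumes the data was written with the LZ4 option, and `walk_chunk` is a hypothetical name.

```python
import struct

BLOCKED_MULT = 8  # assumed: bitshuffle works on multiples of 8 elements


def walk_chunk(raw, elem_size):
    """Walk the block headers of one raw chunk (assumed layout, see above).

    Returns (n_blocks_ok, problem), where problem is None if every block
    header is consistent with the chunk length, else a short description.
    Only the framing is checked, not the LZ4 payload itself.
    """
    if len(raw) < 12:
        return 0, "chunk shorter than 12-byte header"
    nbytes_uncomp, block_bytes = struct.unpack(">QI", raw[:12])
    if block_bytes == 0 or block_bytes % elem_size:
        return 0, f"implausible block size {block_bytes}"

    n_elem = nbytes_uncomp // elem_size
    block_elems = block_bytes // elem_size
    # Full blocks, then one shorter block rounded down to a multiple of 8.
    sizes = [block_elems] * (n_elem // block_elems)
    last = (n_elem % block_elems) // BLOCKED_MULT * BLOCKED_MULT
    if last:
        sizes.append(last)
    leftover = (n_elem % BLOCKED_MULT) * elem_size  # stored uncompressed

    pos = 12
    for i in range(len(sizes)):
        if pos + 4 > len(raw):
            return i, f"block {i}: header runs past end of chunk"
        (clen,) = struct.unpack(">I", raw[pos:pos + 4])
        pos += 4 + clen
        if pos > len(raw):
            return i, f"block {i}: compressed length {clen} runs past end"
    if pos + leftover != len(raw):
        extra = len(raw) - pos - leftover
        return len(sizes), f"{extra} unexplained trailing bytes"
    return len(sizes), None
```

The raw bytes for a chunk can be fetched with h5py's low-level `dset.id.read_direct_chunk(offset)`, which returns a `(filter_mask, bytes)` pair without running the filter, and `elem_size` would be `dset.dtype.itemsize`. A chunk whose headers walk cleanly but still fails to decompress points at a damaged payload inside a block, not at truncation.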

telegraphic avatar Oct 17 '22 05:10 telegraphic