hdf5-blosc
HDF5 filter and plugin based on c-blosc2?
I would be interested in seeing this plugin updated to work with c-blosc2 code.
That would be great. The new API for C-Blosc2 is backward compatible with C-Blosc, so this should be easy. Just remember that the C-Blosc2 binary format is backward compatible, but not forward compatible.
Does that mean that a HDF5 plugin based on c-blosc1 would not be able to read chunks compressed by a HDF5 plugin based on c-blosc2?
That's correct.
Is there any way to have c-blosc2 produce a backwards-compatible binary format?
I don't think so. At first I was trying to keep a format that was forward compatible, but it was too much hassle, and I decided not to do it.
Hi,
What is the guideline (if there is one) regarding compatibility of registered HDF5 filters, and when should a filter use a different ID?
The bottom line is being able to read old compressed data with new versions of the filter, and I would expect passing parameters to the filter through HDF5 to be compatible as well.
Also, ideally IMO, data written with a new version of the filter, but without using new features, should be readable with older versions of the filter.
But looking at other HDF5 filters, this is not always the case: for instance, ZFP breaks this when updating the underlying compression library, while the bitshuffle filter remained compatible when adding zstd.
By the way, Blosc2 is registered with a new ID, 32026: https://portal.hdfgroup.org/display/support/Filters#Filters-32026
Question: In the event of using c-blosc2 as the compression library for this filter (thus breaking forward compatibility), would older versions of the filter detect that they can't read the chunk and raise an error?
We would need to try, but I am pretty sure that an error will be raised when using C-Blosc1 with chunks generated with C-Blosc2.
On the other hand, the way we are currently using the registered Blosc2 filter (ID 32026) is via a CFrame. The CFrame is more flexible than a regular chunk, and will allow the use of multidimensional metalayers, which should be useful for optimizing dataset reads.
After reflecting more on this, one path that we could follow is to add a check in the current HDF5 Blosc1 filter so that, when an error is detected, it checks whether what we are decompressing is a CFrame, and if so, calls the actual Blosc2 filter. BTW, we have a preliminary version of an HDF5 Blosc2 filter at https://github.com/PyTables/PyTables/tree/direct-chunking-append/hdf5-blosc2/src, and one can use this for the standalone future hdf5-blosc2 filter.
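To make the idea concrete, here is a minimal sketch of how such a dispatch could look in the filter's decompression path. The function name and the assumption that the CFrame holds a single Blosc2 chunk are mine, not the actual filter code:

```c
#include <stdint.h>
#include <blosc2.h>  /* assumes the filter is built against c-blosc2 */

/* Sketch: try the Blosc1 format first; on failure, probe for a Blosc2
 * CFrame and decompress it instead. */
static int decompress_chunk(void *inbuf, int64_t inlen,
                            void *outbuf, int32_t outlen) {
  /* blosc1_decompress() is c-blosc2's name for the backward-compatible
   * Blosc1 API; it handles classic Blosc1 chunks. */
  int status = blosc1_decompress(inbuf, outbuf, (size_t)outlen);
  if (status > 0)
    return status;  /* plain Blosc1 chunk: done */

  /* Not a Blosc1 chunk: check whether the buffer is a Blosc2 CFrame. */
  blosc2_schunk *schunk =
      blosc2_schunk_from_buffer((uint8_t *)inbuf, inlen, false);
  if (schunk == NULL)
    return -1;  /* neither format: report a decompression error */

  /* Assumes the HDF5 chunk was stored as a CFrame with one Blosc2 chunk. */
  status = blosc2_schunk_decompress_chunk(schunk, 0, outbuf, outlen);
  blosc2_schunk_free(schunk);
  return status;
}
```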
I did a very quick test (no shuffle/default compressor) updating hdf5plugin to compile hdf5-blosc with c-blosc2, and indeed it raises an error: `OSError: Can't read data (Blosc decompression error)`
If you want to take the path of switching to c-blosc2 sooner or later (I expect it is better in terms of maintenance) and break forward compatibility, maybe it would be good to add a check of the Blosc format version in the filter, to prepare for forward-compatibility breaks and provide a more explicit message in this case.
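For instance, something along these lines could produce a clearer message. This is only a sketch: it relies on the first byte of a plain Blosc chunk header being the binary format version (a Blosc2 CFrame has a different header) and on `BLOSC_VERSION_FORMAT` from the blosc.h the filter was built against:

```c
#include <stdio.h>
#include <stdint.h>
#include <blosc.h>  /* provides BLOSC_VERSION_FORMAT of the linked library */

/* Sketch: inspect the format-version byte of the chunk header before
 * attempting to decompress, so forward-compatibility breaks are explicit. */
static int check_blosc_format(const uint8_t *chunk) {
  uint8_t format_version = chunk[0];  /* first header byte = format version */
  if (format_version > BLOSC_VERSION_FORMAT) {
    fprintf(stderr,
            "Blosc filter: chunk uses format version %d, but this filter "
            "only supports up to %d; it was likely written with a newer "
            "Blosc library.\n", format_version, BLOSC_VERSION_FORMAT);
    return -1;
  }
  return 0;
}
```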
Great to see an HDF5 Blosc2 filter coming up!
After pondering a bit more about Blosc/Blosc2 compatibility, I think a better approach is to make the two filters totally separate. So the current hdf5-blosc will continue supporting just the C-Blosc 1.x series, while future hdf5-blosc2 will support just C-Blosc 2.x series. Also, having separate HDF5 Filter IDs will help in this.
It would be convenient if we could decompress Blosc1 with c-blosc2 though.
> It would be convenient if we could decompress Blosc1 with c-blosc2 though.
From what I tested, the hdf5-blosc1 filter compiled with c-blosc2 can decompress chunks compressed with c-blosc1 (it is backward compatible)... but not the other way around.
Any updates on this? There seem to be some Blosc2 plugins available (e.g. PyTables), but none support arbitrary filters as far as I can tell. I need BYTEDELTA to get a good compression ratio, and I really want to be able to read/write HDF5 datasets and specify the Blosc2 filters used to compress.
See https://www.hdfgroup.org/wp-content/uploads/2022/05/Blosc2-and-HDF5-European-HUG2022.pdf
I've seen that before; how does that answer my question? Is HDF5 going to adopt that proposal?
Have you seen https://github.com/Blosc/b2h5py ?
I'm not sure if I understand your question. The Blosc2 filter has been registered as ID 32026. There's nothing more for The HDF Group to do.
b2h5py is mostly out of scope for me. To be clear:
- I want SHUFFLE + BYTEDELTA blosc2 compression
- I want to be able to store the compressed data in hdf5 datasets
- I want to generate these datasets from C/C++
- I want it to be easy to read from C/python i.e. I don't want to have to read a uint8_t buffer from hdf5 and then call blosc2 to decompress it
You mention the plugin 32026, where is the authoritative implementation of that plugin?
- It's not in the hdf5 repo
- There's https://github.com/oscargm98/hdf5-blosc2 but that's 32002
- There's the implementation in PyTables, which is 32026 but only allows setting one filter (a two-filter setup is sketched below): `cparams.filters[BLOSC_LAST_FILTER] = doshuffle;`
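For reference, the Blosc2 pipeline is an array of slots, so chaining SHUFFLE with BYTEDELTA would look roughly like this sketch against the c-blosc2 API (`BLOSC_FILTER_BYTEDELTA` comes from the filters registry; `make_cparams` is just an illustrative helper):

```c
#include <blosc2.h>
#include "blosc2/filters-registry.h"  /* BLOSC_FILTER_BYTEDELTA */

/* Sketch: compression params chaining SHUFFLE and BYTEDELTA. */
static blosc2_cparams make_cparams(int32_t typesize) {
  blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
  cparams.typesize = typesize;
  /* Filters run in slot order, so SHUFFLE is applied before BYTEDELTA. */
  cparams.filters[BLOSC_LAST_FILTER - 1] = BLOSC_SHUFFLE;
  cparams.filters[BLOSC_LAST_FILTER] = BLOSC_FILTER_BYTEDELTA;
  return cparams;
}
```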
I guess I could compress data with blosc2 myself and write it with `H5DOwrite_chunk`, trusting that 32026 will just decompress it.
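For anyone else trying this route, here is a rough, untested sketch of the idea, assuming a chunked dataset `dset` created with the Blosc2 filter (ID 32026), a chunk starting at `offset`, and `cparams` set up as in the previous sketch. `H5Dwrite_chunk` is the newer equivalent of `H5DOwrite_chunk`:

```c
#include <stdlib.h>
#include <hdf5.h>
#include <blosc2.h>

/* Sketch: compress one chunk's worth of data as a Blosc2 CFrame and
 * write it directly, bypassing the HDF5 filter pipeline. */
static herr_t write_blosc2_chunk(hid_t dset, const hsize_t *offset,
                                 const void *data, int32_t nbytes,
                                 blosc2_cparams *cparams) {
  blosc2_storage storage = BLOSC2_STORAGE_DEFAULTS;
  storage.cparams = cparams;
  storage.contiguous = true;  /* serialize as a CFrame */

  blosc2_schunk *schunk = blosc2_schunk_new(&storage);
  blosc2_schunk_append_buffer(schunk, data, nbytes);

  uint8_t *cframe = NULL;
  bool needs_free = false;
  int64_t cframe_len = blosc2_schunk_to_buffer(schunk, &cframe, &needs_free);

  /* Filter mask 0 tells HDF5 the data already went through all filters. */
  herr_t err = H5Dwrite_chunk(dset, H5P_DEFAULT, 0, offset,
                              (size_t)cframe_len, cframe);
  if (needs_free)
    free(cframe);
  blosc2_schunk_free(schunk);
  return err;
}
```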
Ok, I tried what I suggested above, but I get an error on decompression because `dparams.schunk = NULL`. If I set `dparams.schunk` to point to the schunk then it decompresses correctly. Specifically, I believe there should be an assignment of `dparams.schunk = schunk;` immediately after this line.
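In context, the proposed fix would look roughly like this (a sketch, not the actual filter source; `schunk` is the super-chunk deserialized from the CFrame):

```c
blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;
dparams.schunk = schunk;  /* proposed fix: let the context reach the schunk */
blosc2_context *dctx = blosc2_create_dctx(dparams);
```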
@froody You are right that, with the current API, we cannot use the full functionality of the Blosc2 pipeline inside HDF5. The solution would be to use the `cd_values` in the HDF5 API in a more imaginative way, but that requires thought and careful execution so as to avoid collisions with existing conventions for storing metainfo in `cd_values` (e.g. n-dim info).
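Purely to illustrate what a "more imaginative" use of `cd_values` might look like, here is a hypothetical sketch; the slot layout (slots 6-7 carrying a filter pipeline) is invented for this example and is not an existing convention of the filter:

```c
#include <hdf5.h>
#include <blosc2.h>
#include "blosc2/filters-registry.h"

/* Hypothetical convention: pass the desired Blosc2 filter pipeline in
 * cd_values slots after the ones the filter reserves for its metainfo. */
static herr_t set_blosc2_pipeline(hid_t dcpl) {
  unsigned cd_values[8] = {0};
  /* slots 0-5 would keep the filter's existing metainfo conventions */
  /* invented slots 6-7: a two-stage pipeline, SHUFFLE then BYTEDELTA */
  cd_values[6] = BLOSC_SHUFFLE;
  cd_values[7] = BLOSC_FILTER_BYTEDELTA;
  return H5Pset_filter(dcpl, 32026, H5Z_FLAG_OPTIONAL, 8, cd_values);
}
```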
Meanwhile, I am glad that you figured out the best workaround, i.e. using direct chunking in HDF5 (via `H5DOwrite_chunk`). FWIW, and if others are reading this, you can use direct chunking in h5py as well. Here you have an example where h5py is using grok for compressing with JPEG2000 via Blosc2: https://gist.github.com/t20100/80960ec46abd3a863e85876c013834bb
> You mention the plugin 32026, where is the authoritative implementation of that plugin?
It's in PyTables: https://github.com/PyTables/PyTables/tree/master/hdf5-blosc2/src
It's also embedded in hdf5plugin for use with h5py. Though, as already mentioned, to have access to all Blosc2 features when writing, one has to use HDF5 direct chunk write.
I think we can close this.