hdf5-blosc
HDF5 filter and plugin based on c-blosc2?
I would be interested in seeing this plugin updated to work with c-blosc2 code.
That would be great. The new API for C-Blosc2 is backward compatible with C-Blosc, so this should be easy. Just remember that the C-Blosc2 binary format is backward compatible, but not forward compatible.
Does that mean that a HDF5 plugin based on c-blosc1 would not be able to read chunks compressed by a HDF5 plugin based on c-blosc2?
That's correct.
Is there any way to have c-blosc2 produce a backwards-compatible binary format?
I don't think so. At first I was trying to keep a format that was forward compatible, but it was too much hassle, and I decided not to do it.
Hi,
What is the guideline (if there is one) regarding compatibility of registered HDF5 filters, and when should a filter use a different ID?
The bottom line is being able to read old compressed data with new versions of the filter, and I would expect passing parameters to the filter through HDF5 to be compatible as well.
Also, ideally IMO, data written with a new version of the filter, but without using new features, should be readable with older versions of the filter.
But looking at other HDF5 filters, this is not always the case: for instance, ZFP breaks this when updating the underlying compression library, while the bitshuffle filter remained compatible when adding zstd.
By the way, Blosc2 is registered with a new ID, 32026: https://portal.hdfgroup.org/display/support/Filters#Filters-32026
Question: In the event of using c-blosc2 as the compression library for this filter (thus breaking forward compatibility), would older versions of the filter detect that they can't read the chunk and raise an error?
We would need to try, but I am pretty sure that an error will be raised when using C-Blosc1 with chunks generated with C-Blosc2.
On the other hand, the way we are currently using the registered Blosc2 filter (ID 32026) is via a CFrame. The CFrame is more flexible than a regular chunk, and will allow the use of multidimensional metalayers, which should be useful for optimizing dataset reads.
After reflecting more on this, one path that we could follow is to add a check in the current HDF5 Blosc1 filter so that, when an error is detected, it checks whether what we are decompressing is a CFrame, and if so, calls the actual Blosc2 filter. BTW, we have a preliminary version of an HDF5 Blosc2 filter at https://github.com/PyTables/PyTables/tree/direct-chunking-append/hdf5-blosc2/src, and one can use this for the standalone future hdf5-blosc2 filter.
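To make the idea concrete, here is a minimal sketch of how such a dispatch could look in the filter's decompression path. The function name and the assumption that the CFrame holds a single Blosc2 chunk are mine, not the actual filter code:

```c
#include <stdint.h>
#include <blosc2.h>  /* assumes the filter is built against c-blosc2 */

/* Sketch: try the Blosc1 format first; on failure, probe for a Blosc2
 * CFrame and decompress it instead. */
static int decompress_chunk(void *inbuf, int64_t inlen,
                            void *outbuf, int32_t outlen) {
  /* blosc1_decompress() is c-blosc2's name for the backward-compatible
   * Blosc1 API; it handles classic Blosc1 chunks. */
  int status = blosc1_decompress(inbuf, outbuf, (size_t)outlen);
  if (status > 0)
    return status;  /* plain Blosc1 chunk: done */

  /* Not a Blosc1 chunk: check whether the buffer is a Blosc2 CFrame. */
  blosc2_schunk *schunk =
      blosc2_schunk_from_buffer((uint8_t *)inbuf, inlen, false);
  if (schunk == NULL)
    return -1;  /* neither format: report a decompression error */

  /* Assumes the HDF5 chunk was stored as a CFrame with one Blosc2 chunk. */
  status = blosc2_schunk_decompress_chunk(schunk, 0, outbuf, outlen);
  blosc2_schunk_free(schunk);
  return status;
}
```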
I did a very quick test (no shuffle/default compressor) updating hdf5plugin to compile hdf5-blosc with c-blosc2, and indeed it raises an error: `OSError: Can't read data (Blosc decompression error)`
If you want to take the path of switching to c-blosc2 sooner or later (I expect it is better in terms of maintenance) and break forward compatibility, maybe it would be good to add a check of the Blosc format version in the filter, to prepare for forward-compatibility breaks and provide a more explicit message in this case.
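For instance, something along these lines could produce a clearer message. This is only a sketch: it relies on the first byte of a plain Blosc chunk header being the binary format version (a Blosc2 CFrame has a different header) and on `BLOSC_VERSION_FORMAT` from the blosc.h the filter was built against:

```c
#include <stdio.h>
#include <stdint.h>
#include <blosc.h>  /* provides BLOSC_VERSION_FORMAT of the linked library */

/* Sketch: inspect the format-version byte of the chunk header before
 * attempting to decompress, so forward-compatibility breaks are explicit. */
static int check_blosc_format(const uint8_t *chunk) {
  uint8_t format_version = chunk[0];  /* first header byte = format version */
  if (format_version > BLOSC_VERSION_FORMAT) {
    fprintf(stderr,
            "Blosc filter: chunk uses format version %d, but this filter "
            "only supports up to %d; it was likely written with a newer "
            "Blosc library.\n", format_version, BLOSC_VERSION_FORMAT);
    return -1;
  }
  return 0;
}
```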
Great to see an HDF5 Blosc2 filter coming up!
After pondering a bit more about Blosc/Blosc2 compatibility, I think a better approach is to make the two filters totally separate. So the current hdf5-blosc will continue supporting just the C-Blosc 1.x series, while future hdf5-blosc2 will support just C-Blosc 2.x series. Also, having separate HDF5 Filter IDs will help in this.
It would be convenient if we could decompress Blosc1 with c-blosc2 though.
> It would be convenient if we could decompress Blosc1 with c-blosc2 though.
From what I tested, the hdf5-blosc1 filter compiled with c-blosc2 can decompress chunks compressed with c-blosc1 (it is backward compatible)... but not the other way around.
Any updates on this? There seem to be some Blosc2 plugins available (e.g. PyTables), but none support arbitrary filters as far as I can tell. I need BYTEDELTA to get a good compression ratio, and I really want to be able to read/write HDF5 datasets and specify the Blosc2 filters used to compress.
See https://www.hdfgroup.org/wp-content/uploads/2022/05/Blosc2-and-HDF5-European-HUG2022.pdf
I've seen that before; how does that answer my question? Is HDF5 going to adopt that proposal?
Have you seen https://github.com/Blosc/b2h5py ?
I'm not sure if I understand your question. The Blosc2 filter has been registered as ID 32026. There's nothing more for The HDF Group to do.
b2h5py is mostly out of scope for me. To be clear:
- I want SHUFFLE + BYTEDELTA blosc2 compression
- I want to be able to store the compressed data in hdf5 datasets
- I want to generate these datasets from C/C++
- I want it to be easy to read from C/python i.e. I don't want to have to read a uint8_t buffer from hdf5 and then call blosc2 to decompress it
You mention the plugin 32026, where is the authoritative implementation of that plugin?
- It's not in the hdf5 repo
- There's https://github.com/oscargm98/hdf5-blosc2 but that's 32002
- There's the implementation in PyTables, which is 32026 but only allows setting one filter (a two-filter setup is sketched below): `cparams.filters[BLOSC_LAST_FILTER] = doshuffle;`
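For reference, the Blosc2 pipeline is an array of slots, so chaining SHUFFLE with BYTEDELTA would look roughly like this sketch against the c-blosc2 API (`BLOSC_FILTER_BYTEDELTA` comes from the filters registry; `make_cparams` is just an illustrative helper):

```c
#include <blosc2.h>
#include "blosc2/filters-registry.h"  /* BLOSC_FILTER_BYTEDELTA */

/* Sketch: compression params chaining SHUFFLE and BYTEDELTA. */
static blosc2_cparams make_cparams(int32_t typesize) {
  blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
  cparams.typesize = typesize;
  /* Filters run in slot order, so SHUFFLE is applied before BYTEDELTA. */
  cparams.filters[BLOSC_LAST_FILTER - 1] = BLOSC_SHUFFLE;
  cparams.filters[BLOSC_LAST_FILTER] = BLOSC_FILTER_BYTEDELTA;
  return cparams;
}
```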
I guess I could compress data with blosc2 myself and write it with `H5DOwrite_chunk`, trusting that 32026 will just decompress it.
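For anyone else trying this route, here is a rough, untested sketch of the idea, assuming a chunked dataset `dset` created with the Blosc2 filter (ID 32026), a chunk starting at `offset`, and `cparams` set up as in the previous sketch. `H5Dwrite_chunk` is the newer equivalent of `H5DOwrite_chunk`:

```c
#include <stdlib.h>
#include <hdf5.h>
#include <blosc2.h>

/* Sketch: compress one chunk's worth of data as a Blosc2 CFrame and
 * write it directly, bypassing the HDF5 filter pipeline. */
static herr_t write_blosc2_chunk(hid_t dset, const hsize_t *offset,
                                 const void *data, int32_t nbytes,
                                 blosc2_cparams *cparams) {
  blosc2_storage storage = BLOSC2_STORAGE_DEFAULTS;
  storage.cparams = cparams;
  storage.contiguous = true;  /* serialize as a CFrame */

  blosc2_schunk *schunk = blosc2_schunk_new(&storage);
  blosc2_schunk_append_buffer(schunk, data, nbytes);

  uint8_t *cframe = NULL;
  bool needs_free = false;
  int64_t cframe_len = blosc2_schunk_to_buffer(schunk, &cframe, &needs_free);

  /* Filter mask 0 tells HDF5 the data already went through all filters. */
  herr_t err = H5Dwrite_chunk(dset, H5P_DEFAULT, 0, offset,
                              (size_t)cframe_len, cframe);
  if (needs_free)
    free(cframe);
  blosc2_schunk_free(schunk);
  return err;
}
```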
Ok, I tried what I suggested above, but I get an error on decompression because `dparams.schunk = NULL`. If I set `dparams.schunk` to point to the schunk then it decompresses correctly. Specifically, I believe there should be an assignment of `dparams.schunk = schunk;` immediately after this line.
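In context, the proposed fix would look roughly like this (a sketch, not the actual filter source; `schunk` is the super-chunk deserialized from the CFrame):

```c
blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;
dparams.schunk = schunk;  /* proposed fix: let the context reach the schunk */
blosc2_context *dctx = blosc2_create_dctx(dparams);
```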
@froody You are right that, with the current API, we cannot use the full functionality of the Blosc2 pipeline inside HDF5. The solution would be to use the `cd_values` in the HDF5 API in a more imaginative way, but that requires thought and careful execution so as to avoid collisions with existing conventions for storing metainfo in `cd_values` (e.g. n-dim info).
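Purely to illustrate what a "more imaginative" use of `cd_values` might look like, here is a hypothetical sketch; the slot layout (slots 6-7 carrying a filter pipeline) is invented for this example and is not an existing convention of the filter:

```c
#include <hdf5.h>
#include <blosc2.h>
#include "blosc2/filters-registry.h"

/* Hypothetical convention: pass the desired Blosc2 filter pipeline in
 * cd_values slots after the ones the filter reserves for its metainfo. */
static herr_t set_blosc2_pipeline(hid_t dcpl) {
  unsigned cd_values[8] = {0};
  /* slots 0-5 would keep the filter's existing metainfo conventions */
  /* invented slots 6-7: a two-stage pipeline, SHUFFLE then BYTEDELTA */
  cd_values[6] = BLOSC_SHUFFLE;
  cd_values[7] = BLOSC_FILTER_BYTEDELTA;
  return H5Pset_filter(dcpl, 32026, H5Z_FLAG_OPTIONAL, 8, cd_values);
}
```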
Meanwhile, I am glad that you figured out the best workaround, i.e. using direct chunking in HDF5 (via `H5DOwrite_chunk`). FWIW, and if others are reading this, you can use direct chunking in h5py as well. Here you have an example where h5py is using grok for compressing with JPEG2000 via Blosc2: https://gist.github.com/t20100/80960ec46abd3a863e85876c013834bb
> You mention the plugin 32026, where is the authoritative implementation of that plugin?
It's in PyTables: https://github.com/PyTables/PyTables/tree/master/hdf5-blosc2/src
It's also embedded in hdf5plugin for use with h5py. Though, as already mentioned, to have access to all Blosc2 features when writing, one has to use HDF5 direct chunk write.
I think we can close this.