HDF5.jl
Files created with blosc compression can't be read by Python and R
This is likely off-topic, but it's probably still helpful for someone to have it documented.
I created a file with HDF5.jl using the default compression (blosc) at level 3. However, I was unable to read this file with Python (using the `h5py` library) or with R (using the `rhdf5` package, which produced slightly more informative error messages). I had the same result on Ubuntu and Manjaro Linux.
Using `deflate` compression solved this issue. Given these compatibility problems, I was wondering whether `blosc` is a good default choice. A sketch of the two variants is below.
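For concreteness, here is roughly the pattern involved; a minimal sketch only, where the keyword names are assumptions and may differ between HDF5.jl versions:

```julia
using HDF5

A = rand(1000, 1000)

h5open("test.h5", "w") do f
    # Blosc-compressed dataset (level 3); h5py/rhdf5 cannot read this
    # without the Blosc HDF5 plugin:
    f["A_blosc", chunk=(100, 100), blosc=3] = A
    # Deflate (zlib) at level 3; readable by essentially any HDF5 build:
    f["A_deflate", chunk=(100, 100), deflate=3] = A
end
```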
Not off-topic at all. I know little about this myself (CCing @stevengj). My understanding is that `blosc` is quite a lot faster than `deflate`, and that this matters a lot for large datasets. My experience with Matlab (which turns on compression by default) is that storing large datasets is painful, so much so that I tried to fix it: http://www.mathworks.com/matlabcentral/fileexchange/39721-save-mat-files-more-quickly.

If we change the default, rather than switching to `deflate` I'd rather go back to no compression at all. But I suppose the alternative is to contact the `h5py` and `rhdf5` developers and try to get them to support `blosc`?
CC @andrewcollette. Not quite sure who to CC on the rhdf5 side.
As described in #174, blosc compression incurs a 2× slowdown or less (and is often even faster than uncompressed HDF5 for highly compressible data). In contrast, deflate incurs slowdowns from 10× to 1000×.
Unfortunately, Blosc is not yet bundled with HDF5 by default, so Blosc-compressed HDF5 files are not readable by other programs unless they also link to the Blosc library and enable the Blosc HDF5 plugin from https://github.com/Blosc/hdf5. But we don't blosc-compress files unless you explicitly request it (although it might be reasonable to enable it by default for JLD files: #178). You can always specify `deflate` if you want.
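If you want to measure the trade-off on your own data, here is a rough sketch (illustrative only; the keyword names are assumptions that depend on the HDF5.jl version):

```julia
using HDF5

# Highly compressible data is where blosc shines most.
A = repeat(rand(1000), 1, 2000)

h5open("timing.h5", "w") do f
    @time f["raw"] = A                                   # no compression
    @time f["blosc", chunk=(1000, 100), blosc=3] = A     # blosc, level 3
    @time f["zlib", chunk=(1000, 100), deflate=3] = A    # deflate, level 3
end
```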
@FrancescAlted, have you made any progress on getting Blosc incorporated into h5py or other HDF5 wrappers, or better yet into HDF5 itself?
@timholy, the default should be no compression; has this changed? I think @scheidan was only asking about the default when you specify `compress`.
Gotcha
I'm not that familiar with Blosc, but there is some good news on the HDF5 front... recent versions of HDF5 include a "dynamically loaded filter" capability. So if the Blosc HDF5 filter is compiled into a shared library and put in the appropriate directory, it can be loaded by HDF5 automatically. Since this happens at the C level, it would be compatible with Python/R/whatever.
Here's the original RFC:
https://www.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf
I've been meaning to do this with the LZF filter but haven't had the time.
Yep, we use that: https://github.com/timholy/HDF5.jl/blob/master/src/blosc_filter.jl
@timholy, we aren't using that feature. We are loading our filter manually. What @andrewcollette is referring to would remove the need for `blosc_filter.jl` entirely, because when HDF5 is initialized it would load some `blosc_filter.so` file automatically. (It is still convenient to have the pure-Julia version in our case, however, because it eliminates a lot of headaches with building and installation.)
(One thing we should be careful of is to avoid having `blosc_filter.jl` conflict with an auto-loaded filter; I haven't looked into this.)
It looks like https://github.com/Blosc/hdf5 already implements the required API functions. So @scheidan just needs to compile it into a shared library and install it into /usr/local/hdf5/lib/plugin (assuming you have `libblosc.so` or its equivalent installed appropriately).
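Something along these lines should do it from Julia; this is only a sketch, and the clone directory, build commands, and output library name are all assumptions (check the Blosc/hdf5 README for the real steps):

```julia
# Build the Blosc HDF5 plugin and copy it into HDF5's plugin search
# path. Run from a scratch directory; may need root for the default
# install location.
run(`git clone https://github.com/Blosc/hdf5 blosc-hdf5`)
cd("blosc-hdf5") do
    run(`cmake -S . -B build`)    # assumes libblosc is findable by cmake
    run(`cmake --build build`)
end

# HDF5 also honors the HDF5_PLUGIN_PATH environment variable.
plugindir = get(ENV, "HDF5_PLUGIN_PATH", "/usr/local/hdf5/lib/plugin")
mkpath(plugindir)
cp(joinpath("blosc-hdf5", "build", "src", "libH5Zblosc.so"),
   joinpath(plugindir, "libH5Zblosc.so"); force=true)
```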
(Okay, I just double-checked the HDF5 source code, and it looks like `blosc_filter.jl` will correctly take precedence over any `blosc_filter.so` file in the search path; it only searches for a shared-library plugin in cases where the desired filter is not already registered.)
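A quick way to check from Julia which case you're in; the wrapper location `HDF5.API` is an assumption, since the low-level bindings have moved around between HDF5.jl versions:

```julia
using HDF5

# Blosc's filter ID as registered with the HDF Group.
const H5Z_FILTER_BLOSC = 32001

# h5z_filter_avail wraps the C call H5Zfilter_avail, which reports
# whether a filter is currently registered with libhdf5.
if HDF5.API.h5z_filter_avail(H5Z_FILTER_BLOSC)
    println("Blosc filter is registered (Julia filter or plugin)")
else
    println("Blosc filter not registered; HDF5 would search for a plugin")
end
```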
Yes, I think the suggestion by Andrew of using the plugin method would work best for your needs. The Blosc HDF5 repo should support this already, see:
https://github.com/Blosc/hdf5/blob/master/src/blosc_plugin.c
Hope this helps.
Thanks everyone for looking into that!
For my case, using `deflate` is the simplest option (speed is not critical, but the files should be readable on several different machines).
Would it make sense to add a link to this issue in the documentation?
Better than a link, please just edit the README to note that `deflate` is more portable. https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#improving-documentation
"More portable" is an understatement. zlib compression is supported in basically every HDF5 installation (except where explicitly turned it off), while Blosc seems to be e.g. almost unknown in the HPC community (that often uses HDF5 as preferred storage format). If I write an HDF5 dataset with "compression", I have a very specific idea of what that means -- namely, the default compression mechanism in HDF5 that is transparently decompressed on every other system supporting HDF5.
I'm not speaking about JLD; that is a rather Julia-specific format that can and probably should push the envelope. But HDF5 is marketed for its portability and archivability, and people have come to expect this. I think there's about one new "interesting" compression library every five years: will Blosc still be supported on all platforms in five years, at least in a way that keeps Blosc-compressed files readable? HDF5 users will expect this.
I would make "compress" use the built-in HDF5 compression by default. If this is slow, then I'd work with the HDF5 developers to improve this, and provide an option to work around this. An incompatible default makes me uncomfortable.
Yes, Erik made very good points here. If what you want is archivability and compatibility with the standard HDF5 library, then you have to use what it supports by default (zlib and szip). If you want performance, then Blosc (or any other for that matter) could make more sense.
Regarding whether Blosc will be supported in five years: that's really a question of general open-source maintainability. Blosc as it is now (aka Blosc1) is mainly in maintenance mode, while new improvements are being moved to Blosc2. Also, Blosc closely follows C89, so I would say that maintaining it should be cheap enough for the years to come.
To install HDF5, Blosc, and the HDF5 Blosc plugin, you can use e.g. Spack (https://github.com/LLNL/spack). The command `spack install hdf5-blosc` should install all three.