Use of Zstd compression
I'm a little late to the party, but have been starting to look at using the Zstd library for compression in netCDF.
Am I misunderstanding something, or do I have to explicitly set the environment variable HDF5_PLUGIN_DIR to the location of the directory containing the filter for zstd prior to running an application that wants to use Zstd compression?
It seems like using Zstd instead of zlib is making me jump through lots of hoops instead of "just working" like zlib has/does... Thinking that maybe I am missing something...
I'm also not happy with the extra steps needed, and now there is an HDF5 function which allows us to control the filter path programmatically, which means we can solve this whole problem.
But for now, set HDF5_PLUGIN_DIR, or else you can accept the default plugin install and then you don't have to set anything. Unfortunately, I don't know the details for CMake, but for autoconf, I think you use --with-plugin-dir with no argument, and that will use the default location.
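For anyone landing here with the same question, the workaround looks something like this (HDF5 itself reads the HDF5_PLUGIN_PATH variable; the directory shown is a hypothetical install location):

```shell
# Hypothetical plugin directory; substitute wherever your build installed
# the filter shared libraries (lib__nch5zstd.so etc.).
export HDF5_PLUGIN_PATH=/usr/local/hdf5/lib/plugin
# Any netCDF/HDF5 application launched from this shell can now locate
# the zstd filter at runtime, e.g.:
#   ./my_netcdf_app
echo "plugins will be searched in: $HDF5_PLUGIN_PATH"
```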
I will try to swing around to this code and make this easier in a future release.
Make sure you also take a look at the quantize feature, which can really improve compression sizes and speeds.
The underlying problem is that we would need to create a list of compressors that are to be "built in" to both libnetcdf and possibly also libhdf5. Zstd is certainly a candidate for that status. But others -- libblosc, for example -- are also possible candidates. I frankly have no criteria on which to decide.
We decide collectively, based on our judgement of what is most useful and can be sustainably supported. Just as we decided to include zstandard. Criteria include:
- FOSS, available everywhere
- significant compression improvement (size or speed of compress/uncompress).
- compatibility with netcdf-java.
I was at the HDF5 workshop for particle physics teams, and they were all using lz4 because it was so much faster. So that's the next one I'll look at. Fortunately John provided a lz4 class for netcdf-java.
Recall that the CCR project exists to prototype and explore. So I would suggest that we deal with zstandard today, and if more compressors are to be built-in, we deal with that on a case-by-case basis, having thoroughly tested our ideas in CCR.
But the problems with zstandard are not in the API, but in the configure and initialization. We need to make that easier for Greg and others like him...
@DennisHeimbigner are you already using BLOSC or LZ4 for Zarr stuff?
We use BLOSC for zarr/nczarr. It is available but unused for HDF5.
or else you can accept the default plugin install and then you don't have to set anything.
Yes, I thought I was doing that, but still had to define the environment variable. Will look again to see what I was missing during build process to make this work.
Make sure you also take a look at the quantize feature, which can really improve compression sizes and speeds.
Yes, that is working and was very simple to get going. Thanks for all the work you all are doing on netCDF.
@gsjaardema I would be really interested in any final results you get for the new compression methods - that is, percent faster, or percent improvement in compressed size...
I haven't been able to get netCDF/HDF5 to find the plugins unless I specify the HDF5_PLUGIN_DIR at runtime. My build is using cmake. I will continue to try different variations of the build to see if I can get it to work...
What value do you assume for the default plugin location?
What value do you assume for the default plugin location?
The library is configured with the location of the local HDF5 plugin directory, and that is correctly echoed from nc-config and is in `libnetcdf.settings`:
--plugindir -> /root/src/seacas/lib/hdf5/lib/plugin
Plugin Install Prefix: /root/src/seacas/lib/hdf5/lib/plugin
Multi-Filter Support: yes
Quantization: yes
Logging: no
SZIP Write Support: yes
Standard Filters: deflate szip zstd bz2
ZSTD Support: yes
Parallel Filters: yes
If I just run my executable, it doesn't find the Zstd compression filter with 4.9.2 or with main. If I define:
HDF5_PLUGIN_PATH=/root/src/seacas/lib/hdf5/lib/plugin ../bin/io_shell --in_type generated --compress 4 --zstd 100x100x100 tmp-z04.g
Then it correctly finds the Zstd compression filter...
OK, so the way it should work (but apparently does not) is that if you keep your plugins in the directory you told configure, you should not have to set the environment var...
Based on a reading of docs/filters.md, it looks like you need to set the environment variable at runtime: (my highlighting)
The important thing to note is that at run-time, there are several cases to consider:
- HDF5_PLUGIN_PATH is defined and is the same value as it was at build time -- no action needed
- HDF5_PLUGIN_PATH is defined and has a different value from build time -- the user is responsible for ensuring that the run-time path includes the same directory used at build time, otherwise this case will fail.
- HDF5_PLUGIN_PATH is not defined at either run-time or build-time -- no action needed
- HDF5_PLUGIN_PATH is not defined at run-time but was defined at build-time -- this will probably fail
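The cases above can be sanity-checked from the shell before launching an application; a small sketch (the fallback directory is hypothetical -- use your build's plugindir):

```shell
# Sketch: see whether the zstd plugin is on the effective search path.
# The fallback directory below is hypothetical; substitute your plugindir.
plugindir="${HDF5_PLUGIN_PATH:-/usr/local/hdf5/lib/plugin}"
echo "searching for zstd plugin in: $plugindir"
if ls "$plugindir"/*zstd* >/dev/null 2>&1; then
  echo "zstd plugin found"
else
  echo "zstd plugin NOT found -- reads of zstd-compressed files will fail"
fi
```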
I can somewhat control this from my application, which does the writing, but then (based on minimal attempts) it looks like any downstream application that wants to read my file with zstd compression in it will also have to set the plugin path environment variable, or it will fail to read the file with `NetCDF: Filter error: undefined filter encountered`.
I would like to use zstandard or even some of the other filters, but currently I think I will be setting myself up for lots of complaints from people who write files with compressed variables and then try to read the file a day/week/month later and have no idea why it fails.
Ideally:
- zstd works like zlib does today (I realize will take time for distributions to update to the version that has this)
- I can somewhat control my toolchain and the netCDF that is used, but I can't make users define an environment variable for the tools they will be using. (Although since we use modules, I could add the setting to the module files...)
- A 4.9.3 or later installation would say "zstd compression not supported" or something similar if it tried to read a file and it didn't support zstd. The "undefined filter encountered" message isn't the best, although even outputting the filter id that couldn't be found would be an improvement in general, since you can't always know the name of a filter that you don't have installed... I could then grep an include file or the netCDF source or something to map the filter id back to what the filter actually is.
- If the plugin path is specified at build/install time, then it should be searched at runtime without defining HDF5_PLUGIN_PATH
This is not a complaint, I really appreciate the work that has gone into this and definitely want to use it...
Maybe I am misreading / misinterpreting item 4 in the docs/filters.md I quoted above, but that seems to require setting HDF5_PLUGIN_PATH at runtime no matter what...
Closely related issues: https://github.com/Unidata/netcdf-c/issues/2753
One problem we are up against is that if we use the H5PLxxx API then we have to conform to its loading semantics, which frankly I have not investigated. For example, if we add a new location to the HDF5 internal path list, one would assume it would be used the next time a plugin is required (i.e., the equivalent of nc_def_var_filter). It presumably has no effect on already loaded plugins, even if the new location is, say, at the front of the list and overrides some already loaded plugin.
Thanks for the reminder about H5PLprepend and others. I think ed or ward had also told me about that. I did a quick test and it does work and eliminates the need for the environment variable.
In Issue https://github.com/Unidata/netcdf-c/issues/2753, I appear to have promised a proposal, but it was never implemented.
I also note that we may have an option to completely bypass the HDF5 dynamic loading algorithm altogether. The H5Zregister function just tells HDF5 to be aware of some plugin and assumes that the caller (libnetcdf in this case) is responsible for loading it. Not sure how this interacts with the H5PLxxx API.
@gsjaardema I agree this has to be fixed. But how?
@edwardhartnett I'm not sure what the best solution is. Just looking at Zstd, the "easy" solution seems to be to treat it the same as zlib, quantize, shuffle, and szip -- compile it directly into the library and not rely on any plugin paths or other runtime loading. If it is there at build time, it is there at runtime.
This doesn't scale well, though: how do you handle blosc or Z123 or the next five ultimate compression libraries? So I think there is also the issue of how to handle plugins in general...
The difficulty for my usage is that I want to be able to query at my build time what capabilities are available in netCDF, HDF5, CGNS, matio, and maybe the other libraries I use and then decide in my code how to build my libraries and what capabilities to expose/support and then my libraries are used in other applications. If Zstd is advertised as supported by netCDF, then I should just be able to link with netCDF and support Zstd instead of having to wonder if something will happen at runtime that will cause Zstd to not be available.
There is enough difficulty in making sure the entire tool chain on multiple hosts will all have netCDF libraries that support Zstd, quantize, and other features without adding on the issue that this could all be tested at build/install time, but then fail at runtime...
If plugins are the way to do it, then I would like to have the plugin directory that I specify at build time to be searched at runtime without me having to specify anything at runtime. If something does change, then specifying HDF5_PLUGIN_PATH or some other environment variable is helpful and the capability to be able to add new capabilities through plugins is nice to have...
I think for Zstd since there is an explicit nc_def_var_zstandard(exoid, varid, file->compression_level); API function, it seems like it should be part of the netCDF library and if that function exists at build time, then it exists.
There is still the difficulty of using a new feature that does not exist in older versions of the library. Quantize is nice since it is done at write time and does not need to be supported by the applications reading the file. Compression is harder since it is needed at both write and read time, so the entire toolchain needs to be updated before it is safe to write files using it. And it is difficult since an older library doesn't even know what Zstd is, so it can't give a meaningful error message about a feature created after the reading library was installed... (We still get some random failures at times when a user's path points to an older netCDF application that doesn't know about netcdf-4 or some other feature that has existed for an eternity...)
So I am not sure my rambling has any solutions or recommendations in it. It is a hard problem, and for read/write libraries the problem is even harder since the need for the capability (zstd, netcdf-4) follows the file, which can move among multiple hosts and be consumed by applications that are not always under our control...
I want to be able to query at my build time what capabilities are available in netCDF, HDF5, CGNS, matio, and maybe the other libraries
Currently, the HDF5 API is not very good at exporting that info. NetCDF is doable. Do not know about the other libraries you mention.
I am going to try to start tackling this piecemeal. First, I want to see why the HDF5 default directory is not being used (re: comment https://github.com/Unidata/netcdf-c/issues/2937#issuecomment-2191859740)
Question: when you built libhdf5, did you set the option
--with-default-plugindir=/root/src/seacas/lib/hdf5/lib/plugin
@gsjaardema did you get this resolved?
I have just made a bunch of changes for next release to make this work a little better, and to document it. Hopefully that will make it easier for future users.
If there's no remaining problem, please close this issue.
I will try to look at it this week. Thanks for all the work you and Dennis did on this.
OK, I am trying main and just looking at configuration currently. The first configure seems to give correct plugin directory:
Plugins Enabled: yes
Plugin Install Dir: /Users/gdsjaar/src/seacas-plugin/lib/hdf5/lib/plugin
I then simply did a touch CMakeCache.txt; make and the resulting reconfigure gives:
✔ ~/src/seacas-plugin/TPL/netcdf/netcdf-c/build [main {origin/main}|✔]
16:11 $ touch CMakeCache.txt
✔ ~/src/seacas-plugin/TPL/netcdf/netcdf-c/build [main {origin/main}|✔]
16:12 $ make
-- Checking for Deprecated Options
CMake Warning at CMakeLists.txt:482 (message):
NETCDF_ENABLE_NETCDF4 is deprecated; please use NETCDF_ENABLE_HDF5
-- Defaulting to -DPLUGIN_INSTALL_DIR=/usr/local/hdf5/lib/plugin
-- Final value of -DPLUGIN_INSTALL_DIR=/usr/local/hdf5/lib/plugin
-- ENABLE_PLUGIN_INSTALL=YES PLUGIN_INSTALL_DIR=/usr/local/hdf5/lib/plugin
-- NETCDF_ENABLE_PLUGINS: ON
-- Found HDF5 version: 1.14.4
... deleted lines...
-- Installing: lib__nch5bzip2.dylib into /usr/local/hdf5/lib/plugin
-- Installing: lib__nch5zstd.dylib into /usr/local/hdf5/lib/plugin
And then the Configuration Summary:
Configuration Summary:
... deleted lines ...
Plugins Enabled: yes
Plugin Install Dir: /usr/local/hdf5/lib/plugin
So it looks like the plugin installation directory is not being persisted and gets reset on a subsequent reconfigure.
This is correct and expected behavior; expected in that I've just been working in that part of the code and, indeed, that is what the logic dictates should happen. I'm open to the discussion about having the cached value used if it is set!
Should it be cached?
If I have configured the CMake build and edit the CMakeCache.txt to, for example, change the build from RELEASE to DEBUG or enable testing, I don't expect my plugin directory to change when I build the code following that change, which is what is happening now...
The NETCDF_PLUGIN_INSTALL_DIR CMake variable is also somewhat confusing. It seems to get its value from checking an environment variable HDF5_PLUGIN_PATH if set, but has a default value of "YES" and a somewhat confusing doc string:
set(NETCDF_PLUGIN_INSTALL_DIR "YES" CACHE STRING "Whether and where we should install plugins; defaults to yes")
I have ended up with a directory named "YES" on some of the builds.
If I explicitly set NETCDF_PLUGIN_INSTALL_DIR to a location and don't set the HDF5_PLUGIN_PATH, then the build doesn't use my value...
Appreciate all the work being done in this area, but just giving some feedback on some non-intuitive (at least to me) behavior I am seeing.
Question: when you built libhdf5, did you set the option
--with-default-plugindir=/root/src/seacas/lib/hdf5/lib/plugin
I use the CMake build, so use:
-DH5_DEFAULT_PLUGINDIR:PATH=${INSTALL_PATH}/lib/hdf5/lib/plugin
I was not using this at the beginning, but started at some point in this process...