netcdf-c
Feature request - lz4 compression
This is a feature request. Compression is one of netCDF-4's main features. With the increase in disk speeds (SSDs are now the norm) and CPU performance, the current compression algorithm is starting to show its limitations. Even with fast disks and CPUs, compression/decompression speed rarely exceeds a few hundred MB/s. In our in-house climate model we do not use netCDF's compression capabilities, since they slow down the model significantly (10-20%), even at deflate level 1 and on fast hardware.
Some modern compression algorithms seem particularly well suited to netCDF use cases; the first that comes to mind is lz4, which is designed for fast compression/decompression. lz4 compresses several times faster than DEFLATE/lz77 (at the cost of somewhat lower compression ratios). Another possibility is lz4hc, a high-compression variant of lz4.
Benchmarks comparing lz77 and lz4 are all over the net; here is an example, and here is an extremely well-researched Stack Overflow answer with lots of useful links.
Thanks for taking this feature request into consideration.
Additional resources:
- Delaunay2018 paper
- Similar discussion for netCDF-Python
- Compression filters for HDF5
- NetCDF-C filter support
I believe we have the capability to use any HDF5 filter. So are those compression methods available as HDF5 filters?
It is my understanding (might be a bit naive) that any client supporting netCDF-4 should also support deflate (gzip / lz77) compression. Every program I know of includes this feature.
However, if compressing via an external HDF5 filter (such as the lz4 filter), the file will no longer be portable. External programs (e.g. cdo, nco, netCDF libraries for all programming languages, grads, ncl, ...) will return errors or garbled data. Is this correct?
If so, then I'm not sure allowing generic HDF5 filters will be able to move the industry towards a more modern compressing option, unfortunately.
Well that is a good point about portability. I am in favor of adding support for more compression options.
What parameters would be required? For example, the zlib compression uses an int to turn deflate on and off, and an int which is the deflate level.
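For reference, here is a sketch of the existing zlib interface (the nc_def_var_deflate call as currently documented; error handling shown in the usual netCDF style):

```c
/* Existing zlib API: shuffle and deflate are on/off flags,
 * deflate_level runs from 1 (fastest) to 9 (best compression). */
int nc_def_var_deflate(int ncid, int varid, int shuffle, int deflate,
                       int deflate_level);

/* Typical use: enable shuffle and deflate at level 1 for a variable. */
int ret;
if ((ret = nc_def_var_deflate(ncid, varid, 1 /*shuffle*/, 1 /*deflate*/, 1 /*level*/)))
    return ret;
```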
Another aspect is that requiring it everywhere means requiring that the compression library be installed on all installation platforms. Are there good libraries to support these compression algorithms?
Also, will either work with parallel I/O writes? If so, that would be of significant additional interest.
I am by no means an expert on this; I merely thought it would be a handy feature for many netCDF users. The reference implementation is in C (BSD license) and the GitHub page says it is "scalable with multi-cores CPU". It supports compression levels from 1 to 9 (like zlib), and it is my understanding that some additional tunables are available.
As for portability, as far as I know lz4 is nowadays one of the most common compression libraries. It is used by the Linux kernel, ZFS, BTRFS, Hadoop, etc. The lz4 webpage lists bindings for every language I can think of, and most if not all Linux distributions include lz4 by default.
As for support on parallel filesystems, maybe we can ask the lz4 developers themselves? @Cyan4973?
Sure, for information: lz4 is designed to be multi-threadable. Multiple threads can invoke lz4 compression and decompression at the same time; it will work without issue. This setup already exists in ZFS, for example, in database engines like RocksDB, and in many other high-speed systems dealing with tons of small items in parallel.
Related requests have come up before. To date, we have not decided what to do about supporting extra compressors as part of our code base. I believe at one time the HDF5 group was considering providing a filter repository; they already maintain a registry of compression filter IDs. Ideally, we could provide a GitHub repo into which compressors could be stored, but it would be source only. There are also serious security concerns, since we (Unidata) would have to validate the archived code with respect to trojans. Perhaps this is something that could be handled by the rpm/yum/apt communities, at least for Linux.
Extra compression needs to be supported. Compression is a strong feature of netCDF-4, perhaps the number one cited feature for users adopting netCDF-4.
NOAA is also very, very interested in automatic compression. NOAA generates a LOT of data, and the cost of compressed vs. uncompressed is a very significant amount of money. Better compression is always desired. (And soon I will be taking a look at HDF5's new ability to use deflate with parallel I/O writes.)
Over the years I have had a lot of requests to support extra compression filters, so I am excited to make use of the work of @DennisHeimbigner to somehow make that happen. lz4 seems like a good place to start.
Even if lz4 is eventually required for all netcdf-4 installs, we don't have to start that way. Let's have an --enable-lz4 option, which causes lz4 support to be added to netcdf-4.
Installations that build without lz4 will not be able to read files produced with it. This is the same for szip right now.
I could modify the meaning of the deflate parameter here: https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf-c/nc_005fdef_005fvar_005fdeflate.html
Instead of any non-zero value indicating zlib deflation, we could define non-zero constants NC_DEFLATE_ZLIB and NC_DEFLATE_LZ4.
What does everyone think? Worth trying?
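Roughly, the change might look like this (a sketch only; the constant names and values are hypothetical and not part of any current header):

```c
/* Hypothetical constants for the proposal above -- values are illustrative. */
#define NC_DEFLATE_ZLIB 1
#define NC_DEFLATE_LZ4  2

/* The deflate argument would select the compressor rather than being a
 * simple on/off flag; 0 would still mean "no compression". */
nc_def_var_deflate(ncid, varid, 1 /*shuffle*/, NC_DEFLATE_LZ4, 1 /*level*/);
```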
Once this is all working, if it seems like a good idea, we could require it, the same way we require zlib. But let's make that a separate decision as I believe it will be controversial. How about we add support for this useful feature for users that want it, while we argue out the case of whether or not everyone needs it.
This is not an uncommon feature request, and is one that definitely requires a broader discussion within the user community (and internally) regarding the technical debt Unidata would be incurring.
With our upcoming emphasis on multiple storage options (as opposed to the current "native netCDF or HDF5-based storage" choice), we have a much broader picture to consider as well; handling 'which storage formats support which compression routines' is something we're going to have to address and, as always, it will probably end up being trickier than anticipated. Or perhaps it won't be; I'm ever the optimist :). But still.
Which is not to say the Unidata netCDF team is against adopting additional native compression support. There is just a lot to consider. We also have a pretty full workload with the extant dev roadmap; I'm going to tag this as a future release issue so that we can revisit it and give it a proper target, once we have cleared our current slate. For the time being, it's probably better not to invest too much time into a PR in support of adding additional native compression routines. Dynamic filter support will have to fill the gap in the meantime; the tradeoff between portability and compression is going to have to be one made at the individual organization level.
@WardF I hope this issue remains under discussion and you are not ruling out progress here. I believe an lz4 addition can be done as discussed, without impact on the existing API or build for users that choose not to use it.
I’m not ruling it out and will leave it open, but it’s not on the immediate radar internally; I wanted to make that clear before too much work was put into it. We don’t have a lot of bandwidth for this at the moment, to the extent I haven’t had time to go sort out the lz4 options under Windows; I expect it’s no more difficult than zlib, but I won’t be able to worry about that for a while.
The basic addition of, say, lz4 is pretty straightforward: we just need to follow the bzip2 example in the plugins directory.
I would prefer not to have an --enable-lz4 configure flag because, as other compressors are added, we would end up with flag proliferation. A single --enable-filters= flag would be preferable.
But, as Ward notes, we need to decide what to do about the technical debt, e.g. maintenance and checking for security holes.
[Ed, maybe we should bring this up with the HDF5 people when they come to visit?]
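In the meantime, users who want LZ4 through the dynamic filter mechanism can try something along these lines (a sketch; 32004 is, to my knowledge, the HDF Group's registered filter ID for LZ4, and the compiled filter plugin must be discoverable via HDF5_PLUGIN_PATH on both the writer and the reader side):

```c
#include <netcdf.h>
#include <netcdf_filter.h>

#define H5Z_FILTER_LZ4 32004u  /* registered HDF5 filter ID for LZ4 */

/* Apply the LZ4 filter to a (chunked) netCDF-4 variable via the generic
 * filter API; no parameters are passed here, so the plugin's defaults apply. */
int ret;
if ((ret = nc_def_var_filter(ncid, varid, H5Z_FILTER_LZ4, 0, NULL)))
    return ret; /* e.g. NC_ENOFILTER if the plugin cannot be found */
```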
I'm happy to see that this issue started a discussion on this topic.
I've spoken with our resident modeller, who says that faster parallel compression/decompression would be very useful to us.
Additionally, he would be even more interested in applying GRIB-like lossy compression methods. After all, we rarely need to know that the temperature was 23.23465287 C. He says he implemented the GRIB complex packing algorithm in the netCDF library in the past and that it was quite straightforward. Additionally, the GRIB algorithms are already supported by the ECMWF ecCodes library (the one used by cdo, for example), so many of the building blocks for user applications are already in place.
Some references on the GRIB compression methods and comparisons with netCDF:
The python interface has lossy compression (http://unidata.github.io/netcdf4-python/#section9), as does the nco netcdf operators (https://www.geosci-model-dev.net/9/3199/2016/). All you need to do is quantize the data before applying the zlib filter.
I did not go into the details, but are you sure this is not just the usage of scale_factor and add_offset, which can already be applied transparently? They are very useful and work well, but only with data whose range is relatively small compared to the required precision (thus it works well with precipitation and temperature, for example). Also, with this method you are still limited to at least 8 bits for each value (NC_BYTE or NC_UBYTE).
No, this is not related to the use of scale_factor and add_offset. It involves truncating or quantizing the data to a certain level of precision (say 0.1 K for temperature) and then applying the zlib/shuffle filters. The last couple of papers you reference utilize this technique.
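For illustration, precision trimming before compression can be as simple as this (a minimal sketch of the general idea, not the actual BitGroom/quantize code used by nco or netcdf4-python):

```c
#include <math.h>
#include <stddef.h>

/* Round each value to a chosen precision (e.g. 0.1 K for temperature) so
 * that shuffle + deflate find much more redundancy in the byte stream.
 * The data remain ordinary floats; only the unneeded precision is discarded. */
static void quantize(float *data, size_t n, float precision)
{
    for (size_t i = 0; i < n; i++)
        data[i] = roundf(data[i] / precision) * precision;
}

/* Usage: quantize(temperature, nvals, 0.1f); then write the variable with
 * nc_def_var_deflate(ncid, varid, 1, 1, 1) as usual. */
```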
I took a look at this, and in the end it just truncates the data to reduce the amount of information the compression algorithm has to encode. Floats stay floats, doubles stay doubles, etc.
Some GRIB compression filters, to my understanding, are capable of reducing the number of bits used per value.
Anyway, this might be a discussion for another issue.
This issue should probably be closed. Here's some updates:
- The CCR project adds additional forms of compression for netCDF/HDF5 files: https://github.com/ccr/ccr
- This does not include LZ4, because the LZ4 filter was found to be broken: using it increased data size. The lz4 command-line tool works fine and compresses as expected, so this is surely an error in the LZ4 filter code.
- CCR does include the Zstandard compression filter, which yields the same compression ratio as zlib but is much faster for both reading and writing.
- CCR also includes the BitGroom filter, which enables additional (lossy) compression of floats/doubles by letting the user specify the number of significant digits to keep.
When (and if) the lz4 filter code is fixed, we will be happy to add it to CCR.
+1 thanks @edhartnett. Happy New Year!
@edhartnett Regarding "lz4 makes files bigger", I believe that issue is resolved. Please see thread https://github.com/HDFGroup/hdf5_plugins/issues/185
Awesome, I will take a look.
The hdf5_plugins were updated today. The update includes a fix for the small buffer size used in the lz4 example and a fix for an error that resulted in "decompressed size not the same:"
Request that this issue be reopened.
I re-opened it. What is the outstanding issue you wish to address?
I wish that LZ4 compression be available through NetCDF either directly or via the CCR package.
We initially tried to support LZ4 in CCR (much of the code is still there, just "turned off"). The codec bugs prevented us from doing this. The other CCR codecs (Bzip2 and Zstandard) eventually were migrated from CCR to libnetcdf. LZ4 would be complementary to the existing netCDF-accessible codecs because it has the fastest decompression speed of any reliably benchmarked major codec. According to https://facebook.github.io/zstd/, LZ4 reads about 2x as fast as Zstandard (interestingly, both were written by the same person, Yann Collet). @edwardhartnett wrote much of the netCDF support for Bzip2 and Zstandard. Ed, what are your thoughts on the matter?
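For reference, the migrated codecs are exposed in recent netcdf-c releases through wrapper calls like the ones below (assuming a build with the corresponding filter support; an LZ4 wrapper, if added, would presumably follow the same pattern):

```c
#include <netcdf_filter.h>

int ret;

/* Bzip2, compression level 1..9. */
if ((ret = nc_def_var_bzip2(ncid, varid, 9)))
    return ret;

/* Zstandard, e.g. level 3 (a common speed/ratio trade-off). */
if ((ret = nc_def_var_zstandard(ncid, varid, 3)))
    return ret;

/* A hypothetical LZ4 wrapper might look like
 * nc_def_var_lz4(ncid, varid, level) -- not part of the current API. */
```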
I'm traveling at the moment but will be happy to take a look at it when I return to Colorado next month.
@edhartnett Have you had a chance to give this some thought?