ZFP Compression

Open rabernat opened this issue 7 years ago • 62 comments

I just learned about a new compression library called ZFP: https://github.com/LLNL/zfp

zfp is an open source C/C++ library for compressed numerical arrays that support high throughput read and write random access. zfp also supports streaming compression of integer and floating-point data, e.g., for applications that read and write large data sets to and from disk.

zfp was developed at Lawrence Livermore National Laboratory and is loosely based on the algorithm described in the following paper:

Peter Lindstrom
"Fixed-Rate Compressed Floating-Point Arrays"
IEEE Transactions on Visualization and Computer Graphics
20(12):2674-2683, December 2014
doi:10.1109/TVCG.2014.2346458

zfp was originally designed for floating-point arrays only, but has been extended to also support integer data, and could for instance be used to compress images and quantized volumetric data. To achieve high compression ratios, zfp uses lossy but optionally error-bounded compression. Although bit-for-bit lossless compression of floating-point data is not always possible, zfp is usually accurate to within machine epsilon in near-lossless mode.

zfp works best for 2D and 3D arrays that exhibit spatial correlation, such as continuous fields from physics simulations, images, regularly sampled terrain surfaces, etc. Although zfp also provides a 1D array class that can be used for 1D signals such as audio, or even unstructured floating-point streams, the compression scheme has not been well optimized for this use case, and rate and quality may not be competitive with floating-point compressors designed specifically for 1D streams.

zfp is freely available as open source under a BSD license, as outlined in the file 'LICENSE'. For more information on zfp and comparisons with other compressors, please see the zfp website. For questions, comments, requests, and bug reports, please contact Peter Lindstrom.

It would be excellent to add ZFP compression to Zarr! What would be the best path towards this? Could it be added to numcodecs?

rabernat avatar Oct 15 '18 14:10 rabernat

It looks like there are already some python/numpy bindings: https://github.com/seung-lab/fpzip

jhamman avatar Oct 15 '18 15:10 jhamman

Hi Ryan, to implement a new codec you just need to implement the numcodecs.abc.Codec interface:

https://numcodecs.readthedocs.io/en/latest/abc.html

...then register your new codec class with a call to register_codec():

https://numcodecs.readthedocs.io/en/latest/registry.html#numcodecs.registry.register_codec

If you want to just try this out as an experiment, you could knock up a codec implementation in a notebook or wherever; using the Python bindings for zfp, the codec implementation would be very simple. I'd suggest getting some benchmark results showing useful speed and/or compression ratio, then considering adding it to numcodecs if it looks promising.
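
For instance, a throwaway experiment along those lines might look something like the sketch below (illustrative only, using the fpzip bindings linked above; the exact signatures and options may differ from the released package):

import fpzip
import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class FpzipCodec(Codec):
    """Throwaway experiment wrapping the fpzip bindings as a codec."""

    codec_id = 'fpzip'

    def encode(self, buf):
        # fpzip expects a 3D/4D floating-point numpy array and returns bytes.
        return fpzip.compress(np.asarray(buf))

    def decode(self, buf, out=None):
        # A real codec should also copy into `out` when it is provided;
        # for a quick benchmark, returning the decompressed array is enough.
        return fpzip.decompress(bytes(buf))


register_codec(FpzipCodec)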

If this did look like a useful addition to numcodecs, there are a couple of things I noticed from a quick glance at the source for the Python bindings. First, it expects arrays of at least 3 and at most 4 dimensions; I don't know if this is a constraint that might be good to relax. Also, it converts data to Fortran order before compression, which will mean an extra data copy for most users, whose data is usually in C order. This may be a hard requirement of the zfp library, and so unavoidable, but it's just something to note.

alimanfoo avatar Oct 15 '18 16:10 alimanfoo

Digging deeper, it looks like fpzip does not support missing data: https://zfp.readthedocs.io/en/release0.5.4/directions.html#directions

This is a dealbreaker for now (at least for me), but it looks like it might be addressed soon.

rabernat avatar Oct 15 '18 19:10 rabernat

How do you handle missing data with Zarr currently?

jakirkham avatar Oct 18 '18 23:10 jakirkham

@rabernat I'm the author of the fpzip python bindings. I think you may be conflating zfp and fpzip. fpzip is a lossless codec while zfp isn't. Try the following:

import fpzip
import numpy as np

x = np.array([[[ np.nan ]]], dtype=np.float32)
y = fpzip.compress(x)
z = fpzip.decompress(y)
print(z)

Worth noting that fpzip is for 3D and 4D data though 1D and 2D data compression is supported via adding dimensions of size 1.
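
For example, a 2D array could be given a trailing singleton dimension before being handed to fpzip.compress (an untested sketch; which axis receives the singleton dimension may affect compression):

import fpzip
import numpy as np

image = np.random.rand(512, 512).astype(np.float32)   # 2D data
# Pad to 3D with a singleton dimension before compressing:
compressed = fpzip.compress(image.reshape(512, 512, 1))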

william-silversmith avatar Oct 19 '18 19:10 william-silversmith

Also, if you guys find fpzip interesting, I'd be happy to add more flexible support for Fortran vs C order. I don't think it's a hard constraint, it's just what I got working for my use case.

UPDATE: fpzip 1.1.0 respects C order arrays.

william-silversmith avatar Oct 19 '18 19:10 william-silversmith

I think you may be conflating zfp and fpzip. fpzip is a lossless codec while zfp isn't.

@william-silversmith, thanks for clearing this up. I'm going to blame the mix-up on @jhamman with this comment.

Thanks also for writing fpzip and sharing it with the community!

I did some benchmarking of fpzip vs. zstd on real ocean data. You can find my full analysis in this notebook. The results are summarized by these two figures, which compare the compression ratio vs. encoding / decoding time of the two codecs on two different ocean datasets, one with a land mask (i.e. nan's) over about 43% of the domain and one without.

encoding: [figure comparing compression ratio vs. encoding time for fpzip and zstd]

decoding: [figure comparing compression ratio vs. decoding time for fpzip and zstd]

It's interesting to note how zstd finds the missing data (encoded as nans) immediately; the compression ratios on the data with the mask are much higher. Since fpzip doesn't allow nans, I just filled with zeros. With fpzip, there are only very minor compression ratio differences between the masked and the unmasked arrays.

Based on this analysis, in terms of decoding time (which is my main interest), fpzip is nowhere close to zstd. To get fpzip to speed up encoding or decoding, I have to go out to precisions of < 20, which results in acceptable losses:

[figure: fpzip results at reduced precision]

The caveat is that this is all done on my laptop, so might not be very robust or representative. Please have a look at my notebook and see if I have done anything obvious wrong. (It should be fully runnable from anywhere.)

rabernat avatar Oct 21 '18 03:10 rabernat

Also, in case this was not clear, I am basically not getting any compression of either dataset with fpzip using the default precision:

>>> len(fpzip.compress(data_sst))/data_sst.nbytes
1.0000054012345678
>>> len(fpzip.compress(data_ssh))/data_ssh.nbytes
1.0000054012345678

Am I doing something wrong?

rabernat avatar Oct 21 '18 03:10 rabernat

Hi Ryan!

Thanks for the info. I'm glad you found it useful to evaluate fpzip! I think Dr. Lindstrom, whom I've corresponded with, would be happy that more people are looking at it.

I have a few comments that I'll just put into bullets:

  • fpzip does support NaNs, so you can use your land mask as-is. It's zfp that doesn't support missing values.
  • I think fpzip's efficacy depends a lot on the dataset. You might very well be justified in using zstd on your dataset. For my lab's 4D data (X, Y, Z, channel) generated during boundary detection on a 3D biomedical image, we saw gzip compress to 65%, zstd compress to 59%, and fpzip compress to 46% losslessly. My colleague, Nico Kemnitz, found that with some manipulations that cause a loss of one machine epsilon, we can compress to 56%, 50%, and 32% respectively on our benchmark. You can read more here: https://github.com/seung-lab/cloud-volume/wiki/Advanced-Topic:-fpzip-and-kempressed-Encodings
  • You're correct that fpzip's decompress time is greater than zstd's. I find that it often has near-symmetric compression and decompression rates. It tends to win by a lot on compression and lose by a bit on decompression compared with gzip and zstd.
  • The use case my lab is concerned with is that we are generating (without compression) 5 PB of floating point data. So we were less concerned with decode time than simply getting things crunched down.

I did a small test of some of the pangeo data using the following script and I think I must have injected a bug. I think the reason there's so little compression is because the tail of the compressed data are all zeros. Let me investigate this.... Apologies, this python library is pretty new and we haven't put it into production use yet so it's not 100% battle tested.

import fpzip
import numpy as np
import gcsfs
import pandas as pd
import xarray as xr

gcs = gcsfs.GCSFileSystem(project='pangeo-181919', token='anon')
# ds_ssh = xr.open_zarr(gcsfs.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt',
#                                gcs=gcs))

ds_llc_sst = xr.open_zarr(gcsfs.GCSMap('pangeo-data/llc4320_surface/SST',
                               gcs=gcs), auto_chunk=False)
data = ds_llc_sst.SST[:5, 1, 800:800+720, 1350:1350+1440].values

x = fpzip.compress(data)
print(x)

william-silversmith avatar Oct 22 '18 01:10 william-silversmith

I just updated fpzip to version 1.1.1 that should have the trailing zeros problem fixed. Give it a try again!

EDIT: I did a quick test and I saw the following results for the above script:

20736000 bytes raw 
13081082 bytes fpzip (63%)
17396243 bytes gzip -9 (84%)

william-silversmith avatar Oct 22 '18 16:10 william-silversmith

@rabernat, I'm the developer of fpzip (and zfp). Having worked a lot with climate scientists, I would certainly expect high-resolution ocean data like yours to compress quite well (definitely to less than 100%!). @william-silversmith, who has done a great job adding a Python interface to fpzip, reports a reduction to 63% after his latest fix, which seems more plausible. Perhaps you can give fpzip another try?

If you're willing to sacrifice some loss to within some acceptable error tolerance, then I would suggest checking zfp out. It's a far faster and more advanced compressor than fpzip.

lindstro avatar Nov 06 '18 03:11 lindstro

Hi @lindstro -- thanks for dropping in here. And most of all, thanks for your important work on compression! As a computational oceanographer, I recognize that this is really crucial for the future sustainability of high-resolution modeling.

I updated my analysis with the latest version of fpzip. Indeed it actually compresses the data now! 😉

Here is a figure which summarizes the important results. In contrast to my previous figure, I have eliminated the precision argument for fpzip and am using it as intended: lossless mode (directly comparable to zstd).

[figure: compression ratio vs. encoding/decoding speed for lossless fpzip and zstd]

In terms of compression ratio, fpzip beats zstd when there is no land mask (holds regardless of the zstd level). With a land mask, zstd is able to do a bit better. zstd is a lot faster, particularly on decoding, but that is not a dealbreaker for us. (50 MB/s is still fast compared to network transfer times.)

Based on this analysis, I think we definitely want to add fpzip support to numcodecs.

If you're willing to sacrifice some loss to within some acceptable error tolerance, then I would suggest checking zfp out. It's a far faster and more advanced compressor than fpzip.

Yes we would love to try zfp. Is there a python wrapper for it yet?

rabernat avatar Nov 07 '18 15:11 rabernat

Another point about fpzip: the compression is considerably better (a compressed-size ratio of 0.4 vs. 0.55) if I transpose the arrays from their native python / C order to Fortran order when I feed them in. @william-silversmith -- I notice you have added the order keyword to fpzip.decompress but not to fpzip.compress. Would it be possible to add support for compression of C-order arrays? This would allow us to avoid the manual transpose step at the numcodecs level.

rabernat avatar Nov 07 '18 15:11 rabernat

@alimanfoo - presuming we want to move forward with adding fpzip and zfp to numcodecs, what would be the best path? Would you want to use @william-silversmith's python package, or would you want to re-wrap the C code within numcodecs, as currently done for c-blosc? The latter approach would also allow us to use zfp without an independent python implementation. But it is a lot more work. Certainly not something I can volunteer for.

rabernat avatar Nov 07 '18 15:11 rabernat

If there are existing python wrappers then I'd suggest to use those, at least as a first pass - can always optimise later if there is room for improvement. PRs to numcodecs for fpzip and zfp would be welcome.

alimanfoo avatar Nov 07 '18 16:11 alimanfoo

@rabernat, thanks for rerunning your compression study. fpzip was developed back in 2005-2006, when 5 MB/s was fast. I've been meaning to rewrite and parallelize it, but I've got my hands full with zfp development.

Regarding transposition, clearly we'd want to avoid making a copy. fpzip and zfp both use the convention that x varies faster than y, which varies faster than z. So an array of size nx * ny * nz should have a C layout

float array[nz][ny][nx];

@william-silversmith's Python wrappers should ideally do nothing but pass the array dimensions to fpzip in the "correct" order and not physically move any data. zfp (but not fpzip) also supports strided access to handle arrays of structs without having to make a copy.
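
In numpy terms, a wrapper could simply reverse the shape of a C-ordered array when reporting dimensions, rather than transposing or copying the data (an illustrative sketch):

import numpy as np

# C-ordered array: the last axis (x) varies fastest in memory.
a = np.zeros((30, 20, 10), dtype=np.float32)   # logical shape (nz, ny, nx)

# fpzip/zfp list dimensions fastest-varying first, so a wrapper can reverse
# the numpy shape instead of physically moving any data:
nx, ny, nz = a.shape[::-1]
print(nx, ny, nz)   # 10 20 30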

As far as Python wrappers for zfp, we're just now starting on that and have an end-of-the-year milestone to deliver such a capability. The developer working on this has suggested using cffi. As I'm virtually Python illiterate, I'm not sure how that would interact with numcodecs. Any suggestions are welcome.

lindstro avatar Nov 07 '18 16:11 lindstro

For numcodecs it doesn't matter how you wrap zfp, as long as the zfp python module provides a compress() function that accepts any python object implementing the buffer protocol and returns a python object implementing the buffer protocol (e.g., bytes), and similar for decompress(). In both cases it should minimise memory copies, i.e., use the buffer protocol to read data directly from the buffers exposed by the python objects.

alimanfoo avatar Nov 07 '18 20:11 alimanfoo

Ok, so we are clearly in a "help wanted" situation.

Implementing the necessary wrappers / documentation / tests for fpzip and zfp in numcodecs would make an excellent student / intern project. It is a clearly defined task with plenty of examples to draw from. And it will have a big impact by bringing these codecs to a broader community via zarr. So if anyone knows of any students or other suitable candidates looking to get their feet wet in open source, please send them to this issue!

Also, I'm transferring this to the numcodecs issue tracker, where it clearly belongs.

rabernat avatar Nov 07 '18 20:11 rabernat

Just to elaborate a little, what we would like to do is implement a codec class for ZFP. The Codec interface is defined here. Once there is a Python package binding ZFP, then the codec implementation is very simple, basically the Codec.encode(buf) method implementation would just pass through to a zfp.compress(buf) function, and similarly the Codec.decode(buf, out=None) method would ideally just pass through to a zfp.decompress(buf, out) function.

There is a detail on the decode method: numcodecs supports an out argument which can be used by compressors that have the ability to decompress directly into an existing buffer. This potentially means that decompressing involves zero memory copies. So if a zfp package offered the ability to decompress directly into a buffer exposed by a Python object via the buffer interface, this could add an additional optimisation. However, most compression libraries don't offer this, so it is not essential. I.e., if zfp does not offer this, then in numcodecs, if an out argument is provided, we just do a memory copy into out. For an example, see the zlib codec implementation.
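
Putting the above together, a codec might look roughly like the sketch below (the zfp module and its one-shot compress()/decompress() signatures are assumptions here; the helpers come from numcodecs.compat, mirroring the zlib codec):

import zfp  # assumed Python binding exposing compress()/decompress()
from numcodecs.abc import Codec
from numcodecs.compat import ensure_contiguous_ndarray, ndarray_copy
from numcodecs.registry import register_codec


class ZFPCodec(Codec):
    codec_id = 'zfp'

    def encode(self, buf):
        # Expose the chunk via the buffer interface and hand it to zfp.
        buf = ensure_contiguous_ndarray(buf)
        return zfp.compress(buf)

    def decode(self, buf, out=None):
        buf = ensure_contiguous_ndarray(buf)
        dec = zfp.decompress(buf)
        # ndarray_copy copies into `out` if one was supplied (the same
        # pattern as the zlib codec); otherwise it returns `dec` as-is.
        return ndarray_copy(dec, out)


register_codec(ZFPCodec)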

One other detail, ideally a zfp Python binding would be able to accept any object exposing the Python buffer interface. We currently also require codecs to be able to handle array.array which in Python 2 needs a special case because it doesn't implement the buffer interface. But we can work around that inside numcodecs. I.e., it is fine if a zfp binding just uses the buffer interface, no special case needed for array.array in Python 2. E.g., this is part of the reason why there is some special case code for Python 2 within the zlib codec implementation.

Hth.

alimanfoo avatar Nov 08 '18 13:11 alimanfoo

We currently also require codecs to be able to handle array.array which in Python 2 needs a special case because it doesn't implement the buffer interface.

We can smooth over this a bit. Opened PR ( https://github.com/zarr-developers/numcodecs/pull/119 ) as a proof-of-concept.

jakirkham avatar Nov 08 '18 17:11 jakirkham

@alimanfoo, zfp does indeed decompress directly into a user-allocated buffer. Similarly, compression writes to a user-allocated buffer whose size zfp conservatively estimates for the user. Alternatively, the user can specify how much memory to use for the compressed data, and zfp will ensure that the compressed data fits within the buffer (with quality commensurate with buffer size).

I know very little about zarr, but my understanding is that it partitions arrays into equal-shape chunks whose compressed storage is of variable length. I'm guessing chunks have to be large enough to amortize the overhead of compression metadata (I see 1 MB or so recommended). zfp provides similar functionality but uses very small chunks (4^d values for d-dimensional arrays) that are each (lossily) compressed to a fixed number of bits per chunk (the user specifies how many bits; for 3D arrays, 1024 bits is common). I wonder if this capability could be exploited in zarr without having to rely on zfp's streaming compression interface.

lindstro avatar Nov 17 '18 00:11 lindstro

We've revamped how codecs handle data internally. Outlined in this comment. This should make it much easier to contribute new codecs. Would be great if those interested took a look and provided any feedback.

jakirkham avatar Nov 28 '18 00:11 jakirkham

@lindstro thanks for the information, very helpful.

Apologies if I have not fully understood the details in your comment, but you are right to say that zarr partitions an array into equal-shape chunks, and then passes the data for each chunk to a compression codec (in fact this can be a pipeline of codecs configured by the user, but typically it is just a compressor). The result of encoding the chunk is then stored, and this will be of variable length depending on the data in each chunk.

In the zarr architecture, ZFP would be wrapped as a codec, which means it could be used as the compressor for an array. So zarr would pass each chunk in the array to ZFP, and then store whatever ZFP gives it back. Zarr passes all of the data for a chunk to a codec in a single API call, so in general there is no need to use streaming compression, you can just do a one-shot encoding. Typically I find that chunks should be at a minimum 1 MB uncompressed size, usually I find upwards of 16 MB is better, depending somewhat on various factors like compression ratio and type of storage being used.

A compressor (or any other type of codec) is a black box as far as zarr is concerned. If a compressor like ZFP chooses to break the chunk down further into smaller pieces, that is an internal implementation detail. E.g., the Blosc compressor does something similar, it breaks down whatever it is given into smaller blocks, so it can then use multiple threads and compress blocks in parallel.

If it is possible to vary the size of chunks that ZFP is using internally, then this is an option you'd probably want to expose to the user when they instantiate the ZFP codec, so they could tune ZFP for a particular dataset.

Hth.

alimanfoo avatar Nov 28 '18 13:11 alimanfoo

@alimanfoo, let me try to clarify.

Whereas zarr prefers uncompressed chunks as large as 16 MB, zfp in its fixed-rate mode uses compressed chunks on the order of 8-128 bytes (a cache line or two), which provides for far finer granularity of access. Think of zfp as a compressed floating-point format for SSE vectors and similar. I was just thinking out loud whether a capability like that could be exploited by zarr, for example, if traversing a 3D array a 2D or 1D slice at a time, when only a very small subset of a 16 MB chunk is needed.

My reference to zfp streaming compression was meant in the sense of sequentially (de)compressing an entire array (or large portions thereof) in contrast to zfp's inline (de)compression capability, where tiny blocks (4^d scalars in d dimensions) are (de)compressed on demand in response to random access reads or writes to individual array elements.

Of course, one could use zfp as a black box codec with zarr to (de)compress large chunks, but my question had to do with whether zarr could benefit from fine-grained access to individual scalars or a few handfuls of scalars, as provided by zfp's own compressed arrays.

If not, then the best approach to adding zfp support to zarr would most likely be to use the Python bindings we're currently developing to zfp's high-level C interface, which is designed for infrequent (de)compression of the megabyte-sized chunks that zarr prefers.

lindstro avatar Nov 29 '18 06:11 lindstro

Many thanks @lindstro, very nice clarification.

Of course, one could use zfp as a black box codec with zarr to (de)compress large chunks, but my question had to do with whether zarr could benefit from fine-grained access to individual scalars or a few handfuls of scalars, as provided by zfp's own compressed arrays.

We had a similar discussion a while back, as the blosc compressor offers something analogous, which is the ability to decompress specific items within a compressed buffer, via a function called blosc_getitem. The discussion is here: https://github.com/zarr-developers/zarr/issues/40, see in particular comments from here: https://github.com/zarr-developers/zarr/issues/40#issuecomment-236844228.

The bottom line is that I think it could, in principle, be possible to modify numcodecs and zarr to leverage this kind of feature. However, it would require some reworking of the API layer between zarr and numcodecs and some non-trivial implementation work. I don't have bandwidth to do work in that direction at the moment, but if someone had the time and motivation then they'd be welcome AFAIC.

As I understand it, the type of use case that would make this feature valuable are where you are doing a lot of random access operations into small regions of an array, i.e., pulling out individual values or small clusters of nearby values. I don't have any use cases like that personally, but maybe others do, and if so I'd be interested to know.

All my use cases involve running parallel computations over large regions of an array. For those use cases, it is fine to decompress entire chunks, as it is generally only a few chunks around the edge of an array or region where you don't need to decompress everything, and the overhead of doing a little extra decompression than strictly needed is negligible. Occasionally I do need to pull out single values or a small sub-region, but that is generally a one-off, so again the extra overhead of decompressing entire chunks is easy to live with.

If not, then the best approach to adding zfp support to zarr would most likely be to use the Python bindings we're currently developing to zfp's high-level C interface, which is designed for infrequent (de)compression of the megabyte-sized chunks that zarr prefers.

This sounds like a good place to start.

alimanfoo avatar Nov 29 '18 10:11 alimanfoo

Regarding use cases, you'll have to excuse my ignorance of how zarr works and the use cases it was designed for, but here are a few that motivated the design of zfp and its small blocks:

  • zfp serves as a substitute for conventional multidimensional random-accessible arrays. zfp generally avoids the need to rewrite existing application code to traverse the arrays in an order best suited for the underlying data structure (e.g., one chunk at a time). Ideally, you substitute only C/C++ array or STL vector declarations with zfp array declarations while leaving the remaining code intact. Sometimes the traversal order can be hidden behind iterators, but it is common in C and even C++ to use explicit indexing and nested for loops for array computations, especially in stencil-based computations that require localized random access. Because zfp's blocks are so small, it does not matter much in what order the (multidimensional) array is accessed, whereas if you use large chunks on the order of 16 MB, you'll want to process all elements in a chunk before moving on to the next one. That is, if the traversal enters and exits a chunk many times, then you don't want to (de)compress the chunk each time, and you may not be able to afford caching many large uncompressed chunks. (Does zarr cache decompressed chunks?)

  • If chunks are much larger than the hardware cache size, then performance will suffer when making repeated accesses to a chunk, e.g., via kernel fusion, stencil operations, gathers and scatters between staggered grids, etc. The compressed chunk size in zfp is usually on the order of one L1 cache line, while a decompressed chunk is a few cache lines, allowing both compressed and uncompressed data to fit in L1 cache if the access pattern exhibits good locality. If chunks are as large as 16 MB, then the decompressed data has already been evicted from L1 and L2 cache once computation on the decompressed data is executed.

  • Some applications traverse subsets of arrays. Examples include 1D and 2D slices of a 3D array, data-dependent isocontours and integral paths, boundary layers for ghost data exchange in distributed settings, regions of interest and range queries (e.g., in visualization), subsampling (e.g., for data decimation), etc. If chunks are too coarse, then there's a lot of overhead associated with decompressing data that is not accessed.

  • zfp supports both read and write access. Only (uncompressed and cached) blocks that are modified need to be written back to compressed storage when evicted from zfp's software cache. If chunks are large, then a change of a single value in a chunk would trigger a lot of data to be compressed.

I do agree, however, that it is often possible to structure the traversal and computation in a streaming manner to amortize the latency associated with (de)compression. This, however, typically requires refactoring existing code.

lindstro avatar Dec 03 '18 23:12 lindstro

Thanks @lindstro, very interesting.

FWIW Zarr was designed primarily as a storage partner to the Dask array module, which implements a subset of the Numpy interface but as chunked parallel computations. It's also designed to work well either on a single machine (either multi-threaded or multi-process) or on distributed clusters (e.g., Pangeo-like clusters reading from object storage). With Dask, the sweet spot tends to be towards larger chunks anyway, because there is some scheduling overhead associated with each task. So there has not (yet) been any pressure to optimise access to smaller array regions.

But as I mentioned above it is conceivable that Zarr could be modified to leverage capabilities of ZFP or Blosc to decompress chunk sub-regions. So if someone has a compelling use case and wants to explore this then please feel free.

alimanfoo avatar Dec 03 '18 23:12 alimanfoo

Very interesting discussion. Thanks @lindstro for taking the time to explain ZFP in such detail!

I think it is important to distinguish between what ZFP calls "chunks" and what zarr calls "chunks". ZFP chunks are clearly intended to be small. Zarr chunks tend to be much larger. An important point is that zarr is focused on serialization: zarr chunks correspond to individual files. It makes no sense to store millions of tiny files, due to the overhead of opening a file. This overhead becomes even more severe when the "files" are actually objects in cloud storage (a primary use case for zarr). So we would never want a 1:1 correspondence between ZFP chunks and zarr chunks.

Instead, what we are talking about here is using ZFP as a compression layer for zarr chunks. There will probably be a huge number of ZFP chunks in one zarr chunk (i.e. file). All that really matters here is:

  • The compression ratio for the whole (~10 - 100 MB) zarr chunk
  • The speed of compression / decompression of the whole zarr chunk

Although zarr in its current form will clearly not be able to leverage all of the cool features of ZFP like random access to individual elements, the bottom line is that ZFP provides high compression ratios for multidimensional floating point array data. This alone is justification for exposing it via numcodecs.

At this point, it would be great to move beyond speculation and see some actual implementation! 😉

rabernat avatar Dec 04 '18 14:12 rabernat

Hey! I just wanted to announce that we (NCAR) might be able to help out with this, if @lindstro approves.

My colleague, Haiying Xu (@halehawk), already has some Python bindings for fpzip and zfp. They are currently in a private NCAR Github repository until we can iron out licensing issues, so I'll let @lindstro comment on whether he wants zfp/fpzip bindings in NCAR repositories and what license he is comfortable with. Regardless, I think that with a small bit of code refactoring (in addition to the licensing issues, if there are any), we are close to a zfp (and a second fpzip) Python package that can be used with some new codecs.

kmpaul avatar Dec 04 '18 18:12 kmpaul

Oh, and I also wanted to comment that @william-silversmith mentioned that fpzip was lossless...which is true, but kind of misleading. The fpzip library can do both lossy and lossless compression, while the zfp library can only (?) do lossy compression. A user-supplied switch to the fpzip library can enable lossless compression, or lossy compression at whatever level you want.

kmpaul avatar Dec 04 '18 18:12 kmpaul