xbitinfo icon indicating copy to clipboard operation
xbitinfo copied to clipboard

Adding the python bitinfo to numcodecs

Open thodson-usgs opened this issue 1 year ago • 15 comments

I've been looking into modifying the numcodecs bitround codec to accept a user defined function to determine the number of bits to round. namely xbitinfo.bitinformation. This would streamline the process of chunk-wise bitrounding to something like

from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)

filters = [BitRound(custom_function())]

encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)

where custom_function would essentially wrap get_bitinformation and get_keepbits. All in all, that may not offer much over your current chunk-wise approach, except enabling us to use bitinformation in pangeo-forge compression and rechunking pipelines.

Alternatively, we could add some stripped down bitinfo implementation to numcodec and avoid the need for custom_function(). I'd be happy to help with that, but I don't want to advance that without permission from the xbitinfo team. Both projects have MIT license.

thodson-usgs avatar Jan 30 '24 15:01 thodson-usgs

Sounds great to me, I let Hauke comment though.

milankl avatar Jan 30 '24 16:01 milankl

Hi @thodson-usgs, Thanks for reaching out! I like your proposal. The challenges I can see and which also emerge from your issue at numcodecs is that we would like to have as few implementations as possible to keep the algorithms maintainable as we develop them further. I therefore favour your custom_function suggestion over a new implementation into numcodecs. It should also be noted also that the custom_function needs only to be applied in the writing process and not in the reading process. The filters argument in the output file could therefore remain the same. In this case the user would not need to have xbitinfo or any other dependencies installed that provide custom_function.

observingClouds avatar Jan 30 '24 22:01 observingClouds

Exactly right, of course. I'll proceed with custom_function for now and see where we get.

Along that tack, a possible outcome might be to implement such a function in xbitinfo, or at the very least add an example construction to numcodecs doc.

filters = [BitRound(xbitinfo.helper_function)]

thodson-usgs avatar Jan 30 '24 22:01 thodson-usgs

https://github.com/zarr-developers/numcodecs/pull/503#issuecomment-1919916377 Unfortunately, I'm not sure this will work.

I'm by no means the expert, but as I look through the respective code bases, a reimplementation might be the simplest option, and could refer users to xbitinfo for the lastest and greatest version. Even if passing custom_function did work, xbitinfo's whole implementation is really built to leverage xarray. In general, I like that design choice. However, for this purpose, I'm finding myself trying to unravel some of that into a simpler function that operates on arrays. I didn't double check, but the only dependency might be numpy.

thodson-usgs avatar Jan 31 '24 21:01 thodson-usgs

...true, this might reduce direct traffic to xbitinfo, but I also think it's a valuable step into mainstream adoption.

...or numcodecs might accept xbitinfo as an optional dependency, allowing us to import xbitinfo within the bitround codec. I haven't advanced that idea yet.

thodson-usgs avatar Jan 31 '24 21:01 thodson-usgs

@martindurant, I'm bringing you into this discussion with the xbitinfo team before continuing on the numcodecs side.

The two options I'd like both camps to mull over are:

  1. Reimplementing a basic bitinfo algorithm in numcodecs. I think the only dependency is numpy.
  2. Having xbitinfo as an optional dependency in the bitround codec.

thodson-usgs avatar Jan 31 '24 21:01 thodson-usgs

Rather than including this in bitround, lets create a bitinfo codec that calls bitround.

Here's my draft proposal: https://github.com/thodson-usgs/numcodecs/blob/adaptive-bitrounding/numcodecs/bitinfo.py

(ah, still a couple bugs here, but you'll get the gist)

thodson-usgs avatar Feb 02 '24 17:02 thodson-usgs

Okay works now. Going for multithread.

thodson-usgs avatar Feb 02 '24 18:02 thodson-usgs

Going for multithread.

Zarr is commonly used with dask, so it should be enough to have your algorithms not hold the GIL

martindurant avatar Feb 02 '24 18:02 martindurant

Thanks @thodson-usgs for all your work! If I understand it correctly you copy/reimplement part of the xbitinfo package to add to a numcodec algorithm. Can't we find a different solution, where e.g. the xbitinfo package provides an entry point to numcodecs? This will require xbitinfo to get installed to provide the codec but as mentioned above this should only be necessary for users who write data and not for those that read it and we make it easier for new features to be implemented across the board.

observingClouds avatar Feb 02 '24 20:02 observingClouds

I sense that could get complicated, but I defer to the respective maintainers for guidance.

Currently,

import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature")

from numcodecs import Blosc, BitInfo
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

encoding = {"air": {"compressor": compressor, "filters": filters}}

ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)

thodson-usgs avatar Feb 02 '24 21:02 thodson-usgs

@observingClouds, I'm also new to numcodecs, but I think it's designed to prohibit our initial proposal (probably a sign of a good codec). But you too can join the ranks of zarr-devs, if you can fix this or any other typo :)

def _cdf_from_info_per_bit(info_per_bit):
    """Convert info_per_bit to cumulative distribution function"""
    # I suspect something is wrong with tol
    #tol = info_per_bit[-4:].max() * 1.5
    #info_per_bit[info_per_bit < tol] = 0
    cdf = info_per_bit.cumsum()
    return cdf / cdf[-1]

I think the first objective is to write a good codec. Once that's done, we can assess how and to what extent this gets rolled back into xbitinfo.

thodson-usgs avatar Feb 03 '24 04:02 thodson-usgs

Maybe register_codec is the route to adding the codec to xbitinfo. For example, https://github.com/cgohlke/tifffile/blob/master/tifffile/numcodecs.py

thodson-usgs avatar Feb 04 '24 06:02 thodson-usgs

That looks indeed promising. Thanks for the pointer! I actually have played with an external BitRounding codec before it got implemented directly in numcodecs. This should be relatively straight forward to add to xbitinfo.

observingClouds avatar Feb 05 '24 05:02 observingClouds

I take that back, register_codec might be the wrong approach here. I believe the plan is:

  1. implement a numpy-based codec in numcodec (basically done)
  2. that codec could be migrated to xbitinfo such that the bitinfo codec imports xbitinfo, but we can assess the practicality of that after the codec is setup.

Please let me know if I've misjudged the necessity of a numpy-based codec, otherwise we can jump to 2 and import the dask-xarray version in numcodecs

thodson-usgs avatar Feb 05 '24 17:02 thodson-usgs