xbitinfo
xbitinfo copied to clipboard
Adding the python bitinfo to numcodecs
I've been looking into modifying the numcodecs bitround
codec to accept a user defined function to determine the number of bits to round. namely xbitinfo.bitinformation
. This would streamline the process of chunk-wise bitrounding to something like
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(custom_function())]
encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)
where custom_function
would essentially wrap get_bitinformation
and get_keepbits
. All in all, that may not offer much over your current chunk-wise approach, except enabling us to use bitinformation
in pangeo-forge
compression and rechunking pipelines.
Alternatively, we could add some stripped down bitinfo
implementation to numcodec and avoid the need for custom_function()
. I'd be happy to help with that, but I don't want to advance that without permission from the xbitinfo
team. Both projects have MIT license.
Sounds great to me, I let Hauke comment though.
Hi @thodson-usgs,
Thanks for reaching out! I like your proposal. The challenges I can see and which also emerge from your issue at numcodecs is that we would like to have as few implementations as possible to keep the algorithms maintainable as we develop them further. I therefore favour your custom_function
suggestion over a new implementation into numcodecs. It should also be noted also that the custom_function
needs only to be applied in the writing process and not in the reading process. The filters
argument in the output file could therefore remain the same. In this case the user would not need to have xbitinfo or any other dependencies installed that provide custom_function
.
Exactly right, of course. I'll proceed with custom_function
for now and see where we get.
Along that tack, a possible outcome might be to implement such a function in xbitinfo
, or at the very least add an example construction to numcodecs
doc.
filters = [BitRound(xbitinfo.helper_function)]
https://github.com/zarr-developers/numcodecs/pull/503#issuecomment-1919916377 Unfortunately, I'm not sure this will work.
I'm by no means the expert, but as I look through the respective code bases, a reimplementation might be the simplest option, and could refer users to xbitinfo
for the lastest and greatest version. Even if passing custom_function
did work, xbitinfo
's whole implementation is really built to leverage xarray
. In general, I like that design choice. However, for this purpose, I'm finding myself trying to unravel some of that into a simpler function that operates on arrays.
I didn't double check, but the only dependency might be numpy
.
...true, this might reduce direct traffic to xbitinfo
, but I also think it's a valuable step into mainstream adoption.
...or numcodecs
might accept xbitinfo
as an optional dependency, allowing us to import xbitinfo within the bitround
codec. I haven't advanced that idea yet.
@martindurant, I'm bringing you into this discussion with the xbitinfo
team before continuing on the numcodecs
side.
The two options I'd like both camps to mull over are:
- Reimplementing a basic bitinfo algorithm in
numcodecs
. I think the only dependency isnumpy
. - Having
xbitinfo
as an optional dependency in thebitround
codec.
Rather than including this in bitround
, lets create a bitinfo
codec that calls bitround
.
Here's my draft proposal: https://github.com/thodson-usgs/numcodecs/blob/adaptive-bitrounding/numcodecs/bitinfo.py
(ah, still a couple bugs here, but you'll get the gist)
Okay works now. Going for multithread.
Going for multithread.
Zarr is commonly used with dask, so it should be enough to have your algorithms not hold the GIL
Thanks @thodson-usgs for all your work! If I understand it correctly you copy/reimplement part of the xbitinfo package to add to a numcodec algorithm. Can't we find a different solution, where e.g. the xbitinfo package provides an entry point to numcodecs? This will require xbitinfo to get installed to provide the codec but as mentioned above this should only be necessary for users who write data and not for those that read it and we make it easier for new features to be implemented across the board.
I sense that could get complicated, but I defer to the respective maintainers for guidance.
Currently,
import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature")
from numcodecs import Blosc, BitInfo
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]
encoding = {"air": {"compressor": compressor, "filters": filters}}
ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)
@observingClouds,
I'm also new to numcodecs
, but I think it's designed to prohibit our initial proposal (probably a sign of a good codec).
But you too can join the ranks of zarr-devs
, if you can fix this or any other typo :)
def _cdf_from_info_per_bit(info_per_bit):
"""Convert info_per_bit to cumulative distribution function"""
# I suspect something is wrong with tol
#tol = info_per_bit[-4:].max() * 1.5
#info_per_bit[info_per_bit < tol] = 0
cdf = info_per_bit.cumsum()
return cdf / cdf[-1]
I think the first objective is to write a good codec. Once that's done, we can assess how and to what extent this gets rolled back into xbitinfo
.
Maybe register_codec
is the route to adding the codec to xbitinfo
.
For example, https://github.com/cgohlke/tifffile/blob/master/tifffile/numcodecs.py
That looks indeed promising. Thanks for the pointer! I actually have played with an external BitRounding codec before it got implemented directly in numcodecs. This should be relatively straight forward to add to xbitinfo.
I take that back, register_codec
might be the wrong approach here. I believe the plan is:
- implement a numpy-based codec in
numcodec
(basically done) - that codec could be migrated to
xbitinfo
such that thebitinfo
codec importsxbitinfo
, but we can assess the practicality of that after the codec is setup.
Please let me know if I've misjudged the necessity of a numpy
-based codec, otherwise we can jump to 2 and import the dask
-xarray
version in numcodecs