BitInformation.jl
Discuss best practices for `xr.Dataset.to_netcdf()`
I want to use `BitInformation` and `BitRound` on NetCDF output. Anyone with experience with xarray?
What I do: see also https://github.com/milankl/BitInformation.jl/issues/25#issuecomment-1060889412

My routine:

- run `plot_bitinformation.jl` and save the keepbits (between 0 and 23) for each variable to a dict, stored as JSON (a sketch of the format follows below)
- use the code from https://github.com/zarr-developers/numcodecs/pull/299/files to apply `BitRound` to each variable
- `ds_rounded.to_netcdf(path, encoding, engine='h5netcdf')`
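For concreteness, the JSON is just a flat mapping from variable name to keepbits; a minimal sketch of how it could be written (variable names and values here are made up, not real analysis output):

```python
# hypothetical sketch: store per-variable keepbits (0-23) as a flat JSON dict
import json

keepbits = {"temperature": 7, "salinity": 6}  # example values only
with open("keepbits.json", "w") as f:
    json.dump(keepbits, f)
```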
I use `engine="h5netcdf"` because I can use `{"compression": True}` in the encoding. I do this because some lossless compression is needed for `BitRound` to be effective on disk: the rounding itself only creates trailing zero bits in the mantissa, and it takes a compressor to turn those into actual storage savings. (I have only very little knowledge of compression and netcdf4 basics.) But probably there is a better way. What I do not want is for the compression to impact the read performance too much, i.e. reading the compressed data should not take much more time (subjective, I know) than before (or is this the tradeoff anyway?).
https://xarray.pydata.org/en/latest/generated/xarray.Dataset.to_netcdf.html
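For comparison, a sketch of the same with explicit zlib settings (the `complevel` here is an assumption, matching the default that shows up in the encoding result further below):

```python
# sketch: explicit zlib encoding instead of h5netcdf's {"compression": True}
encoding = {v: {"zlib": True, "complevel": 4} for v in ds.data_vars}
ds.to_netcdf("out.nc", engine="netcdf4", encoding=encoding)
```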
```python
# https://github.com/zarr-developers/numcodecs/pull/299/files
import numpy as np

from numcodecs.abc import Codec
from numcodecs.compat import ensure_ndarray, ndarray_copy


class BitRound(Codec):
    codec_id = 'bitround'

    def __init__(self, keepbits: int):
        if (keepbits < 0) or (keepbits > 23):
            raise ValueError("keepbits must be between 0 and 23")
        self.keepbits = keepbits

    def encode(self, buf):
        if self.keepbits == 23:  # float32 has 23 mantissa bits: nothing to round
            return buf
        # TODO: figure out if we need to make a copy
        # Currently this appears to be overwriting the input buffer
        # Is that the right behavior?
        a = ensure_ndarray(buf).view()
        assert a.dtype == np.float32
        b = a.view(dtype=np.int32)  # reinterpret the bits, no conversion
        maskbits = 23 - self.keepbits
        mask = (0xFFFFFFFF >> maskbits) << maskbits
        half_quantum1 = (1 << (maskbits - 1)) - 1
        # round to nearest (ties to even), then zero the discarded mantissa bits
        b += ((b >> maskbits) & 1) + half_quantum1
        b &= mask
        return b

    def decode(self, buf, out=None):
        # rounding is lossy; "decoding" just reinterprets the bytes as float32
        data = ensure_ndarray(buf).view(np.float32)
        out = ndarray_copy(data, out)
        return out


def bitround(data, keepbits):
    codec = BitRound(keepbits=keepbits)
    data = data.copy()  # otherwise encode overwrites the input
    encoded = codec.encode(data)
    return codec.decode(encoded)
```
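A quick sanity check of what the rounding does (a made-up example):

```python
import numpy as np

x = np.array([1.0, 3.14159], dtype=np.float32)
print(bitround(x, keepbits=7))
# [1.0, 3.140625]: with 7 mantissa bits kept, 3.14159 rounds down to 3.140625
```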
```python
# my code below
import json
import sys

import xarray as xr

print('Requires three args: python BitRound_file.py ifile.nc keepbits.json ofile.nc')
assert len(sys.argv) == 3 + 1
_, ifile, keepbits_json, ofile = sys.argv

print("keepbits_json", keepbits_json)
with open(keepbits_json) as f:
    keepbits_lookup = json.load(f)

ds = xr.open_dataset(ifile)
ds_round = ds.copy()
for v in ds.data_vars:
    if v in keepbits_lookup:
        keepbits = keepbits_lookup[v]
        # here corrections could be applied, e.g. some hard fixes:
        # if keepbits < 2:
        #     keepbits = 5
        ds_round[v].values = bitround(ds[v].values, keepbits)
ds_round = ds_round.assign_coords(ds.coords)

encoding = {v: {'compression': True} for v in ds.data_vars if v in keepbits_lookup}
print('BitRound_file.py writes', ofile)
ds_round.to_netcdf(ofile, mode='w', engine='h5netcdf', encoding=encoding)
```
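This is invoked as, e.g., `python BitRound_file.py input.nc keepbits.json output.nc` (file names are placeholders).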
EDIT: the resulting encoding:

```python
>>> ds_rounded[v].encoding
{'zlib': True, 'shuffle': False, 'complevel': 4, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 5, 28, 64), 'source': '/work/bm1124/m300524/experiments/asp_esmControl_PMassim_1850_ATMsfn_over_2006/outdata/hamocc/asp_esmControl_PMassim_1850_ATMsfn_over_2006_hamocc_data_3d_ym_18590101_18591231_comp99.nc', 'original_shape': (1, 40, 220, 256), 'dtype': dtype('float32'), '_FillValue': nan, 'coordinates': 'lon lat'}
```
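So with `engine='h5netcdf'`, `{'compression': True}` apparently translates to zlib at the default `complevel` of 4.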
Some further interesting reading for me:

- https://www.unidata.ucar.edu/blogs/developer/entry/netcdf_compression
- https://unidata.github.io/netcdf4-python/#efficient-compression-of-netcdf-variables: use `unlimited_dims=[]` in `to_netcdf()` (see the sketch after this list)
- https://discourse.pangeo.io/t/xarray-and-compression-options-for-large-netcdf-files/1154/8
- video training: https://www.youtube.com/watch?v=ml4RkEaB4DQ
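A minimal sketch of that `unlimited_dims` tip, reusing the names from the script above (per the netcdf4-python docs, unlimited dimensions can lead to inefficient default chunking):

```python
# sketch: write without any unlimited dimensions so the default chunking
# (and hence the compression) can work on larger chunks
ds_round.to_netcdf(ofile, engine='h5netcdf', encoding=encoding, unlimited_dims=[])
```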
> What I do not want is that the compression impacts the read performance too much, i.e. reading the compressed data should not take much more time than before (or is this the tradeoff anyway?)
Many lossless compressors have a decompression speed that is approximately independent of the compression level, so read performance is rarely something to worry about with this method. See, e.g., the benchmarks in https://www.nature.com/articles/s43588-021-00156-2 (figure not reproduced here).
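A rough way to check this on your own files (a sketch; file paths are placeholders):

```python
# sketch: compare read times of the original vs. the rounded+compressed file
import time

import xarray as xr

for path in ["original.nc", "rounded_compressed.nc"]:  # hypothetical paths
    t0 = time.perf_counter()
    xr.open_dataset(path, engine="h5netcdf").load()  # force reading all data
    print(f"{path}: {time.perf_counter() - t0:.2f} s")
```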
Compress and save with xarray in xbitinfo: https://xbitinfo.readthedocs.io/en/latest/api.html#save-compressed
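If I read the xbitinfo docs right, the whole routine above collapses to roughly this (a sketch along the lines of the xbitinfo quickstart; the dimension and information level are example choices):

```python
import xarray as xr
import xbitinfo as xb

ds = xr.open_dataset("input.nc")                    # hypothetical input file
bitinfo = xb.get_bitinformation(ds, dim="lon")      # bitwise real information content
keepbits = xb.get_keepbits(bitinfo, inflevel=0.99)  # keepbits preserving 99% information
ds_rounded = xb.xr_bitround(ds, keepbits)           # apply bitrounding
ds_rounded.to_compressed_netcdf("output.nc")        # save with lossless compression
```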