Saving and loading arrays with boolean eltypes
As far as I can tell, netCDF does not support a boolean eltype, so boolean variables need to be written as integers. I'm working with some netCDF files saved using the netcdf library via xarray, which seems to handle this by saving boolean data as integers with the attribute dtype="bool". Is there an option to tell NCDatasets during load time to set the eltype based on an attribute like this?
Can you give me some example code in python-xarray that writes and reads such a Boolean array?
Sure, here's an example:
>>> import xarray as xr
>>> import numpy as np
>>> x = np.random.normal(size=(4, 100)) > 0
>>> ds = xr.Dataset(
... data_vars=dict(x=(['chain', 'draw'], x)),
... coords=dict(chain=range(4), draw=range(100)),
... )
>>> ds.x.dtype
dtype('bool')
>>> ds.to_netcdf("foo.nc")
>>> ds2 = xr.open_dataset("foo.nc")
>>> ds2.x.dtype
dtype('bool')
>>> np.array_equal(ds.x, ds2.x)
True
When we load foo.nc using NCDatasets, we can see the dtype attribute:
julia> using NCDatasets
julia> ds = NCDataset("foo.nc");
julia> ds["x"]
x (100 × 4)
Datatype: Int8
Dimensions: draw × chain
Attributes:
dtype = bool
Thanks for the example!
When reading the variable with the python-NetCDF4 package, it seems that the variable is also returned as an integer. I am not aware of any other package (Matlab, Octave or R) that treats the dtype attribute in a special way. dtype is also not mentioned in the CF standard, which I aim to follow.
This reminds me of the discussion about _Unsigned = "true": it was introduced before NetCDF had real unsigned types (now we have them, at least for the HDF5 format), but it led to inconsistencies and errors. Some of these issues have since been fixed by adding the unsigned data types to OPeNDAP as well.
It is also not quite clear to me how the _FillValue, valid_min, valid_max and valid_range attributes should be handled when the dtype attribute changes the element type of an array.
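To make the _FillValue concern concrete, here is a minimal sketch in plain Python (no netCDF library involved). The function name decode_bool is hypothetical, and mapping fill values to None is only one possible interpretation, not something xarray or NCDatasets is known to do. It assumes the data are stored as signed bytes with netCDF's default byte fill value of -127.

```python
# netCDF's default fill value for signed bytes (NC_FILL_BYTE).
FILL_BYTE = -127


def decode_bool(raw, fill_value=FILL_BYTE):
    """Hypothetical decoder: map stored integers to bool.

    Fill values have no boolean counterpart, so they become None here.
    Anything other than 0, 1 or the fill value is rejected.
    """
    out = []
    for v in raw:
        if v == fill_value:
            out.append(None)      # missing value, neither True nor False
        elif v in (0, 1):
            out.append(bool(v))
        else:
            raise ValueError(f"{v} is not a valid boolean encoding")
    return out


print(decode_bool([0, 1, FILL_BYTE]))  # [False, True, None]
```

The point is that the decoded element type can no longer be a plain boolean once a fill value is present (in Julia terms, something like Union{Missing,Bool}), and valid_min / valid_max would still refer to the stored integers.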
Unfortunately, h5py implements boolean types in a way that is incompatible with xarray (it uses enums).
So I don't think that we should import this xarray-specific extension into NCDatasets.
Maybe we can give the user an API for implementing specific encoding/decoding functions, like
function transformation(v::NCDatasets.Variable)
    if get(v.attrib, "dtype", "") == "bool"
        # return the (encode, decode) function pair
        return x -> Int8(x), x -> Bool(x)
    else
        return identity, identity
    end
end
Would this be worth the effort?
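For illustration, here is how such a user-supplied hook might be used, sketched in plain Python rather than Julia. Everything here is hypothetical: the transformation function, the attrib dictionary and the stored list just stand in for a variable's attributes and raw data, and none of it is an existing NCDatasets or xarray API.

```python
def transformation(attrib):
    """Return an (encode, decode) pair chosen from a variable's attributes."""
    if attrib.get("dtype", "") == "bool":
        # encode: bool -> int for storage, decode: int -> bool on load
        return int, bool

    def identity(x):
        return x

    return identity, identity


# Stand-ins for a variable's attributes and its raw stored values.
attrib = {"dtype": "bool"}
stored = [0, 1, 1, 0]

encode, decode = transformation(attrib)
decoded = [decode(v) for v in stored]
print(decoded)  # [False, True, True, False]

# Encoding the decoded values gives back what was stored.
assert [encode(v) for v in decoded] == stored
```

With a design like this, the library itself would not need to know about the dtype attribute. The convention lives entirely in the user's function.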
The true fix would be to add a native boolean type to NetCDF/HDF5. Is there any feature request about this?