kerchunk
Single value variable of type int32 in NetCDF becomes float64 in Kerchunk
@martindurant, looks like we still have a single-value variable problem. In these AWS Open Data NetCDF files, the variable 'spherical' has a single int32 value but it becomes a float64 after kerchunk: https://nbviewer.org/gist/rsignell-usgs/5971951d348496229ce121b52a2fb750
(I discovered this because the xroms package designed to work with these ROMS NetCDF files bombed -- took me a while to figure out this was the reason...)
I am fairly puzzled; the metadata says int:
>>> fs = fsspec.filesystem("reference", fo=single_json, remote_protocol="s3", remote_options=so)
>>> fs.cat("spherical/.zarray")
b'{"chunks":[],"compressor":null,"dtype":"<i4","fill_value":-2147483647,"filters":null,"order":"C","shape":[],"zarr_format":2}'
and zarr agrees:
>>> g = zarr.open(fs.get_mapper())
>>> g.spherical.dtype
dtype('int32')
xarray has a bunch of "decode*" flags in open_dataset, but I can't immediately see one that might do the right thing here.
The value, by the way, is just 1. This is actually a boolean?
I believe the reason is the fill_value. At the moment, float* is one of the few data types that can have missing values (using nan), while int* can't represent missing values. mask_and_scale=False should be what you're looking for, and I believe you can convert only the ones you need using:
In [20]: import xarray as xr
    ...:
    ...: ds = xr.Dataset(
    ...:     {
    ...:         "a": ("x", [0, 1, 2], {"_FillValue": 1}),
    ...:         "b": ("x", [0.1, 0.2, 1.0], {"_FillValue": 1.0}),
    ...:     }
    ...: )
    ...: skipped_variables = [
    ...:     name
    ...:     for name, var in ds.variables.items()
    ...:     if "_FillValue" in var.attrs and var.dtype.kind not in "cfmMO"
    ...: ]
    ...:
    ...:
    ...: def decode_with_skip(ds, skip=None):
    ...:     if not skip:
    ...:         return xr.decode_cf(ds)
    ...:
    ...:     return ds[skip].merge(xr.decode_cf(ds.drop_vars(skip)))
    ...:
    ...:
    ...: display(ds)
    ...: display(ds.pipe(decode_with_skip, skip=skipped_variables).compute())
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 24B 0 1 2
    b        (x) float64 24B 0.1 0.2 1.0
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 24B 0 1 2
    b        (x) float64 24B 0.1 0.2 nan
(This might change with the custom dtypes in numpy, but it will take some effort to get working "nullable integer" dtypes)
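To see the effect of mask_and_scale in isolation, here is a minimal sketch. The dataset is a made-up in-memory stand-in for the 'spherical' variable (a scalar int32 with an integer _FillValue attribute); it is not read from the ROMS files, so treat it as an illustration rather than a reproduction of the original notebook.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the scalar int32 'spherical' variable,
# carrying an integer _FillValue like the one in the .zarray metadata.
fill = np.int32(-2147483647)
raw = xr.Dataset(
    {"spherical": xr.Variable((), np.array(1, dtype="int32"), {"_FillValue": fill})}
)

# Default CF decoding masks the fill value, which promotes int32 to float64.
masked = xr.decode_cf(raw)

# With mask_and_scale=False the variable keeps its stored dtype.
unmasked = xr.decode_cf(raw, mask_and_scale=False)

print(masked["spherical"].dtype, unmasked["spherical"].dtype)
```

Note that mask_and_scale=False applies to every variable, so floats with a _FillValue are left unmasked too; that is why the snippet above decodes only a subset.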
@keewis: but the data here has an int fill_value and no _FillValue. Are you saying that having a fill value of any sort will cause a cast int->float even when there are actually no nulls?
Ah indeed, if I set the fill_value to null in the JSON, you get an int :|
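For anyone wanting to try the same experiment, a rough sketch of patching the reference set in Python follows. It assumes the version-1 kerchunk layout, where each .zarray entry is stored as a JSON string under the "refs" key; the helper name null_fill_value and the tiny reference dict are made up for illustration.

```python
import json


def null_fill_value(refs, variable):
    """Return a copy of a kerchunk reference dict with fill_value set to null.

    Hypothetical helper: assumes `refs["refs"]["<variable>/.zarray"]`
    holds the zarr array metadata as a JSON string.
    """
    key = f"{variable}/.zarray"
    # Copy the outer dict and the inner "refs" mapping so the input is untouched.
    patched = dict(refs, refs=dict(refs["refs"]))
    meta = json.loads(patched["refs"][key])
    meta["fill_value"] = None  # serialized as JSON null
    patched["refs"][key] = json.dumps(meta)
    return patched


# Tiny made-up reference set containing only the metadata shown earlier.
refs = {
    "version": 1,
    "refs": {
        "spherical/.zarray": (
            '{"chunks":[],"compressor":null,"dtype":"<i4",'
            '"fill_value":-2147483647,"filters":null,"order":"C",'
            '"shape":[],"zarr_format":2}'
        ),
    },
}

patched = null_fill_value(refs, "spherical")
print(patched["refs"]["spherical/.zarray"])
```

The trade-off is that with a null fill_value, genuinely missing values in that variable would no longer be masked.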
zarr's fill_value is translated to the _FillValue attribute. The masking is applied without checking the actual values (which is potentially expensive) using where, and the mask value and the promoted dtypes are decided in xarray.core.dtypes.maybe_promote.
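The promotion can be seen with the public DataArray.where alone. This is a small sketch of the behaviour described above, not the exact internal code path; the array contents are made up, and none of them equals the fill value.

```python
import numpy as np
import xarray as xr

fill = -2147483647
data = xr.DataArray(np.array([1, 0, 1], dtype="int32"), dims="x")

# Mask entries equal to the fill value. The result dtype is chosen up front
# so that it can hold NaN, whether or not any entry is actually masked.
masked = data.where(data != fill)

print(masked.dtype)                 # dtype after masking
print(int(masked.isnull().sum()))   # number of entries that were masked
```

Even though nothing is masked here, the result is no longer an integer array, which matches what happens to 'spherical'.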