
Single value variable of type int32 in NetCDF becomes float64 in Kerchunk

Open rsignell opened this issue 11 months ago • 5 comments

@martindurant, looks like we still have a single-value variable problem. In these AWS Open Data NetCDF files, the variable 'spherical' has a single int32 value but it becomes a float64 after kerchunk: https://nbviewer.org/gist/rsignell-usgs/5971951d348496229ce121b52a2fb750

(I discovered this because the xroms package designed to work with these ROMS NetCDF files bombed -- took me a while to figure out this was the reason...)

rsignell avatar Mar 02 '24 19:03 rsignell

I am fairly puzzled, because the metadata says int:

>>> fs = fsspec.filesystem("reference", fo=single_json, remote_protocol="s3", remote_options=so)
>>> fs.cat("spherical/.zarray")
b'{"chunks":[],"compressor":null,"dtype":"<i4","fill_value":-2147483647,"filters":null,"order":"C","shape":[],"zarr_format":2}'

and zarr agrees:

>>> g = zarr.open(fs.get_mapper())
>>> g.spherical.dtype
dtype('int32')

xarray has a bunch of "decode*" flags in open_dataset, but I can't immediately see one that might do the right thing here.

The value, by the way, is just 1. Is this actually a boolean?

martindurant avatar Mar 05 '24 18:03 martindurant

I believe the reason is the fill_value. At the moment, float* is one of the few data types that can represent missing values (using nan), while int* can't, so an integer variable that carries a fill value gets promoted to float when it is masked. mask_and_scale=False should be what you're looking for, and I believe you can decode only the variables you need using:

In [20]: import xarray as xr
    ...: 
    ...: ds = xr.Dataset(
    ...:     {
    ...:         "a": ("x", [0, 1, 2], {"_FillValue": 1}),
    ...:         "b": ("x", [0.1, 0.2, 1.0], {"_FillValue": 1.0}),
    ...:     }
    ...: )
    ...: skipped_variables = [
    ...:     name
    ...:     for name, var in ds.variables.items()
    ...:     if "_FillValue" in var.attrs and var.dtype.kind not in "cfmMO"
    ...: ]
    ...: 
    ...: 
    ...: def decode_with_skip(ds, skip=None):
    ...:     if not skip:
    ...:         return xr.decode_cf(ds)
    ...: 
    ...:     return ds[skip].merge(xr.decode_cf(ds.drop_vars(skip)))
    ...: 
    ...: 
    ...: display(ds)
    ...: display(ds.pipe(decode_with_skip, skip=skipped_variables).compute())
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 24B 0 1 2
    b        (x) float64 24B 0.1 0.2 1.0
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 24B 0 1 2
    b        (x) float64 24B 0.1 0.2 nan

(This might change with the custom dtypes in numpy, but it will take some effort to get working "nullable integer" dtypes)

keewis avatar Mar 06 '24 14:03 keewis

@keewis: but the data here has an int fill_value and no _FillValue. Are you saying that having a fill value of any sort will cause a cast int->float even when there are actually no nulls?

martindurant avatar Mar 06 '24 14:03 martindurant

Ah indeed, if I set the fill_value to null in the JSON, you get an int :|
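
As a rough sketch of that workaround, the fill_value could be cleared in the reference set before it is opened. This assumes the references are already loaded as a plain dict mapping keys such as "spherical/.zarray" to JSON strings (the "refs" part of a kerchunk JSON file); the helper name clear_integer_fill_values is made up for illustration and is not part of kerchunk. Note that this discards the missing-value marker, so it is only appropriate for variables known to contain no masked values.

```python
import json

# Zarr v2 dtype strings look like "<i4" or "|u1"; the second character is the kind.
INTEGER_KINDS = ("i", "u")


def clear_integer_fill_values(refs, names=None):
    """Return a copy of `refs` with fill_value set to null for integer arrays.

    `refs` maps reference keys to values; `names` optionally restricts the
    change to the listed variable names. (Hypothetical helper, not a kerchunk API.)
    """
    out = dict(refs)
    for key, value in refs.items():
        if not key.endswith("/.zarray"):
            continue
        name = key[: -len("/.zarray")]
        if names is not None and name not in names:
            continue
        # Metadata is normally stored as a JSON string; tolerate a dict as well.
        meta = json.loads(value) if isinstance(value, str) else dict(value)
        dtype = meta.get("dtype")
        if isinstance(dtype, str) and len(dtype) > 1 and dtype[1] in INTEGER_KINDS:
            meta["fill_value"] = None
            out[key] = json.dumps(meta)
    return out


# The .zarray content shown earlier in this thread.
refs = {
    "spherical/.zarray": '{"chunks":[],"compressor":null,"dtype":"<i4",'
    '"fill_value":-2147483647,"filters":null,"order":"C","shape":[],"zarr_format":2}'
}
patched = clear_integer_fill_values(refs, names=["spherical"])
print(patched["spherical/.zarray"])
```

The original dict is left untouched, so the patched copy can be compared against it or written out to a separate JSON file.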

martindurant avatar Mar 06 '24 14:03 martindurant

zarr's fill_value is translated to the _FillValue attribute. The masking is applied using where, without first checking whether any values actually match the fill value (that check is potentially expensive), and the mask value and the promoted dtype are decided in xarray.core.dtypes.maybe_promote.
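
A small reproduction of this, without kerchunk or S3, might look like the following. The dataset is a hypothetical stand-in for the "spherical" variable, built in memory with the same dtype and fill value as in the metadata above; only xr.decode_cf and its mask_and_scale flag are used.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for "spherical": a 0-d int32 value of 1 carrying
# the _FillValue attribute that zarr's fill_value is translated into.
encoded = xr.Dataset(
    {"spherical": ((), np.array(1, dtype="int32"), {"_FillValue": -2147483647})}
)

# Default decoding applies the mask, which promotes the integer dtype.
masked = xr.decode_cf(encoded)

# Skipping mask_and_scale leaves the stored dtype alone.
unmasked = xr.decode_cf(encoded, mask_and_scale=False)

print(masked["spherical"].dtype)
print(unmasked["spherical"].dtype)
```

open_dataset accepts the same mask_and_scale flag, but there it applies to every variable, which is why the per-variable skip shown earlier in the thread can be the better fit.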

keewis avatar Mar 06 '24 14:03 keewis