Cannot open dataset with empty list units

Open antscloud opened this issue 1 year ago • 5 comments

What happened?

I was working with a netCDF file whose units attribute is empty, and xr.open_dataset failed while decoding the CF conventions. I reproduced the bug: it happens specifically when units is an empty list (see the Minimal Complete Verifiable Example).

What did you expect to happen?

The units attribute should be parsed as an empty string?

Minimal Complete Verifiable Example

import numpy as np
import pandas as pd
import xarray as xr

temp = 15 + 8 * np.random.randn(2, 2, 3)
precip = 10 * np.random.rand(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]

# for real use cases, it's good practice to supply array attributes such as
# units, but we won't bother here for the sake of brevity
ds = xr.Dataset(
    {
        "temperature": (["x", "y", "time"], temp),
        "precipitation": (["x", "y", "time"], precip),
    },
    coords={
        "lon": (["x", "y"], lon),
        "lat": (["x", "y"], lat),
        "time": pd.date_range("2014-09-06", periods=3),
        "reference_time": pd.Timestamp("2014-09-05"),
    },
)
# an empty list as the units attribute is what triggers the failure on read
ds.temperature.attrs["units"] = []

ds.to_netcdf("test.nc")

# raises TypeError: unhashable type: 'numpy.ndarray'
ds = xr.open_dataset("test.nc")
ds.close()

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 ds = xr.open_dataset("test.nc")
      2 print(ds["temperature"].attrs)
      3 ds.close()

File ~/.local/src/miniconda/envs/uptodatexarray/lib/python3.10/site-packages/xarray/backends/api.py:495, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    483 decoders = _resolve_decoders_kwargs(
    484     decode_cf,
    485     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    491     decode_coords=decode_coords,
    492 )
    494 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495 backend_ds = backend.open_dataset(
    496     filename_or_obj,
    497     drop_variables=drop_variables,
    498     **decoders,
    499     **kwargs,
    500 )
    501 ds = _dataset_from_backend_dataset(
    502     backend_ds,
    503     filename_or_obj,
   (...)
    510     **kwargs,
    511 )
    512 return ds

File ~/.local/src/miniconda/envs/uptodatexarray/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:564, in NetCDF4BackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, format, clobber, diskless, persist, lock, autoclose)
    562 store_entrypoint = StoreBackendEntrypoint()
    563 with close_on_error(store):
--> 564     ds = store_entrypoint.open_dataset(
    565         store,
    566         mask_and_scale=mask_and_scale,
    567         decode_times=decode_times,
    568         concat_characters=concat_characters,
    569         decode_coords=decode_coords,
    570         drop_variables=drop_variables,
    571         use_cftime=use_cftime,
    572         decode_timedelta=decode_timedelta,
    573     )
    574 return ds

File ~/.local/src/miniconda/envs/uptodatexarray/lib/python3.10/site-packages/xarray/backends/store.py:27, in StoreBackendEntrypoint.open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     24 vars, attrs = store.load()
     25 encoding = store.get_encoding()
---> 27 vars, attrs, coord_names = conventions.decode_cf_variables(
     28     vars,
     29     attrs,
     30     mask_and_scale=mask_and_scale,
     31     decode_times=decode_times,
     32     concat_characters=concat_characters,
     33     decode_coords=decode_coords,
     34     drop_variables=drop_variables,
     35     use_cftime=use_cftime,
     36     decode_timedelta=decode_timedelta,
     37 )
     39 ds = Dataset(vars, attrs=attrs)
     40 ds = ds.set_coords(coord_names.intersection(vars))

File ~/.local/src/miniconda/envs/uptodatexarray/lib/python3.10/site-packages/xarray/conventions.py:503, in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
    499     continue
    500 stack_char_dim = (
    501     concat_characters and v.dtype == "S1" and v.ndim > 0 and stackable(v.dims[-1])
    502 )
--> 503 new_vars[k] = decode_cf_variable(
    504     k,
    505     v,
    506     concat_characters=concat_characters,
    507     mask_and_scale=mask_and_scale,
    508     decode_times=decode_times,
    509     stack_char_dim=stack_char_dim,
    510     use_cftime=use_cftime,
    511     decode_timedelta=decode_timedelta,
    512 )
    513 if decode_coords in [True, "coordinates", "all"]:
    514     var_attrs = new_vars[k].attrs

File ~/.local/src/miniconda/envs/uptodatexarray/lib/python3.10/site-packages/xarray/conventions.py:354, in decode_cf_variable(name, var, concat_characters, mask_and_scale, decode_times, decode_endianness, stack_char_dim, use_cftime, decode_timedelta)
    351         var = coder.decode(var, name=name)
    353 if decode_timedelta:
--> 354     var = times.CFTimedeltaCoder().decode(var, name=name)
    355 if decode_times:
    356     var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)

File ~/.local/src/miniconda/envs/uptodatexarray/lib/python3.10/site-packages/xarray/coding/times.py:537, in CFTimedeltaCoder.decode(self, variable, name)
    534 def decode(self, variable, name=None):
    535     dims, data, attrs, encoding = unpack_for_decoding(variable)
--> 537     if "units" in attrs and attrs["units"] in TIME_UNITS:
    538         units = pop_to(attrs, encoding, "units")
    539         transform = partial(decode_cf_timedelta, units=units)

TypeError: unhashable type: 'numpy.ndarray'

Anything else we need to know?

The following assignment triggers the bug:

ds.temperature.attrs["units"] = []

But these do not:

ds.temperature.attrs["units"] = "[]"
ds.temperature.attrs["units"] = ""

Also, I don't know how the units attribute gets encoded on write, but ncdump shows no difference in the file between ds.temperature.attrs["units"] = "" and ds.temperature.attrs["units"] = [].

Environment

This bug was encountered with versions below this one.

INSTALLED VERSIONS

commit: None
python: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.6.1

xarray: 0.20.1
pandas: 1.4.3
numpy: 1.22.3
scipy: None
netCDF4: 1.5.7
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
setuptools: 61.2.0
pip: 22.1.2
conda: None
pytest: None
IPython: 8.4.0
sphinx: None

antscloud • Jul 13 '22 12:07

@antscloud As a workaround, you could pass the keyword argument decode_cf=False to xr.open_dataset. After setting the units attribute to some reasonable value, you can call ds = xr.decode_cf(ds).

kmuehlbauer • Jul 13 '22 13:07

Thank you, I'll do this. In this particular case, one could just loop over the variables' attributes and replace [] with an empty string.
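
The workaround could be sketched roughly as follows. This is only an illustration under stated assumptions: is_empty_sequence, blank_empty_units, and open_with_clean_units are hypothetical helper names, not part of xarray. Only xr.open_dataset(..., decode_cf=False) and xr.decode_cf come from the thread.

```python
def is_empty_sequence(value):
    """Return True for [], (), or an empty array; False for strings and scalars."""
    if isinstance(value, (str, bytes)):
        return False
    try:
        return len(value) == 0
    except TypeError:
        # Scalars (and 0-d arrays) have no length.
        return False


def blank_empty_units(attrs):
    """Replace an empty-sequence 'units' entry with '' in place (hypothetical helper)."""
    if "units" in attrs and is_empty_sequence(attrs["units"]):
        attrs["units"] = ""
    return attrs


def open_with_clean_units(path):
    """Open without CF decoding, clean the units attributes, then decode."""
    # Imported here so the two helpers above can be used without xarray installed.
    import xarray as xr

    ds = xr.open_dataset(path, decode_cf=False)
    for var in ds.variables.values():
        blank_empty_units(var.attrs)
    return xr.decode_cf(ds)
```

With the file from the example above, ds = open_with_clean_units("test.nc") would be the intended usage. Whether an empty string is the right replacement value depends on the data; it is only the choice suggested in this thread.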

antscloud • Jul 13 '22 13:07

I guess we could take a PR to change

if "units" in attrs and attrs["units"] in TIME_UNITS:

to

if "units" in attrs and isinstance(attrs["units"], str) and attrs["units"] in TIME_UNITS:
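
To illustrate why the extra isinstance check helps, here is a small standalone sketch. The TIME_UNITS set below is a stand-in whose members are assumed for the example, and both function names are hypothetical. A plain list plays the role of the numpy array from the traceback, since both are unhashable.

```python
# Stand-in for xarray's TIME_UNITS; the exact members are an assumption.
TIME_UNITS = frozenset({"days", "hours", "minutes", "seconds"})


def has_time_units_unguarded(attrs):
    # Membership in a frozenset hashes the value, so an unhashable
    # value (list, numpy array) raises TypeError here.
    return "units" in attrs and attrs["units"] in TIME_UNITS


def has_time_units_guarded(attrs):
    # The isinstance check short-circuits before the value is hashed.
    return (
        "units" in attrs
        and isinstance(attrs["units"], str)
        and attrs["units"] in TIME_UNITS
    )


if __name__ == "__main__":
    print(has_time_units_guarded({"units": "days"}))
    print(has_time_units_guarded({"units": []}))
    try:
        has_time_units_unguarded({"units": []})
    except TypeError as err:
        print("unguarded check failed:", err)
```

The guarded version simply treats any non-string units value as "not a time unit", which leaves the attribute untouched rather than rewriting it.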

dcherian • Jul 13 '22 14:07

I was wondering why the units attribute is parsed this way in the first place. It seems that this attribute is converted to a Python object (a list). Is it xarray that does this, or the netCDF4 bindings?

If it's xarray, wouldn't it be better to just not parse it?

antscloud • Jul 13 '22 14:07

xarray checks the units attribute to see whether the variable can be decoded as a time variable.

dcherian • Jul 13 '22 14:07

I think this is now fixed by #7085 (thanks @ghislainp).

dcherian • Oct 03 '22 20:10