xarray
Reading netcdf with engine=scipy fails with a TypeError under certain conditions
What happened?
Saving and loading from netcdf with engine=scipy produces an unexpected TypeError on read. The file seems to be corrupted.
What did you expect to happen?
Reading works just fine.
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "values": (
            ["name", "time"],
            np.array([[]], dtype=np.float32).T,
        )
    },
    coords={"time": [1], "name": []},
).expand_dims({"index": [0]})
ds.to_netcdf("file.nc", engine="scipy")
_ = xr.open_dataset("file.nc", engine="scipy")
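For contrast, a minimal sketch of the same round-trip written with engine="netcdf4" instead (an assumption on my part that this works, not something confirmed upstream; requires the netCDF4 library and reuses ds from the example above):

# Hedged sketch: same dataset, written and read with the netcdf4 engine.
# Reuses the `ds` defined in the example above.
ds.to_netcdf("file_nc4.nc", engine="netcdf4")
_ = xr.open_dataset("file_nc4.nc", engine="netcdf4")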
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
KeyError Traceback (most recent call last)
File .../python3.11/site-packages/xarray/backends/file_manager.py:211, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
210 try:
--> 211 file = self._cache[self._key]
212 except KeyError:
File .../python3.11/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
55 with self._lock:
---> 56 value = self._cache[key]
57 self._cache.move_to_end(key)
KeyError: [<function _open_scipy_netcdf at 0x7fe96afa9120>, ('/home/eivind/Projects/ert/file.nc',), 'r', (('mmap', None), ('version', 2)), '264ec6b3-78b3-4766-bb41-7656d6a51962']
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
Cell In[1], line 18
4 ds = (
5 xr.Dataset(
6 {
(...)
15 .expand_dims({"index": [0]})
16 )
17 ds.to_netcdf("file.nc", engine="scipy")
---> 18 _ = xr.open_dataset("file.nc", engine="scipy")
File .../python3.11/site-packages/xarray/backends/api.py:572, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
560 decoders = _resolve_decoders_kwargs(
561 decode_cf,
562 open_backend_dataset_parameters=backend.open_dataset_parameters,
(...)
568 decode_coords=decode_coords,
569 )
571 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 572 backend_ds = backend.open_dataset(
573 filename_or_obj,
574 drop_variables=drop_variables,
575 **decoders,
576 **kwargs,
577 )
578 ds = _dataset_from_backend_dataset(
579 backend_ds,
580 filename_or_obj,
(...)
590 **kwargs,
591 )
592 return ds
File .../python3.11/site-packages/xarray/backends/scipy_.py:315, in ScipyBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, mode, format, group, mmap, lock)
313 store_entrypoint = StoreBackendEntrypoint()
314 with close_on_error(store):
--> 315 ds = store_entrypoint.open_dataset(
316 store,
317 mask_and_scale=mask_and_scale,
318 decode_times=decode_times,
319 concat_characters=concat_characters,
320 decode_coords=decode_coords,
321 drop_variables=drop_variables,
322 use_cftime=use_cftime,
323 decode_timedelta=decode_timedelta,
324 )
325 return ds
File .../python3.11/site-packages/xarray/backends/store.py:43, in StoreBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
29 def open_dataset( # type: ignore[override] # allow LSP violation, not supporting **kwargs
30 self,
31 filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
(...)
39 decode_timedelta=None,
40 ) -> Dataset:
41 assert isinstance(filename_or_obj, AbstractDataStore)
---> 43 vars, attrs = filename_or_obj.load()
44 encoding = filename_or_obj.get_encoding()
46 vars, attrs, coord_names = conventions.decode_cf_variables(
47 vars,
48 attrs,
(...)
55 decode_timedelta=decode_timedelta,
56 )
File .../python3.11/site-packages/xarray/backends/common.py:210, in AbstractDataStore.load(self)
188 def load(self):
189 """
190 This loads the variables and attributes simultaneously.
191 A centralized loading function makes it easier to create
(...)
207 are requested, so care should be taken to make sure its fast.
208 """
209 variables = FrozenDict(
--> 210 (_decode_variable_name(k), v) for k, v in self.get_variables().items()
211 )
212 attributes = FrozenDict(self.get_attrs())
213 return variables, attributes
File .../python3.11/site-packages/xarray/backends/scipy_.py:181, in ScipyDataStore.get_variables(self)
179 def get_variables(self):
180 return FrozenDict(
--> 181 (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
182 )
File .../python3.11/site-packages/xarray/backends/scipy_.py:170, in ScipyDataStore.ds(self)
168 @property
169 def ds(self):
--> 170 return self._manager.acquire()
File .../python3.11/site-packages/xarray/backends/file_manager.py:193, in CachingFileManager.acquire(self, needs_lock)
178 def acquire(self, needs_lock=True):
179 """Acquire a file object from the manager.
180
181 A new file is only opened if it has expired from the
(...)
191 An open file object, as returned by ``opener(*args, **kwargs)``.
192 """
--> 193 file, _ = self._acquire_with_cache_info(needs_lock)
194 return file
File .../python3.11/site-packages/xarray/backends/file_manager.py:217, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
215 kwargs = kwargs.copy()
216 kwargs["mode"] = self._mode
--> 217 file = self._opener(*self._args, **kwargs)
218 if self._mode == "w":
219 # ensure file doesn't get overridden when opened again
220 self._mode = "a"
File .../python3.11/site-packages/xarray/backends/scipy_.py:109, in _open_scipy_netcdf(filename, mode, mmap, version)
106 filename = io.BytesIO(filename)
108 try:
--> 109 return scipy.io.netcdf_file(filename, mode=mode, mmap=mmap, version=version)
110 except TypeError as e: # netcdf3 message is obscure in this case
111 errmsg = e.args[0]
File .../python3.11/site-packages/scipy/io/_netcdf.py:278, in netcdf_file.__init__(self, filename, mode, mmap, version, maskandscale)
275 self._attributes = {}
277 if mode in 'ra':
--> 278 self._read()
File .../python3.11/site-packages/scipy/io/_netcdf.py:607, in netcdf_file._read(self)
605 self._read_dim_array()
606 self._read_gatt_array()
--> 607 self._read_var_array()
File .../python3.11/site-packages/scipy/io/_netcdf.py:688, in netcdf_file._read_var_array(self)
685 data = None
686 else: # not a record variable
687 # Calculate size to avoid problems with vsize (above)
--> 688 a_size = reduce(mul, shape, 1) * size
689 if self.use_mmap:
690 data = self._mm_buf[begin_:begin_+a_size].view(dtype=dtype_)
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.11.4 (main, Dec 7 2023, 15:43:41) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.2.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development
xarray: 2024.1.1
pandas: 2.1.1
numpy: 1.26.1
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: 3.10.0
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.8.0
cartopy: None
seaborn: 0.13.1
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 63.4.3
pip: 23.3.1
conda: None
pytest: 7.4.4
mypy: 1.8.0
IPython: 8.17.2
sphinx: 7.2.6
@eivindjahren Thanks for bringing this to our attention.
From the description it's a bit unclear which engine you want/need to use. You mentioned engine=netcdf (should that be netcdf4?), while the code example uses engine="scipy". From what I can tell, the scipy engine uses the NETCDF3 data model, which restricts the dimensions of variables: only one dimension can be unlimited, and it must be the first dimension of the variable.
Running ncdump on the file written by the example confirms this:
$ ncdump file.nc
ncdump: file.nc: NetCDF: NC_UNLIMITED in the wrong index
But if we move the zero-length dimension to the front before saving:
ds = ds.transpose("name", "index", "time")
the resulting file isn't even recognized by ncdump:
$ ncdump file.nc
ncdump: file.nc: NetCDF: Unknown file format
whereas it can be read back perfectly fine with engine="scipy".
I did not explore further, but something weird is going on with the scipy engine here.
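Putting those observations together, a minimal sketch of the experiment (assuming scipy is installed and ncdump is available for the cross-check):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"values": (["name", "time"], np.array([[]], dtype=np.float32).T)},
    coords={"time": [1], "name": []},
).expand_dims({"index": [0]})

# Put the zero-length "name" dimension first so NETCDF3's
# record-dimension rule is nominally satisfied.
ds.transpose("name", "index", "time").to_netcdf("file.nc", engine="scipy")

# scipy reads its own output back without complaint ...
_ = xr.open_dataset("file.nc", engine="scipy")

# ... while `ncdump file.nc` rejects the same file with
# "NetCDF: Unknown file format".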
> From the description it's a bit unclear which engine you want/need to use. You mentioned engine=netcdf (should that be netcdf4?)

Sorry, I meant engine=scipy, that was a typo. We have decided to use that in our application for performance reasons.
> Sorry, I meant engine=scipy, that was a typo. We have decided to use that in our application for performance reasons.

A bit off-topic now, but can you elaborate on what performance benefits you see with the NETCDF3 format in your use case? What is preventing you from using the netcdf4 backend?
For the scipy backend issue I'd appreciate it if someone with more knowledge of that part of the code base could chime in.
> A bit off-topic now, but can you elaborate on what performance benefits you see with the NETCDF3 format in your use case? What is preventing you from using the netcdf4 backend?

I don't have the specifics about the benchmarks that were performed, but I will see what I can find. We plan to switch to netcdf4 because we want to use datetime64.
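A rough sketch of the kind of comparison we would rerun (the shapes here are made up, and this is not the original benchmark):

import time

import numpy as np
import xarray as xr

# Hypothetical payload, purely for illustration.
ds = xr.Dataset({"values": (["x", "t"], np.random.rand(2000, 2000))})

for engine in ("scipy", "netcdf4"):
    start = time.perf_counter()
    ds.to_netcdf(f"bench_{engine}.nc", engine=engine)
    xr.open_dataset(f"bench_{engine}.nc", engine=engine).load()
    print(engine, round(time.perf_counter() - start, 3))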
Closing for now. If this is still an issue please reopen with updated information. Thanks!