Ugly error from xarray chunkmanager when loading ManifestArray
If you create the air1.nc and air2.nc files in the same way as in the docs and then concatenate the resulting virtual datasets with compat='equals', you get a super ugly error.
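For reference, a minimal sketch of how those files might be created (this assumes the xr.tutorial "air_temperature" dataset split along time, as in the docs; the exact split point is arbitrary and doesn't matter for the error):

```python
import xarray as xr

# Split the tutorial air_temperature dataset into two netCDF files,
# roughly as the VirtualiZarr docs do (split point chosen arbitrarily).
ds = xr.tutorial.open_dataset("air_temperature")
ds.isel(time=slice(None, 1460)).to_netcdf("air1.nc")
ds.isel(time=slice(1460, None)).to_netcdf("air2.nc")
```

Then the reproducer: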
In [9]: vds1 = open_virtual_dataset('air1.nc', indexes={})
In [10]: vds2 = open_virtual_dataset('air2.nc', indexes={})
In [11]: xr.concat([vds1, vds2], dim='time', compat='equals')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 xr.concat([vds1, vds2], dim='time', compat='equals')
File ~/Documents/Work/Code/xarray/xarray/core/concat.py:276, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
263 return _dataarray_concat(
264 objs,
265 dim=dim,
(...)
273 create_index_for_new_dim=create_index_for_new_dim,
274 )
275 elif isinstance(first_obj, Dataset):
--> 276 return _dataset_concat(
277 objs,
278 dim=dim,
279 data_vars=data_vars,
280 coords=coords,
281 compat=compat,
282 positions=positions,
283 fill_value=fill_value,
284 join=join,
285 combine_attrs=combine_attrs,
286 create_index_for_new_dim=create_index_for_new_dim,
287 )
288 else:
289 raise TypeError(
290 "can only concatenate xarray Dataset and DataArray "
291 f"objects, got {type(first_obj)}"
292 )
File ~/Documents/Work/Code/xarray/xarray/core/concat.py:538, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
532 datasets = [
533 ds.expand_dims(dim, create_index_for_new_dim=create_index_for_new_dim)
534 for ds in datasets
535 ]
537 # determine which variables to concatenate
--> 538 concat_over, equals, concat_dim_lengths = _calc_concat_over(
539 datasets, dim, dim_names, data_vars, coords, compat
540 )
542 # determine which variables to merge, and then merge them according to compat
543 variables_to_merge = (coord_names | data_names) - concat_over
File ~/Documents/Work/Code/xarray/xarray/core/concat.py:437, in _calc_concat_over(datasets, dim, dim_names, data_vars, coords, compat)
434 concat_over.update(opt)
436 process_subset_opt(data_vars, "data_vars")
--> 437 process_subset_opt(coords, "coords")
438 return concat_over, equals, concat_dim_lengths
File ~/Documents/Work/Code/xarray/xarray/core/concat.py:391, in _calc_concat_over.<locals>.process_subset_opt(opt, subset)
384 concat_over.add(k)
386 elif equals[k] is None:
387 # Compare the variable of all datasets vs. the one
388 # of the first dataset. Perform the minimum amount of
389 # loads in order to avoid multiple loads from disk
390 # while keeping the RAM footprint low.
--> 391 v_lhs = datasets[0].variables[k].load()
392 # We'll need to know later on if variables are equal.
393 computed = []
File ~/Documents/Work/Code/xarray/xarray/core/variable.py:976, in Variable.load(self, **kwargs)
959 def load(self, **kwargs):
960 """Manually trigger loading of this variable's data from disk or a
961 remote source into memory and return this variable.
962
(...)
974 dask.array.compute
975 """
--> 976 self._data = to_duck_array(self._data, **kwargs)
977 return self
File ~/Documents/Work/Code/xarray/xarray/namedarray/pycompat.py:129, in to_duck_array(data, **kwargs)
126 from xarray.namedarray.parallelcompat import get_chunked_array_type
128 if is_chunked_array(data):
--> 129 chunkmanager = get_chunked_array_type(data)
130 loaded_data, *_ = chunkmanager.compute(data, **kwargs) # type: ignore[var-annotated]
131 return loaded_data
File ~/Documents/Work/Code/xarray/xarray/namedarray/parallelcompat.py:165, in get_chunked_array_type(*args)
159 selected = [
160 chunkmanager
161 for chunkmanager in chunkmanagers.values()
162 if chunkmanager.is_chunked_array(chunked_arr)
163 ]
164 if not selected:
--> 165 raise TypeError(
166 f"Could not find a Chunk Manager which recognises type {type(chunked_arr)}"
167 )
168 elif len(selected) >= 2:
169 raise TypeError(f"Multiple ChunkManagers recognise type {type(chunked_arr)}")
TypeError: Could not find a Chunk Manager which recognises type <class 'virtualizarr.manifests.array.ManifestArray'>
This is because compat='equals' (which is currently xarray's default, though that should change, see https://github.com/pydata/xarray/issues/8778) tries to load the coordinate variables in order to compare their values. Xarray thinks it needs to load them because it sees the .chunks attribute, assumes it's a computable array like a dask or cubed array, and then searches for a corresponding ChunkManager to use to compute this chunked array.
Basically ManifestArray breaks one of xarray's assumptions by being chunked but not computable. So it's another example of an array that causes the same issue as https://github.com/pydata/xarray/issues/8733.
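A minimal illustration of that mismatch (assuming the 'air' variable name from the tutorial dataset above; the same load path fails for any variable backed by a ManifestArray):

```python
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset("air1.nc", indexes={})
marr = vds["air"].data

print(type(marr))    # <class 'virtualizarr.manifests.array.ManifestArray'>
print(marr.chunks)   # exposes .chunks, so xarray treats it as a chunked (computable) array...

vds["air"].load()    # ...but no registered ChunkManager recognises it -> the TypeError above
```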
The behaviour we want here is for xarray to be more lenient and not attempt to load the array. Then it will progress to the __eq__ comparison, and ManifestArray can either report a more useful error if it gets coerced to an index, or simply return its own definition of equality.
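For concreteness, here is one possible notion of equality for virtual arrays (a hypothetical sketch, not necessarily what virtualizarr implements; it assumes ManifestArray exposes .zarray and .manifest attributes): two ManifestArrays are "equal" when they reference exactly the same bytes on disk.

```python
import numpy as np

def manifest_arrays_equal(a, b) -> np.ndarray:
    # Hypothetical: two virtual arrays are equal element-wise iff they reference
    # the same bytes, i.e. identical zarr metadata and identical chunk manifests.
    # (Assumes .zarray and .manifest attributes; not the actual implementation.)
    same = a.shape == b.shape and a.zarray == b.zarray and a.manifest == b.manifest
    return np.full(a.shape, same, dtype=bool)
```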
@TomNicholas sorry if this is a silly question; I am but a neophyte trying to place VirtualiZarr in my mind: what can a user do with a collection of ManifestArrays besides concatenating/merging them, given that they do not load data?
I think the eventual goal is to:
- Read a file and create a ChunkManifest for it.
- Repeat for many files and concat & merge them.
- Write a Zarr store for the dataset but use the ChunkManifests instead of the chunks themselves (avoiding the copies).
- Open the Zarr store with Zarr or Xarray, where those libraries do understand how to load data from the serialized ChunkManifests.
Is that correct? If so, I will stop trying to load data from the xr.Dataset returned by virtualizarr.open_virtual_dataset
@ghidalgo3 that's exactly right. The only thing you've missed is that although writing out a "virtual" zarr store is a future aim, writing out actual kerchunk reference files works today! Which means that you can use this package for the same purpose for which lots of people were already using kerchunk.combine.MultiZarrToZarr.
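A rough sketch of that end-to-end workflow (this assumes the vds.virtualize.to_kerchunk accessor and the usual fsspec "reference" filesystem for reading kerchunk JSON; file names and options are illustrative, not prescriptive):

```python
import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

# 1. Create a virtual dataset (wrapping ManifestArrays) per file.
vds1 = open_virtual_dataset("air1.nc", indexes={})
vds2 = open_virtual_dataset("air2.nc", indexes={})

# 2. Combine them without touching the actual chunk data
#    (compat="override" / coords="minimal" sidesteps the equality check above).
combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")

# 3. Serialize the chunk references out as a kerchunk JSON file.
combined.virtualize.to_kerchunk("combined.json", format="json")

# 4. Later, open the references as an ordinary lazy dataset whose data can be loaded.
fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
```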
I will stop trying to load data from the xr.Dataset returned by virtualizarr.open_virtual_dataset
Yes, you cannot load the ManifestArray objects directly, and there wouldn't be any point in doing so either. The whole point is that those arrays then get serialized back to disk as references (either as virtual zarr via chunk manifests, or as kerchunk reference files).
Serious question: Did you read the documentation? If so how do you think it could be improved so as to make all this clearer?
EDIT: In my opinion there is no such thing as silly questions, only imperfect documentation :)
I did read the documentation. I think what was unclear to me was that there are two kinds of xr.Dataset objects I encountered with VirtualiZarr: the one returned by virtualizarr.open_virtual_dataset and the one returned by xr.open_dataset(virtualizarr_produced_kerchunk_or_zarr_store).
The first xr.Dataset is good for concatenating/merging and writing to storage, so that xarray can later open the result as a new xr.Dataset whose data can actually be read. I know this is said on this page, but only in hindsight did I understand the implication:
VirtualiZarr aims to allow you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.
If I could wave a magic wand to make it better, I would not want the return value of open_virtual_dataset to be an xr.Dataset, because that dataset doesn't behave like other xr.Dataset objects; I would instead return something like a VirtualiZarr.Dataset which only has the functions that do work. But I understand that xr.Dataset is close enough, and users (me) shouldn't expect it to load data.
(@ghidalgo3 I replied in #171 so as to preserve the original topic of this issue)