
Slicing not supported with Indexer when trying to open many MUR files.

betolink opened this issue 1 year ago · 5 comments

xref: https://github.com/nsidc/earthaccess/issues/903

I'm trying to open 10 years of high-resolution MUR data using earthaccess. The example below works fine when restricted to 1 month (32 granules), but when I try to open ~1000 granules I get the following error:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:1031, in apply_indexer(indexable, indexer)
   1029     return indexable.vindex[indexer]
   1030 elif isinstance(indexer, OuterIndexer):
-> 1031     return indexable.oindex[indexer]
   1032 else:
   1033     return indexable[indexer]

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:369, in IndexCallable.__getitem__(self, key)
    368 def __getitem__(self, key: Any) -> Any:
--> 369     return self.getter(key)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:1508, in NumpyIndexingAdapter._oindex_get(self, indexer)
   1506 def _oindex_get(self, indexer: OuterIndexer):
   1507     key = _outer_to_numpy_indexer(indexer, self.array.shape)
-> 1508     return self.array[key]

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/manifests/array.py:214, in ManifestArray.__getitem__(self, key)
    212     return self
    213 else:
--> 214     raise NotImplementedError(f"Doesn't support slicing with {indexer}")

NotImplementedError: Doesn't support slicing with (array([....]))

I'm not sure if there is something wrong with the data (a misaligned grid in one granule, or something like that). @TomNicholas mentioned that this may be related to #51.

The full example (earthaccess has to be installed from source):


import earthaccess

# NASA JPL Multiscale Ultrahigh Resolution (MUR) Sea Surface Temperature (SST)
# dataset, 0.01 degree resolution
results = []
for year in range(2010, 2020):
    # December-February granules for each winter
    seasonal_granules = earthaccess.search_data(
        short_name="MUR-JPL-L4-GLOB-v4.1",
        temporal=(f"{year}-12", f"{year+1}-02"),
        cloud_hosted=True,
    )
    results.extend(seasonal_granules)
len(results)
913

# and then 

%%time
mur = earthaccess.open_virtual_mfdataset(
    results,
    access="indirect",
    load=True, # it also failed with load=False
    parallel=True,
    concat_dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="drop_conflicts",
)
mur

betolink · Dec 20 '24 17:12

I'm not sure why this operation requires indexing at all...

Can you open the files individually and then concatenate them to try to narrow down where / for which dataset the error is occurring?
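
Something like the following bisection sketch could localize the failing pair (hypothetical code: it assumes a vdatasets list of individually opened virtual datasets, like the one built in the next comment):

import xarray as xr

# Try concatenating each adjacent pair of virtual datasets; a pair that
# raises NotImplementedError brackets a granule whose schema differs.
for i in range(len(vdatasets) - 1):
    try:
        xr.concat(
            vdatasets[i : i + 2], dim="time",
            coords="minimal", compat="override",
        )
    except NotImplementedError as err:
        print(f"pair ({i}, {i + 1}) fails: {err}")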

TomNicholas · Dec 20 '24 17:12

So I tried that approach and narrowed it down a bit. The code snippet below uses the same results variable from @betolink's example above and reproduces the error with just two xr.Datasets. The main difference I found between the two datasets is that one has an extra variable (dt_1km_data), as shown in the output below. I guess at some point the MUR netCDF product added a variable to the granules? That presumably makes it difficult to concatenate with references to the older files.

Also, FYI: I checked the source netCDFs and verified that one of them really does have the extra variable, to make sure this isn't a DMR++ reader error.
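
For reference, the check can be as simple as opening one source netCDF directly (no DMR++ involved) and listing its variables. This is only a sketch; the granule index and the use of earthaccess.open() are illustrative:

import earthaccess
import xarray as xr

# Open the source netCDF for one granule and list its data variables;
# one of the two adjacent granules should include dt_1km_data and the
# other should not.
f = earthaccess.open(results[495:496])[0]
print(sorted(xr.open_dataset(f).data_vars))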

import dask
import earthaccess
import virtualizarr as vz
import xarray as xr

fs = earthaccess.get_s3_filesystem(results=results)
fs.storage_options["anon"] = False  # type: ignore

# Build a list of dask-delayed virtual datasets, one per granule
open_ = dask.delayed(vz.open_virtual_dataset)  # type: ignore
vdatasets = []
for g in results:
    vdatasets.append(
        open_(
            filepath=g.data_links(access="direct")[0] + ".dmrpp",
            filetype="dmrpp",  # type: ignore
            indexes={},
            reader_options={"storage_options": fs.storage_options},  # type: ignore
        )
    )
vdatasets = dask.compute(vdatasets)[0]

# Concatenating just these two adjacent granules reproduces the error
xr.combine_nested(
    vdatasets[495:497],
    concat_dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="drop_conflicts",
)
print(vdatasets[495])
print(vdatasets[496])
<xarray.Dataset> Size: 5GB
Dimensions:           (time: 1, lat: 17999, lon: 36000)
Coordinates:
    time              (time) int32 4B ManifestArray<shape=(1,), dtype=int32, ...
    lat               (lat) float32 72kB ManifestArray<shape=(17999,), dtype=...
    lon               (lon) float32 144kB ManifestArray<shape=(36000,), dtype...
Data variables:
    mask              (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    sea_ice_fraction  (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    dt_1km_data       (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    analysed_sst      (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
    analysis_error    (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
Attributes: (12/47)
    Conventions:                CF-1.5
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
<xarray.Dataset> Size: 4GB
Dimensions:           (time: 1, lat: 17999, lon: 36000)
Coordinates:
    time              (time) int32 4B ManifestArray<shape=(1,), dtype=int32, ...
    lat               (lat) float32 72kB ManifestArray<shape=(17999,), dtype=...
    lon               (lon) float32 144kB ManifestArray<shape=(36000,), dtype...
Data variables:
    mask              (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    sea_ice_fraction  (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    analysed_sst      (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
    analysis_error    (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
Attributes: (12/47)
    Conventions:                CF-1.5
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
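
A quick way to spot every granule whose variable set differs, sketched against the vdatasets list from above:

# Compare each virtual dataset's variable names against the first granule's
baseline = set(vdatasets[0].data_vars)
for i, ds in enumerate(vdatasets):
    diff = set(ds.data_vars) ^ baseline  # symmetric difference
    if diff:
        print(i, sorted(diff))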

ayushnag · Dec 20 '24 18:12

haha, I knew this was an issue with model output (variables can appear and disappear between runs), but it's news to me that it can happen with nominally-standard archival data. FWIW, I think xarray will try to reindex dt_1km_data onto the full time vector by indexing with [-1, -1, -1, -1, ..., N-1, N], which is why you get the error.
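
You can see where that -1 pattern comes from with plain pandas (a rough sketch; xarray's alignment resolves labels through pandas, where Index.get_indexer returns -1 for labels missing from the source index):

import pandas as pd

# Times covered by a granule that lacks part of the full record...
short_index = pd.Index([3, 4, 5])
# ...looked up against the full concatenated time vector:
print(short_index.get_indexer([0, 1, 2, 3, 4, 5]))  # [-1 -1 -1  0  1  2]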

dcherian · Dec 20 '24 18:12

@dcherian Yup, the error output if I change the range of results to [545:555] is:

NotImplementedError: Doesn't support slicing with (array([-1, -1, -1,  0,  1,  2,  3,  4,  5,  6]), slice(None, None, None), slice(None, None, None))

ayushnag · Dec 20 '24 18:12

Huh. I'm not sure I fully understand yet: is this an example of trying to auto-generate indexes? Or an xarray "virtual variable" thing (not a virtual variable in the VirtualiZarr sense)?

TomNicholas · Dec 20 '24 19:12