
Challenges with earthaccess.open_virtual_mfdataset for grouped netCDFs with nested structure

danielfromearth opened this issue 9 months ago · 12 comments

Tagging @betolink, @ayushnag, with whom I've previously discussed parts of this issue. The TEMPO data files referenced here can be downloaded here from Earthdata Search (which requires a free login account).

Overview

The TEMPO satellite instrument measures air quality, and I am working to ensure access of its data through the earthaccess.open_virtual_mfdataset() function (see this work-in-progress notebook). However, I run into several errors depending on how I try to open and/or define a preprocessing function.

Original Data Structure – arrays are two-dimensional:

  • Dimension 1, xtrack: its values are identical for each granule.
  • Dimension 2, mirror_step: its values repeat only after a "scan" of the instrument is completed (and there are multiple scans per day).

The following illustration shows the general structure, for two days with three scans per day. Granule (each granule is a netCDF file) numbers are depicted increasing from right-to-left, because the instrument scans East-to-West. Note that mirror_step is occasionally cut short in the last granule of a scan...

Image

Desired outcome

Ideally, I'd like to open the data as a virtual dataset, concatenated in the following logical structure, with (1) the date & scan numbers consolidated to form a new dimension, and (2) the granules within each scan concatenated along the mirror_step dimension:

Image

However, any way of opening all of the granules with earthaccess.open_virtual_mfdataset() would be a success.

Problems I've run into while experimenting

Problem 1

Combining the ManifestArrays seems to be difficult, partly because earthaccess currently processes datasets as a single list passed to xr.combine_nested(), seen here in earthaccess:

vds = xr.combine_nested(vdatasets, **xr_combine_nested_kwargs)

So, passing two concatenation dimensions, as in concat_dim=["datescan", "mirror_step"], raises

Image

Problem 2

Even when opening only the granules within a single scan, or when preprocessing the datasets by assigning a new coordinate that represents date & scan (which I'm terming datescan), I run into another issue:

Image

Here is some of the testing code I am experimenting with (e.g., in this work-in-progress notebook).

But even running xarray.combine_nested() on local files presents the same challenge:

from datetime import datetime

import xarray as xr


def preprocess_root_group(ds: xr.Dataset) -> xr.Dataset:
    # Add datetime with scan number as a new dimension
    yearmonthday_timestamp = datetime.fromisoformat(ds.attrs["time_coverage_start"]).strftime("%Y%m%d")

    datescan = f"{yearmonthday_timestamp}scan{ds.attrs['scan_num']:>02}"

    return (
        ds
        .expand_dims(["datescan"])
        .assign_coords(datescan=[datescan])
    )

open_options = {
    "access": "direct", 
    "load": True,
    "preprocess": preprocess_root_group,
    "concat_dim": ["datescan", "mirror_step"], 
    "coords": 'minimal', 
    "compat": 'override',
    "join": 'override',
    "combine_attrs": 'drop_conflicts',
    "parallel": False,
}

result_root = earthaccess.open_virtual_mfdataset(
    granules=results,
    **open_options
)

danielfromearth avatar Mar 12 '25 21:03 danielfromearth

Your visualizations and explanation are really excellent! I think this is more of an xarray/virtualizarr question so pinging @TomNicholas and @dcherian

I think problem 2 could be happening because mirror_step is cut short. The dimensions need to match exactly to combine the virtual references and when they don't you get: conflicting dimension sizes: {131, 132}. I have encountered this issue before and have trimmed the dataset in the preprocess step. So basically trim all granules to mirror_step: 131 and then concatenate. NaN padding all granules to mirror_step: 132 also seems doable but I have never tried it.
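For concreteness, the trimming could be done per-granule in a preprocess function. Here's a sketch on small synthetic Datasets (the target length of 131 and the variable name `v` are illustrative, not from the TEMPO files):

```python
import numpy as np
import xarray as xr

def trim_mirror_step(ds: xr.Dataset, target_len: int = 131) -> xr.Dataset:
    """Trim a granule to a fixed mirror_step length so all granules share a shape."""
    return ds.isel(mirror_step=slice(0, target_len))

# Two synthetic "granules": one full-length (132) and one cut short (131).
full = xr.Dataset({"v": (("xtrack", "mirror_step"), np.zeros((4, 132)))})
short = xr.Dataset({"v": (("xtrack", "mirror_step"), np.ones((4, 131)))})

trimmed = [trim_mirror_step(ds) for ds in (full, short)]
combined = xr.concat(trimmed, dim="mirror_step")  # now all shapes agree
```

The obvious downside, discussed further below, is that the trimmed samples are real measurements.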

ayushnag avatar Mar 12 '25 22:03 ayushnag

Thank you for writing this up @danielfromearth .

earthaccess currently processes datasets as a single list passed to xr.combine_nested()

You could get around this by using VirtualiZarr directly. Soon we should have a virtualizarr.open_virtual_mfdataset() function but for now you will have to just use virtualizarr.open_virtual_dataset() and xr.concat() directly.
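For illustration, here is roughly what that direct concat step looks like. Plain in-memory Datasets stand in for the `virtualizarr.open_virtual_dataset()` results so the sketch runs without any data files; the kwargs mirror the ones passed through earthaccess above:

```python
import numpy as np
import xarray as xr

# Each entry here stands in for a virtualizarr.open_virtual_dataset(...) result.
granules = [
    xr.Dataset({"v": (("xtrack", "mirror_step"), np.full((4, 3), float(i)))})
    for i in range(3)
]

# The kwargs earthaccess would forward to xr.combine_nested can be
# passed to xr.concat directly.
vds = xr.concat(
    granules,
    dim="mirror_step",
    coords="minimal",
    compat="override",
    join="override",
    combine_attrs="drop_conflicts",
)
```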

ValueError: conflicting dimension sizes

This is not solvable. VirtualiZarr (and Xarray!) relies on your data being representable by the Zarr data model without modifying data. Mismatched dimension sizes mean your data doesn't even form a hypercube of arrays (i.e. your data is "ragged"), never mind having the consistent chunks/encoding that the Zarr model also requires. This means if you want to get your data into Icechunk you're going to have to make some modifications to the data, which means writing at least some "native chunks".

You might still be able to write most of your data as virtual chunks and only modify the chunks at the edges (e.g. in the ways @ayushnag has suggested), but you're going to have to do that in a pretty bespoke way using VirtualiZarr + Xarray to manipulate individual chunks directly.

EDIT: Another idea might be to put the inconsistent Scan 3 data into a separate group, but I don't know if that will work for you (you don't want hundreds of groups).

TomNicholas avatar Mar 12 '25 23:03 TomNicholas

Thanks for the responses!

Based on the above, what seems to me like promising directions are:

Option 1 - modifying the edges

From @ayushnag,

NaN padding all granules to mirror_step: 132 also seems doable but I have never tried it.

In theory, that seems like it could be the ideal way to go! However, from https://docs.xarray.dev/en/stable/generated/xarray.Dataset.pad.html, I wonder: (1) the documentation says Dataset.pad is "experimental"—so what does that mean?—and (2) can that be applied in this case, i.e., applied to ManifestArrays? Not sure who would know the answer to this... @dcherian?
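For reference, this is what NaN-padding looks like with Dataset.pad on ordinary loaded data (synthetic shapes, padding 131 → 132 to match the example above); whether anything equivalent can be applied to ManifestArrays is exactly the open question:

```python
import numpy as np
import xarray as xr

short = xr.Dataset({"v": (("xtrack", "mirror_step"), np.ones((4, 131)))})

# Pad the trailing edge of mirror_step by one sample; the default constant
# fill is NaN (integer variables would be promoted to float to hold it).
padded = short.pad(mirror_step=(0, 1), mode="constant")
```

Note that this operates on in-memory (or lazily loaded) values, so as written it would not apply to virtual references.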

@TomNicholas, you said

...but you're going to have to do that in a pretty bespoke way using VirtualiZarr + Xarray to manipulate individual chunks directly.

How might that work? Do you envision using Dataset.pad, or something else?

Option 2 - separate group

From @TomNicholas,

Another idea might be to put the inconsistent Scan 3 data into a separate group, but I don't know if that will work for you (you don't want hundreds of groups).

I don't understand how that would work — do you mean creating a separate group when opening with VirtualiZarr, or do you mean changing the original netCDF file structure? I don't think the latter would work, because the netCDF structure is fixed.

Option 3 - dummy index

A third idea, with a focus on simply making the files openable, may be to discard the idea of representing the intuitive 3D arrays and instead consider whether there is a way to replace mirror_step with a dummy index. Might that be doable? I'm imagining it as a 2D virtual array, expanded along the dummy index—something like what's depicted here:

Image

Other notes:

1. Changing the interaction with VirtualiZarr

From @TomNicholas,

You could get around this by using VirtualiZarr directly. Soon we should have a virtualizarr.open_virtual_mfdataset() function but for now you will have to just use virtualizarr.open_virtual_dataset() and xr.concat() directly.

@ayushnag, I think we would need to modify the logic and API of earthaccess.open_virtual_mfdataset() to enable this sort of interaction with VirtualiZarr. If we can determine a way to make this overall use case work, perhaps we could talk further about whether such changes to earthaccess would make sense.

2. Trimming the edges

Another option, mentioned by @ayushnag,

I have encountered this issue before and have trimmed the dataset in the preprocess step. So basically trim all granules to mirror_step: 131 and then concatenate

That seems less preferable in this case, because it would discard a sizable amount of valid measurement data. In this case, we want to avoid discarding data.

danielfromearth avatar Mar 14 '25 14:03 danielfromearth

Some top-level comments.

That seems less preferable in this case, because it would discard a sizable amount of valid measurement data. In this case, we want to avoid discarding data.

The fundamental issue here is that this data is not (yet) a consistent level 3 datacube. You cannot provide virtual access to the whole thing as a single datacube without modification if the original data does not form a datacube.

I think we would need to modify the logic and API for earthaccess.open_virtual_mfdataset() to enable this sort of interaction with VirtualiZarr.

This is an entirely orthogonal issue. You're not going to be able to paper over these limitations of the zarr data model through changes to the earthaccess API.


Now let's discuss the various ways you could modify the data you present to earthaccess users to make it into a valid datacube.

the documentation says Dataset.pad is "experimental"—so what does that mean?

Good question. Not great that we have apparently had that say "experimental" for 4 years...

Nevertheless, by following git blame on the docstring, I found this issue. There it sounds like the experimental label is focused on questions about how padding xarray indexes should work. But Indexes don't get serialized to the store (they are recreated by decoding metadata upon opening), so I think this experimentalness needn't concern you.

can that be applied in this case, i.e., applied to ManifestArrays?

Not to ManifestArrays, at least not yet. It would likely require a new codec - see https://github.com/zarr-developers/VirtualiZarr/issues/22.

...but you're going to have to do that in a pretty bespoke way using VirtualiZarr + Xarray to manipulate individual chunks directly.

How might that work? Do you envision using Dataset.pad, or something else?

Whilst you cannot concatenate a ManifestArray and numpy array together in memory, with Icechunk you could store virtual and non-virtual chunks side-by-side in the same array. You would need to use Xarray and VirtualiZarr separately to write to different regions of the same zarr array in icechunk. You can do whatever manipulations you like to the non-virtual chunks, so yes you could use Dataset.pad to adjust those. @abarciauskas-bgse has recently done this mixing of virtual and non-virtual chunks to get around messiness when virtualizing the GPM-IMERG data.

do you mean creating a separate group when opening with VirtualiZarr, or do you mean changing the original netCDF file structure?

Every suggestion I am making here assumes you have no power to change the original netCDF files. But you can still make arbitrary rearrangements to the metadata in the virtual store (because it's duplicated in the virtual store), including which chunks are referred to from which arrays (so for example you could rename every variable and put them in a different group in the virtual store). Also you can put some non-virtual chunks in an icechunk store as described above. But that comes with the downside of duplicating the original data for those non-virtual chunks.

discard the idea of representing the intuitive 3D arrays and instead consider whether there is a way to replace mirror_step with a dummy index.

So basically you create a new "dummy" dimension and concatenate along that? That could work, but your arrays still need to have consistent shapes, chunks, dtypes and codecs along that new dimension.
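Both halves of that statement can be illustrated with plain Datasets (synthetic shapes; this says nothing about chunks or codecs, only shapes):

```python
import numpy as np
import xarray as xr

def granule(n_mirror: int) -> xr.Dataset:
    return xr.Dataset({"v": (("xtrack", "mirror_step"), np.zeros((4, n_mirror)))})

# Consistent shapes: stacking along a new "dummy" dimension works.
ok = xr.concat([granule(131).expand_dims("dummy") for _ in range(2)], dim="dummy")

# Ragged shapes: the same stacking fails, because the arrays must still
# align along every other dimension.
try:
    xr.concat(
        [granule(131).expand_dims("dummy"), granule(132).expand_dims("dummy")],
        dim="dummy",
    )
    raised = False
except ValueError:
    raised = True
```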

TomNicholas avatar Mar 14 '25 15:03 TomNicholas

@TomNicholas, following along, although this topic dances along the edge of my understanding...

Regarding Dataset.pad


can that [Dataset.pad] be applied in this case, i.e., applied to ManifestArrays?

Not to ManifestArrays, at least not yet. It would likely require a new codec - see https://github.com/zarr-developers/VirtualiZarr/issues/22.

I don't understand the context of "codecs", but this #22 issue does seem to be tackling the same problem. If a new codec were developed, would that eliminate the need for manipulating a ManifestArray alongside numpy arrays, or for manipulating individual chunks? And if a new codec is the solution, then... what's involved in developing one? (Sounds like a huge effort.)

Regarding concatenating different type chunks


Whilst you cannot concatenate a ManifestArray and numpy array together in memory, with Icechunk you could store virtual and non-virtual chunks side-by-side in the same array. You would need to use Xarray and VirtualiZarr separately to write to different regions of the same zarr array in icechunk. You can do whatever manipulations you like to the non-virtual chunks, so yes you could use Dataset.pad to adjust those. @abarciauskas-bgse has recently done this mixing of virtual and non-virtual chunks to get around messiness when virtualizing the GPM-IMERG data.

This sounds quite messy! And you would lose some of the efficiencies of using virtual chunks, correct?

Regarding concatenating along a dummy index


discard the idea of representing the intuitive 3D arrays and instead consider whether there is a way to replace mirror_step with a dummy index.

So basically you create a new "dummy" dimension and concatenate along that? That could work, but your arrays still need to have consistent shapes, chunks, dtypes and codecs along that new dimension.

Is this because of the still-open variable-length chunk issue?
If the data are being concatenated along the mirror_step dimension—which is the dimension that has different lengths—then you'd consider it a "variable-length" chunk problem as opposed to a "ragged" datacube problem, right?

danielfromearth avatar Mar 18 '25 15:03 danielfromearth

Update

I found a way to concatenate the netCDF granules on my local machine, although it's not so elegant. It follows the "Option 3 - dummy index" approach from one of my previous comments:

from datetime import datetime

import numpy as np
import xarray as xr


def preprocess_root_group(ds: xr.Dataset) -> xr.Dataset:
    """Add datetime with scan number and mirror_step as a new dimension and coordinate.

    Datetime, scan number (padded to 2 digits), and mirror_step (padded to 4 digits with
    leading zeros) are combined into a single integer.

    For example, the new coordinate's value for date==20250311, scan_num==10, mirror_step==1
        will be 20250311100001.
    """
    yearmonthday_timestamp = datetime.fromisoformat(ds.attrs["time_coverage_start"]).strftime("%Y%m%d")

    datescan = f"{yearmonthday_timestamp}{ds.attrs['scan_num']:>02}"

    datescan_mirror_step = np.char.add(datescan, np.char.mod('%04d', ds.mirror_step)).astype(int)

    return (
        ds
        .assign_coords(datescan_mirror_step=("mirror_step", datescan_mirror_step))
        .swap_dims({"mirror_step": "datescan_mirror_step"})
    )

open_options = {
    "preprocess": preprocess_root_group,
    "combine": 'nested',
    "coords": 'all', 
    "combine_attrs": 'drop_conflicts'
}

combined_dataset_root = xr.open_mfdataset(sorted(filepaths), **open_options)

A partial success

Here is the concatenated Dataset, with the newly created datescan_mirror_step dimension: Image

Continuing issues when attempting to do this with ManifestArrays

However, when running this through earthaccess and VirtualiZarr, this error gets raised: Image

which, according to the traceback, happens on this line:

new_coords_value = np.char.add(datescan, np.char.mod('%04d', ds.mirror_step)).astype(int)

Here is the full traceback:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[5], line 29
     12     return (
     13         ds
     14         .rename_dims({"mirror_step": "datescan_mirror_step"})
   (...)
     17         .assign_coords({"scan_number": ("datescan_mirror_step", scan_num_coord)})
     18     )
     20 open_options = {
     21     "access": "direct", 
     22     "load": True,
   (...)
     27     "parallel": False,
     28 }
---> 29 result_root = earthaccess.open_virtual_mfdataset(
     30     granules=results,
     31     **open_options
     32 )

File /srv/conda/envs/notebook/lib/python3.11/site-packages/earthaccess/dmrpp_zarr.py:121, in open_virtual_mfdataset(granules, group, access, load, preprocess, parallel, **xr_combine_nested_kwargs)
    111     vdatasets.append(
    112         open_(
    113             filepath=g.data_links(access=access)[0] + ".dmrpp",
   (...)
    118         )
    119     )
    120 if preprocess is not None:
--> 121     vdatasets = [preprocess(ds) for ds in vdatasets]
    122 if parallel:
    123     vdatasets = dask.compute(vdatasets)[0]  # type: ignore

File /srv/conda/envs/notebook/lib/python3.11/site-packages/earthaccess/dmrpp_zarr.py:121, in <listcomp>(.0)
    111     vdatasets.append(
    112         open_(
    113             filepath=g.data_links(access=access)[0] + ".dmrpp",
   (...)
    118         )
    119     )
    120 if preprocess is not None:
--> 121     vdatasets = [preprocess(ds) for ds in vdatasets]
    122 if parallel:
    123     vdatasets = dask.compute(vdatasets)[0]  # type: ignore

Cell In[5], line 7, in preprocess_root_group(ds)
      3 yearmonthday_timestamp = datetime.fromisoformat(ds.attrs["time_coverage_start"]).strftime("%Y%m%d")
      5 datescan = f"{yearmonthday_timestamp}{ds.attrs['scan_num']:>02}"
----> 7 new_coords_value = np.char.add(datescan, np.char.mod('%04d', ds.mirror_step)).astype(int)
      9 # print(new_coords_value)
     10 scan_num_coord = np.full_like(new_coords_value, ds.attrs['scan_num']).astype(int)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/numpy/_core/strings.py:194, in mod(a, values)
    173 def mod(a, values):
    174     """
    175     Return (a % i), that is pre-Python 2.6 string formatting
    176     (interpolation), element-wise for a pair of array_likes of str
   (...)
    191 
    192     """
    193     return _to_bytes_or_str_array(
--> 194         _vec_string(a, np.object_, '__mod__', (values,)), a)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/common.py:180, in AbstractArray.__array__(self, dtype, copy)
    178         except TypeError:
    179             copy = False
--> 180 return np.array(self.values, dtype=dtype, copy=copy)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/dataarray.py:814, in DataArray.values(self)
    801 @property
    802 def values(self) -> np.ndarray:
    803     """
    804     The array's data converted to numpy.ndarray.
    805 
   (...)
    812     to this array may be reflected in the DataArray as well.
    813     """
--> 814     return self.variable.values

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/variable.py:566, in Variable.values(self)
    563 @property
    564 def values(self) -> np.ndarray:
    565     """The variable's data as a numpy.ndarray"""
--> 566     return _as_array_or_item(self._data)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/variable.py:363, in _as_array_or_item(data)
    349 def _as_array_or_item(data):
    350     """Return the given values as a numpy array, or as an individual item if
    351     it's a 0d datetime64 or timedelta64 array.
    352 
   (...)
    361     TODO: remove this (replace with np.asarray) once these issues are fixed
    362     """
--> 363     data = np.asarray(data)
    364     if data.ndim == 0:
    365         if data.dtype.kind == "M":

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/manifests/array.py:122, in ManifestArray.__array__(self, dtype)
    121 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
--> 122     raise NotImplementedError(
    123         "ManifestArrays can't be converted into numpy arrays or pandas Index objects"
    124     )

NotImplementedError: ManifestArrays can't be converted into numpy arrays or pandas Index objects

@TomNicholas, @ayushnag, thoughts on how to get around this? Or ways to go from "NotImplemented" to "Implemented"? 😊

danielfromearth avatar Mar 20 '25 21:03 danielfromearth

It's unclear from your error report where virtualizarr is being invoked. But I suspect you could get past the error by adding whatever variable that error was raised on to virtualizarr's loadable_variables kwarg.

TomNicholas avatar Mar 20 '25 22:03 TomNicholas

@TomNicholas, thanks for the suggestion!

New error

I tried setting mirror_step with loadable_variables for this use case, which involves earthaccess. However, there is another difficulty, seemingly because of the separation of the DMR++ metadata from the original netCDF files. When mirror_step is included in loadable_variables, it disappears from the Dataset object...

<xarray.Dataset> Size: 8kB
Dimensions:  (xtrack: 2048)
Coordinates:
    xtrack   (xtrack) int32 8kB ManifestArray<shape=(2048,), dtype=int32, chu...
Data variables:
    *empty*

...and this error is raised:

Image

Is this use of loadable_variables unworkable then for this DMR++ case, or is there a way to still load the variable?

(The code I changed for this attempt is in the following "Details"...)

Here is the code I changed, with loadable_variables kwarg added to the earthaccess code on this line:

    # Get list of virtual datasets (or dask delayed objects)
    for g in granules:
        vdatasets.append(
            open_(
                filepath=g.data_links(access=access)[0] + ".dmrpp",
                filetype="dmrpp",  # type: ignore
                group=group,
                indexes={},
                reader_options={"storage_options": fs.storage_options},
                loadable_variables=["mirror_step"]
            )
        )

danielfromearth avatar Mar 21 '25 16:03 danielfromearth

Is this use of loadable_variables unworkable then for this DMR++ case, or is there a way to still load the variable?

I don't immediately know, sorry.

Generally there isn't enough information here for me to quickly reproduce your issue. I would really need an MCVE that doesn't require me to run earthaccess.

But also I'm right in the middle of significantly refactoring how loadable_variables works (https://github.com/zarr-developers/VirtualiZarr/pull/477), and if you can wait until next week these issues might just all go away.

TomNicholas avatar Mar 21 '25 16:03 TomNicholas

Just commenting here to signal intent and subscribe. I might soon be working on SWOT L2 data, which is also served in groups (but I am not clear yet about whether things are actually ragged or not), but nonetheless I am curious to experiment with this and potentially bring some more general functionality downstream to earthaccess.

Also, FWIW @TomNicholas, I'd be happy to sync up some time to get you Earthdata Login access if that would help here.

jbusecke avatar Jul 23 '25 05:07 jbusecke

But also I'm right in the middle of significantly refactoring how loadable_variables works (https://github.com/zarr-developers/VirtualiZarr/pull/477), and if you can wait until next week these issues might just all go away.

These changes were released as part of v2.0.0.

get you earthdata login access

Access would be useful, but in general I would much rather you guys trimmed down bug reports to be pure virtualizarr with a tiny file (or synthetic data), so that access to the data or knowledge of earthaccess internals isn't a bottleneck!

TomNicholas avatar Jul 23 '25 08:07 TomNicholas

Access would be useful, but in general I would much rather you guys trimmed down bug reports to be pure virtualizarr with a tiny file (or synthetic data), so that access to the data or knowledge of earthaccess internals isn't a bottleneck!

Of course! Just wanted to offer in case it becomes handy at some point!

jbusecke avatar Jul 23 '25 13:07 jbusecke