Challenges with earthaccess.open_virtual_mfdataset for grouped netCDFs with nested structure
Tagging @betolink, @ayushnag, with whom I've previously discussed parts of this issue. The TEMPO data files referenced here can be downloaded here from Earthdata Search (which requires a free login account).
Overview
The TEMPO satellite instrument measures air quality, and I am working to ensure access to its data through the earthaccess.open_virtual_mfdataset() function (see this work-in-progress notebook). However, I run into several errors depending on how I try to open the granules and/or how I define a preprocessing function.
Original Data Structure – arrays are two-dimensional:
- Dimension 1) xtrack's values are identical for each granule
- Dimension 2) mirror_step's values repeat only after a "scan" of the instrument is completed (and there are multiple scans per day).
The following illustration shows the general structure, for two days with three scans per day. Granule (each granule is a netCDF file) numbers are depicted increasing from right-to-left, because the instrument scans East-to-West. Note that mirror_step is occasionally cut short in the last granule of a scan...
Desired outcome
Ideally, I'd like to open the data as a virtual dataset, concatenated in the following logical structure, with (1) the date & scan numbers consolidated to form a new dimension, and (2) the granules within each scan concatenated along the mirror_step dimension:
However, any way of opening all of the granules with earthaccess.open_virtual_mfdataset() would be a success.
Problems I've run into while experimenting
Problem 1
Combining the ManifestArrays seems to be difficult, partly because earthaccess currently processes datasets as a single list passed to xr.combine_nested(), seen here in earthaccess:
vds = xr.combine_nested(vdatasets, **xr_combine_nested_kwargs)
So, passing two concatenation dimensions, as in concat_dim=["datescan", "mirror_step"], raises an error.
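For context, here is a minimal, self-contained sketch (tiny synthetic datasets stand in for the granule datasets; names and sizes are hypothetical) of the nesting that xr.combine_nested needs when given two concat dimensions. A flat list cannot express that nesting, which is why the call fails:
import numpy as np
import xarray as xr

# Tiny synthetic stand-ins for per-granule (virtual) datasets.
def fake_granule(offset, n_mirror=4, n_xtrack=3):
    return xr.Dataset(
        {"product": (("mirror_step", "xtrack"), np.random.rand(n_mirror, n_xtrack))},
        coords={"mirror_step": offset + np.arange(n_mirror), "xtrack": np.arange(n_xtrack)},
    )

# With two concat dimensions, combine_nested expects a nested list-of-lists whose
# nesting depth matches len(concat_dim); earthaccess currently passes a flat list.
grid = [
    [fake_granule(0), fake_granule(4)],  # scan 1: granules joined along mirror_step
    [fake_granule(0), fake_granule(4)],  # scan 2: then stacked along the new "datescan" dim
]
combined = xr.combine_nested(grid, concat_dim=["datescan", "mirror_step"])
print(combined.sizes)  # expect datescan: 2, mirror_step: 8, xtrack: 3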
Problem 2
Even when opening only the granules within a single scan, or when preprocessing the datasets by assigning a new coordinate that represents date & scan (which I'm terming datescan), I run into another issue:
Here is some of the testing code I am experimenting with (e.g., in this work-in-progress notebook).
But even running xarray.combine_nested() on local files presents the same challenge:
from datetime import datetime

import earthaccess
import xarray as xr

def preprocess_root_group(ds: xr.Dataset) -> xr.Dataset:
# Add datetime with scan number as a new dimension
yearmonthday_timestamp = datetime.fromisoformat(ds.attrs["time_coverage_start"]).strftime("%Y%m%d")
datescan = f"{yearmonthday_timestamp}scan{ds.attrs['scan_num']:>02}"
return (
ds
.expand_dims(["datescan"])
.assign_coords(datescan=[datescan])
)
open_options = {
"access": "direct",
"load": True,
"preprocess": preprocess_root_group,
"concat_dim": ["datescan", "mirror_step"],
"coords": 'minimal',
"compat": 'override',
"join": 'override',
"combine_attrs": 'drop_conflicts',
"parallel": False,
}
result_root = earthaccess.open_virtual_mfdataset(
granules=results,
**open_options
)
Your visualizations and explanation are really excellent! I think this is more of an xarray/virtualizarr question, so pinging @TomNicholas and @dcherian.
I think problem 2 could be happening because mirror_step is cut short. The dimensions need to match exactly to combine the virtual references and when they don't you get: conflicting dimension sizes: {131, 132}. I have encountered this issue before and have trimmed the dataset in the preprocess step. So basically trim all granules to mirror_step: 131 and then concatenate. NaN padding all granules to mirror_step: 132 also seems doable but I have never tried it.
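For reference, a minimal sketch of what that trim could look like in a preprocess function, assuming the shortest length (131 here) is known up front. Whether this can be applied directly to ManifestArray-backed variables depends on the granule chunking, since slicing ManifestArrays is only partially supported:
import xarray as xr

TARGET_LEN = 131  # assumed shortest mirror_step length across the granules

def trim_preprocess(ds: xr.Dataset) -> xr.Dataset:
    # Drop any trailing mirror_step entries so every granule has the same length
    # before concatenation; this discards data from the longer granules.
    return ds.isel(mirror_step=slice(0, TARGET_LEN))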
Thank you for writing this up, @danielfromearth.
earthaccess currently processes datasets as a single list passed to xr.combine_nested()
You could get around this by using VirtualiZarr directly. Soon we should have a virtualizarr.open_virtual_mfdataset() function but for now you will have to just use virtualizarr.open_virtual_dataset() and xr.concat() directly.
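A rough sketch of what that could look like (using the pre-2.0 open_virtual_dataset signature; dmrpp_urls is a hypothetical, already-sorted list of .dmrpp links for one scan):
import xarray as xr
from virtualizarr import open_virtual_dataset

dmrpp_urls = [...]  # hypothetical: sorted .dmrpp links for the granules in one scan

# Build one virtual dataset per granule, then concatenate them yourself, which
# gives full control over the ordering/nesting that combine_nested would need.
vdatasets = [
    open_virtual_dataset(url, filetype="dmrpp", indexes={})
    for url in dmrpp_urls
]
vds = xr.concat(
    vdatasets,
    dim="mirror_step",
    coords="minimal",
    compat="override",
    combine_attrs="drop_conflicts",
)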
ValueError: conflicting dimension sizes
This is not solvable. VirtualiZarr (and Xarray!) relies on your data being representable by the Zarr data model without modifying data. Mismatched dimension sizes mean your data doesn't even form a hypercube of arrays (i.e. your data is "ragged"), nevermind having the consistent chunks/encoding that the Zarr model also requires. This means if you want to get your data into Icechunk you're going to have to make some modifications to the data, which means writing at least some "native chunks".
You might still be able to write most of your data as virtual chunks and only modify the chunks at the edges (e.g. in the ways @ayushnag has suggested), but you're going to have to do that in a pretty bespoke way using VirtualiZarr + Xarray to manipulate individual chunks directly.
EDIT: Another idea might be to put the inconsistent Scan 3 data into a separate group, but I don't know if that will work for you (you don't want hundreds of groups).
Thanks for the responses!
Based on the above, what seems to me like promising directions are:
Option 1 - modifying the edges
From @ayushnag,
NaN padding all granules to mirror_step: 132 also seems doable but I have never tried it.
In theory, that seems like it could be the ideal way to go! However, from https://docs.xarray.dev/en/stable/generated/xarray.Dataset.pad.html, I wonder: (1) the documentation says Dataset.pad is "experimental"—so what does that mean?—and (2) can that be applied in this case, i.e., applied to ManifestArrays? Not sure who would know the answer to this... @dcherian?
@TomNicholas, you said
...but you're going to have to do that in a pretty bespoke way using VirtualiZarr + Xarray to manipulate individual chunks directly.
How might that work? Do you envision using Dataset.pad, or something else?
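For concreteness, here is roughly what the NaN padding would look like on an ordinary, loaded xarray Dataset (132 is assumed to be the full granule length); whether the same call can be made against ManifestArrays is exactly the open question above:
import xarray as xr

FULL_LEN = 132  # assumed mirror_step length of a complete granule

def pad_preprocess(ds: xr.Dataset) -> xr.Dataset:
    # Pad the trailing edge of mirror_step so every granule reaches FULL_LEN.
    # The default constant padding fills with NaN (promoting integer variables
    # to float), which matches the "NaN padding" idea above.
    n_missing = FULL_LEN - ds.sizes["mirror_step"]
    return ds.pad(mirror_step=(0, n_missing)) if n_missing else ds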
Option 2 - separate group
From @TomNicholas,
Another idea might be to put the inconsistent Scan 3 data into a separate group, but I don't know if that will work for you (you don't want hundreds of groups).
I don't understand how that would work — do you mean creating a separate group when opening with VirtualiZarr, or do you mean changing the original netCDF file structure? I don't think the latter would work, because the netCDF structure is fixed.
Option 3 - dummy index
A third idea, with a focus on simply making the files openable, may be to discard the idea of representing the intuitive 3D arrays and instead consider whether there would be a way to replace mirror_step with a dummy index. Might that be doable? I'm imagining it as a 2D virtual array, expanding on the dummy index—something like what's depicted here:
Other notes:
1. Changing the interaction with VirtualiZarr
From @TomNicholas,
You could get around this by using VirtualiZarr directly. Soon we should have a virtualizarr.open_virtual_mfdataset() function but for now you will have to just use virtualizarr.open_virtual_dataset() and xr.concat() directly.
@ayushnag, I think we would need to modify the logic and API for earthaccess.open_virtual_mfdataset() to enable this sort of interaction with VirtualiZarr. If we can determine a way to make this overall use case work, perhaps we could talk further about whether such changes to earthaccess would make sense.
2. Trimming the edges
Another option, mentioned by @ayushnag,
I have encountered this issue before and have trimmed the dataset in the preprocess step. So basically trim all granules to mirror_step: 131 and then concatenate
That seems less preferable in this case, because it would discard a sizable amount of valid measurement data, which we want to avoid.
Some top-level comments.
That seems less preferable in this case, because it would discard a sizable amount of valid measurement data, which we want to avoid.
The fundamental issue here is that this data is not (yet) a consistent level 3 datacube. You cannot provide virtual access to the whole thing as a single datacube without modification if the original data does not form a datacube.
I think we would need to modify the logic and API for earthaccess.open_virtual_mfdataset() to enable this sort of interaction with VirtualiZarr.
This is an entirely orthogonal issue. You're not going to be able to paper over these limitations of the zarr data model through changes to the earthaccess API.
Now let's discuss the various ways you could modify the data you present to earthaccess users to make it into a valid datacube.
the documentation says Dataset.pad is "experimental"—so what does that mean?
Good question. Not great that we have apparently had that say "experimental" for 4 years...
Nevertheless, by following git blame on the docstring, I found this issue. There, it sounds like the experimentalness is focused on questions about how padding xarray indexes should work. But Indexes don't get serialized to the store (they are recreated by decoding metadata upon opening), so I think this experimentalness needn't concern you.
can that be applied in this case, i.e., applied to ManifestArrays?
Not to ManifestArrays, at least not yet. It would likely require a new codec - see https://github.com/zarr-developers/VirtualiZarr/issues/22.
...but you're going to have to do that in a pretty bespoke way using VirtualiZarr + Xarray to manipulate individual chunks directly.
How might that work? Do you envision using Dataset.pad, or something else?
Whilst you cannot concatenate a ManifestArray and numpy array together in memory, with Icechunk you could store virtual and non-virtual chunks side-by-side in the same array. You would need to use Xarray and VirtualiZarr separately to write to different regions of the same zarr array in icechunk. You can do whatever manipulations you like to the non-virtual chunks, so yes you could use Dataset.pad to adjust those. @abarciauskas-bgse has recently done this mixing of virtual and non-virtual chunks to get around messiness when virtualizing the GPM-IMERG data.
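Very roughly, and with the caveat that the exact API names (Repository.create, writable_session, the .virtualize.to_icechunk accessor) follow virtualizarr 1.x / early icechunk and may have changed, the mixing could look something like this; vds (a virtual dataset for the complete granules) and ds_edge (a loaded, padded dataset for the short granule) are assumed to exist, and the region bounds are purely illustrative:
import icechunk

# Create (or open) an icechunk repo and a writable session.
repo = icechunk.Repository.create(icechunk.in_memory_storage())
session = repo.writable_session("main")

# 1) Write the well-behaved part of the cube as virtual references.
vds.virtualize.to_icechunk(session.store)

# 2) Write the padded edge granule as ordinary (native) chunks into the region
#    it occupies along mirror_step; getting shapes and regions to line up takes care.
ds_edge.to_zarr(session.store, region={"mirror_step": slice(262, 394)})

session.commit("mix virtual and native chunks for TEMPO")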
do you mean creating a separate group when opening with VirtualiZarr, or do you mean changing the original netCDF file structure?
Every suggestion I am making here assumes you have no power to change the original netCDF files. But you can still make arbitrary rearrangements to the metadata in the virtual store (because it's duplicated in the virtual store), including which chunks are referred to from which arrays (so for example you could rename every variable and put them in a different group in the virtual store). Also you can put some non-virtual chunks in an icechunk store as described above. But that comes with the downside of duplicating the original data for those non-virtual chunks.
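If the separate-group route were taken, the arrangement could be sketched with xarray's DataTree (available in recent xarray); vds_main and vds_scan3 are hypothetical, separately concatenated virtual datasets, and whether the downstream virtual writer supports groups is a separate question:
import xarray as xr

# Keep the consistent scans in the root group and park the short Scan 3 granules
# in their own group, so each group is a self-consistent hypercube.
tree = xr.DataTree.from_dict({
    "/": vds_main,
    "/scan3_partial": vds_scan3,
})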
discard the idea of representing the intuitive 3D arrays and instead consider whether there would be a way to replace mirror_step with a dummy index
So basically you create a new "dummy" dimension and concatenate along that? That could work, but your arrays still need to have consistent shapes, chunks, dtypes and codecs along that new dimension.
@TomNicholas, following along, although this topic dances along the edge of my understanding...
Regarding Dataset.pad
can that [Dataset.pad] be applied in this case, i.e., applied to ManifestArrays?
Not to ManifestArrays, at least not yet. It would likely require a new codec - see https://github.com/zarr-developers/VirtualiZarr/issues/22.
I don't understand the context of "codecs", but this #22 issue does seem to me like it's tackling the same problem. If a new codec were developed, would that eliminate the need for manipulating a ManifestArray alongside numpy arrays or manipulating individual chunks? And if a new codec is the solution, then... what's involved in developing a new codec? (Sounds like a huge effort.)
Regarding concatenating different type chunks
Whilst you cannot concatenate a ManifestArray and numpy array together in memory, with Icechunk you could store virtual and non-virtual chunks side-by-side in the same array. You would need to use Xarray and VirtualiZarr separately to write to different regions of the same zarr array in icechunk. You can do whatever manipulations you like to the non-virtual chunks, so yes you could use Dataset.pad to adjust those. @abarciauskas-bgse has recently done this mixing of virtual and non-virtual chunks to get around messiness when virtualizing the GPM-IMERG data.
This sounds quite messy! And you would lose some of the efficiencies of using virtual chunks, correct?
Regarding concatenating along a dummy index
discard the idea of representing the intuitive 3D arrays and instead consider whether there would be a way to replace mirror_step with a dummy index.
So basically you create a new "dummy" dimension and concatenate along that? That could work, but your arrays still need to have consistent shapes, chunks, dtypes and codecs along that new dimension.
Is this because of the still-open variable-length chunk issue?
If the data are being concatenated along the mirror_step dimension—which is the dimension that has different lengths—then you'd consider it a "variable-length" chunk problem as opposed to a "ragged" datacube problem, right?
Update
I found a way to concatenate the netCDF granules on my local machine, although it's not so elegant. It follows the "Option 3 - dummy index" approach from one of my previous comments:
from datetime import datetime

import numpy as np
import xarray as xr

def preprocess_root_group(ds: xr.Dataset) -> xr.Dataset:
"""Add datetime with scan number and mirror_step as a new dimension and coordinate.
Datetime, scan number (padded to 2 digits), and mirror_step (padded to 4 digits with leading zeros)
are combined into a single integer.
For example, the new coordinate's value for date==20250311, scan_num==10, mirror_step==1
will be 20250311100001.
"""
yearmonthday_timestamp = datetime.fromisoformat(ds.attrs["time_coverage_start"]).strftime("%Y%m%d")
datescan = f"{yearmonthday_timestamp}{ds.attrs['scan_num']:>02}"
datescan_mirror_step = np.char.add(datescan, np.char.mod('%04d', ds.mirror_step)).astype(int)
return (
ds
.rename_dims({"mirror_step": "datescan_mirror_step"})
.assign_coords({"datescan_mirror_step": ("mirror_step", datescan_mirror_step)})
.swap_dims({"mirror_step": "datescan_mirror_step"})
)
open_options = {
"preprocess": preprocess_root_group,
"combine": 'nested',
"coords": 'all',
"combine_attrs": 'drop_conflicts'
}
combined_dataset_root = xr.open_mfdataset(sorted(filepaths), **open_options)
A partial success
Here is the concatenated Dataset, with the newly created datescan_mirror_step dimension:
Continuing issues when attempting to do this with ManifestArrays
However, when running this through earthaccess and VirtualiZarr, this error gets raised:
which, according to the traceback, happens on this line:
new_coords_value = np.char.add(datescan, np.char.mod('%04d', ds.mirror_step)).astype(int)
Here is the full traceback:
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Cell In[5], line 29
12 return (
13 ds
14 .rename_dims({"mirror_step": "datescan_mirror_step"})
(...)
17 .assign_coords({"scan_number": ("datescan_mirror_step", scan_num_coord)})
18 )
20 open_options = {
21 "access": "direct",
22 "load": True,
(...)
27 "parallel": False,
28 }
---> 29 result_root = earthaccess.open_virtual_mfdataset(
30 granules=results,
31 **open_options
32 )
File /srv/conda/envs/notebook/lib/python3.11/site-packages/earthaccess/dmrpp_zarr.py:121, in open_virtual_mfdataset(granules, group, access, load, preprocess, parallel, **xr_combine_nested_kwargs)
111 vdatasets.append(
112 open_(
113 filepath=g.data_links(access=access)[0] + ".dmrpp",
(...)
118 )
119 )
120 if preprocess is not None:
--> 121 vdatasets = [preprocess(ds) for ds in vdatasets]
122 if parallel:
123 vdatasets = dask.compute(vdatasets)[0] # type: ignore
File /srv/conda/envs/notebook/lib/python3.11/site-packages/earthaccess/dmrpp_zarr.py:121, in <listcomp>(.0)
111 vdatasets.append(
112 open_(
113 filepath=g.data_links(access=access)[0] + ".dmrpp",
(...)
118 )
119 )
120 if preprocess is not None:
--> 121 vdatasets = [preprocess(ds) for ds in vdatasets]
122 if parallel:
123 vdatasets = dask.compute(vdatasets)[0] # type: ignore
Cell In[5], line 7, in preprocess_root_group(ds)
3 yearmonthday_timestamp = datetime.fromisoformat(ds.attrs["time_coverage_start"]).strftime("%Y%m%d")
5 datescan = f"{yearmonthday_timestamp}{ds.attrs['scan_num']:>02}"
----> 7 new_coords_value = np.char.add(datescan, np.char.mod('%04d', ds.mirror_step)).astype(int)
9 # print(new_coords_value)
10 scan_num_coord = np.full_like(new_coords_value, ds.attrs['scan_num']).astype(int)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/numpy/_core/strings.py:194, in mod(a, values)
173 def mod(a, values):
174 """
175 Return (a % i), that is pre-Python 2.6 string formatting
176 (interpolation), element-wise for a pair of array_likes of str
(...)
191
192 """
193 return _to_bytes_or_str_array(
--> 194 _vec_string(a, np.object_, '__mod__', (values,)), a)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/common.py:180, in AbstractArray.__array__(self, dtype, copy)
178 except TypeError:
179 copy = False
--> 180 return np.array(self.values, dtype=dtype, copy=copy)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/dataarray.py:814, in DataArray.values(self)
801 @property
802 def values(self) -> np.ndarray:
803 """
804 The array's data converted to numpy.ndarray.
805
(...)
812 to this array may be reflected in the DataArray as well.
813 """
--> 814 return self.variable.values
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/variable.py:566, in Variable.values(self)
563 @property
564 def values(self) -> np.ndarray:
565 """The variable's data as a numpy.ndarray"""
--> 566 return _as_array_or_item(self._data)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/variable.py:363, in _as_array_or_item(data)
349 def _as_array_or_item(data):
350 """Return the given values as a numpy array, or as an individual item if
351 it's a 0d datetime64 or timedelta64 array.
352
(...)
361 TODO: remove this (replace with np.asarray) once these issues are fixed
362 """
--> 363 data = np.asarray(data)
364 if data.ndim == 0:
365 if data.dtype.kind == "M":
File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/manifests/array.py:122, in ManifestArray.__array__(self, dtype)
121 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
--> 122 raise NotImplementedError(
123 "ManifestArrays can't be converted into numpy arrays or pandas Index objects"
124 )
NotImplementedError: ManifestArrays can't be converted into numpy arrays or pandas Index objects
@TomNicholas, @ayushnag, thoughts on how to get around this? Or ways to go from "NotImplemented" to "Implemented"? 😊
It's unclear from your error report where virtualizarr is being invoked. But I suspect you could get past the error by adding whatever variable that error was raised on to virtualizarr's loadable_variables kwarg.
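In pure-virtualizarr terms, the suggestion would look roughly like this (pre-2.0 open_virtual_dataset signature; the .dmrpp path is a hypothetical placeholder): mirror_step is loaded eagerly as a real numpy-backed variable, so the string manipulation in the preprocess function has concrete values to operate on, while everything else stays virtual.
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset(
    "TEMPO_granule.nc.dmrpp",   # hypothetical path
    filetype="dmrpp",
    loadable_variables=["mirror_step"],
    indexes={},
)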
@TomNicholas, thanks for the suggestion!
New error
I tried setting mirror_step as a loadable variable for this use case involving earthaccess. However, there is another difficulty, seemingly because of the separation of the DMR++ metadata from the original netCDF files. When mirror_step is set with loadable_variables, it disappears from the Dataset object...
<xarray.Dataset> Size: 8kB
Dimensions: (xtrack: 2048)
Coordinates:
xtrack (xtrack) int32 8kB ManifestArray<shape=(2048,), dtype=int32, chu...
Data variables:
*empty*
...and this error is raised:
Is this use of loadable_variables unworkable then for this DMR++ case, or is there a way to still load the variable?
(The code I changed for this attempt is in the following "Details"...)
Here is the code I changed, with the loadable_variables kwarg added to the earthaccess code on this line:
# Get list of virtual datasets (or dask delayed objects)
for g in granules:
vdatasets.append(
open_(
filepath=g.data_links(access=access)[0] + ".dmrpp",
filetype="dmrpp", # type: ignore
group=group,
indexes={},
reader_options={"storage_options": fs.storage_options},
loadable_variables=["mirror_step"]
)
)
Is this use of loadable_variables unworkable then for this DMR++ case, or is there a way to still load the variable?
I don't immediately know, sorry.
Generally there isn't enough information here for me to quickly reproduce your issue. I would really need an MCVE that doesn't require me to run earthaccess.
But also I'm right in the middle of significantly refactoring how loadable_variables works (https://github.com/zarr-developers/VirtualiZarr/pull/477), and if you can wait until next week these issues might just all go away.
Just commenting here to signal intent and subscribe. I might soon be working on SWOT L2 data, which is also served in groups (though I am not yet clear on whether things are actually ragged there), but nonetheless I am curious to experiment with this and potentially bring some more general functionality downstream to earthaccess.
Also, FWIW @TomNicholas, I'd be happy to sync some time to get you earthdata login access if that would help here.
But also I'm right in the middle of significantly refactoring how loadable_variables works (https://github.com/zarr-developers/VirtualiZarr/pull/477), and if you can wait until next week these issues might just all go away.
These changes were released as part of v2.0.0.
get you earthdata login access
Access would be useful, but in general I would much rather you guys trimmed down bug reports to be pure virtualizarr with a tiny file (or synthetic data), so that access to the data or knowledge of earthaccess internals isn't a bottleneck!
Access would be useful, but in general I would much rather you guys trimmed down bug reports to be pure virtualizarr with a tiny file (or synthetic data), so that access to the data or knowledge of earthaccess internals isn't a bottleneck!
Of course! Just wanted to offer in case it becomes handy at some point!