earthaccess
Opening virtual datasets with NASA dmrpp files
The idea is to speed up opening netcdf4/hdf5 datasets with a NASA-specific optimization: load data like `xr.open_mfdataset`, at kerchunk/zarr speeds, by translating existing dmr++ metadata files to zarr metadata on the fly. Much more context and discussion here.
- `virtualizarr`: PR for the parser here
- `earthaccess`: PR here

`earthaccess` additions:
- Open a virtual dataset (like a view of the data that contains dimensions, variables, attrs, and chunks, but no actual data)
- Concatenate virtual xr.Datasets
  - Use `xarray`'s concatenation logic to create virtual views of netcdf's (more details in the `virtualizarr` documentation)
- Save as json/parquet/in-memory dict
- Read netcdf/hdf5 data
  - Use the `zarr` engine in `xarray` to load a dataset (with indexes)
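The additions above could be used together roughly as follows. This is a minimal sketch, not the final API: the function name `open_virtual_dataset` and its kwargs follow the linked PR and may differ in a released version, while `virtualize.to_kerchunk` and the `zarr`-engine read-back come from `virtualizarr`/`xarray` usage patterns.

```python
def virtual_workflow(short_name="MUR-JPL-L4-GLOB-v4.1", out="combined.json"):
    """Sketch of the end-to-end workflow; API names are assumptions
    based on the linked PRs, not a guaranteed interface."""
    import earthaccess
    import xarray as xr

    earthaccess.login()
    results = earthaccess.search_data(short_name=short_name, count=4)

    # 1. Open each granule as a virtual dataset: dimensions, attrs, and
    #    chunk byte ranges come from the sidecar .dmrpp file; no data
    #    bytes are downloaded.
    vds = [earthaccess.open_virtual_dataset(g, load=False) for g in results]

    # 2. Concatenate the virtual views with xarray's combine logic.
    combined = xr.combine_nested(
        vds, concat_dim="time", coords="minimal", compat="override"
    )

    # 3. Persist the references as kerchunk-style json.
    combined.virtualize.to_kerchunk(out, format="json")

    # 4. Later: open the references lazily through the zarr engine.
    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": out, "remote_protocol": "https"},
        },
    )
```

Running this requires Earthdata credentials and network access, so it is shown here only as a shape for the workflow.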
Questions/Suggestions:

Changes to the API?
- `xr.combine_nested` kwargs might be confusing and could be reworked if needed
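Part of the confusion is that `xr.combine_nested` expects a nested list-of-lists whose nesting depth matches the number of entries in `concat_dim`. A small hypothetical helper (not part of `earthaccess`) makes the required shape concrete:

```python
def nest(flat, inner):
    """Group a flat list of (virtual) datasets into the list-of-lists that
    xr.combine_nested expects for a two-dimensional concat: `inner` items
    per inner list. Hypothetical helper for illustration only."""
    if len(flat) % inner:
        raise ValueError("flat list length must be a multiple of `inner`")
    return [flat[i:i + inner] for i in range(0, len(flat), inner)]

# Six granules tiled 2 x 3, e.g. for concat_dim=["time", "x"]:
grid = nest(["g0", "g1", "g2", "g3", "g4", "g5"], inner=3)
# grid == [["g0", "g1", "g2"], ["g3", "g4", "g5"]]
```

With this shape, `xr.combine_nested(grid, concat_dim=["time", "x"])` concatenates the outer lists along the first dimension and each inner list along the second, since `concat_dim` is ordered from outermost to innermost nesting level.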
NASA datasets you want me to test?
- Many have unique considerations, and the goal is to handle all NASA netcdf4/hdf5 datasets
- So far I've tested and had success with `MUR-JPL-L4-GLOB-v4.1` (netcdf), `SWOT_SSH_2.0` (netcdf), and ICESat-2 `ATL-03` (hdf5)
- Feel free to test it out yourself, or add a comment with a dataset and I can take a look

Take a look at the virtualizarr parser PR and leave suggestions
- Speed improvements?
  - The current bottleneck is creating the XML object (dmrpp files are XML) with `ET.fromstring(dmr_str)`, since the `xml.etree.ElementTree` library needs to read the text, check the XML, and build a full parsable tree. I am looking into a non-validating, event-based parser like `xml.parsers.expat`.
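To illustrate the difference, here is a stdlib-only sketch contrasting the tree-building `ET.fromstring` approach with an expat callback parse on a toy dmrpp-like snippet. The element and attribute names (`chunk`, `offset`, `nBytes`) are a simplified stand-in for the real dmr++ schema, not its exact structure.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

# A tiny stand-in for a dmr++ document (real dmrpp XML is richer).
DMR = """<Dataset name="example">
  <Float32 name="sst">
    <chunk offset="4000" nBytes="1024"/>
    <chunk offset="5024" nBytes="1024"/>
  </Float32>
</Dataset>"""

# Tree-based: fromstring materialises the whole document as a tree,
# then we walk it for chunk byte ranges.
root = ET.fromstring(DMR)
tree_chunks = [(c.get("offset"), c.get("nBytes")) for c in root.iter("chunk")]

# Event-based: expat fires a callback per start tag and builds no tree,
# which is the lower-overhead, non-validating parse mentioned above.
chunks = []
parser = xml.parsers.expat.ParserCreate()

def start(name, attrs):
    if name == "chunk":
        chunks.append((attrs["offset"], attrs["nBytes"]))

parser.StartElementHandler = start
parser.Parse(DMR, True)

assert chunks == tree_chunks  # both find the same chunk byte ranges
```

Since the parser only ever needs offsets, sizes, and variable metadata, skipping tree construction entirely could plausibly cut a large share of the parse time, though that would need to be benchmarked on real dmrpp files.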