
Opening virtual datasets with NASA dmrpp files

ayushnag opened this issue on Jun 18, 2024 · 1 comment

The idea is to speed up opening netCDF4/HDF5 datasets with a NASA-specific optimization: load data with the convenience of xr.open_mfdataset but at kerchunk/zarr speeds, by translating existing dmr++ metadata files to zarr metadata on the fly. Much more context and discussion here.
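
To make the translation concrete: each `<dmrpp:chunk>` record in a dmr++ file stores the byte offset and length of a chunk inside the original granule, and that maps directly onto a kerchunk/zarr-style chunk reference. The variable name, URL, and byte ranges below are made up purely for illustration:

```python
# Illustrative only: a dmr++ <dmrpp:chunk> element records where each chunk
# lives inside the original netCDF4/HDF5 file on disk or in object storage.
dmrpp_chunk = {
    "offset": 4016,            # byte offset of the chunk within the granule
    "nBytes": 1048576,         # stored (compressed) size of the chunk
    "chunkPositionInArray": "[0,0]",
}

# ...which maps onto a kerchunk-style reference entry that the zarr engine
# can read: {zarr_chunk_key: [url, offset, length]}. Names/URL are made up.
kerchunk_ref = {
    "analysed_sst/0.0": [
        "s3://bucket/granule.nc",   # hypothetical granule URL
        dmrpp_chunk["offset"],
        dmrpp_chunk["nBytes"],
    ]
}
```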

virtualizarr PR for the parser here

earthaccess PR here

earthaccess additions (a rough sketch of what these steps could look like in code follows the list):

  1. Open a virtual dataset (like a view of the data that contains dimensions, variables, attrs, and chunks, but no actual data)
  2. Concatenate virtual xr.Datasets
    1. Use xarray's concatenation logic to create virtual views of netCDF files (more details in the virtualizarr documentation)
    2. Save as JSON/Parquet/in-memory dict
  3. Read netCDF/HDF5 data
    1. Use the zarr engine in xarray to load a dataset (with indexes)
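
A rough sketch of the three steps above, using virtualizarr and kerchunk-style references directly. The exact earthaccess wrapper API is still being worked out in the PRs linked above, so the calls below (`open_virtual_dataset` with `filetype="dmrpp"`, the `.virtualize.to_kerchunk` accessor, and the `reference://` read path) are assumptions based on those libraries rather than the final interface, and the file paths are hypothetical:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# 1. Open virtual datasets: only the dmr++ metadata is parsed, no array data
#    is read. indexes={} keeps the datasets purely virtual so they can be
#    combined later. Exact kwargs may differ between virtualizarr versions.
urls = ["granule_001.nc", "granule_002.nc"]  # hypothetical granule paths
vds_list = [
    open_virtual_dataset(url + ".dmrpp", filetype="dmrpp", indexes={})
    for url in urls
]

# 2. Concatenate the virtual datasets with xarray's combine logic and
#    serialize the combined chunk references to JSON (kerchunk format).
combined = xr.combine_nested(
    vds_list, concat_dim="time", coords="minimal", compat="override"
)
combined.virtualize.to_kerchunk("combined.json", format="json")

# 3. Read the actual netCDF/HDF5 bytes through xarray's zarr engine via an
#    fsspec reference filesystem pointing at the reference file produced
#    above (remote_protocol/remote_options would be needed for remote data).
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "combined.json"},
    },
)
```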

Questions/Suggestions:

Changes to the API?

  • Forwarding **kwargs to xr.combine_nested might be confusing and could be reworked if needed (see the sketch below)
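
For context, a minimal sketch of the forwarding pattern in question; `open_virtual_mfdataset_sketch` is a hypothetical stand-in, not the actual earthaccess function, and the toy datasets stand in for virtual datasets:

```python
import xarray as xr

# Hypothetical wrapper illustrating the pattern: extra keyword arguments are
# passed straight through to xr.combine_nested, so the caller has to know
# combine_nested's own parameters (concat_dim, coords, compat, ...).
def open_virtual_mfdataset_sketch(vds_list, **xr_combine_nested_kwargs):
    return xr.combine_nested(vds_list, **xr_combine_nested_kwargs)

# Toy stand-ins for virtual datasets:
vds_list = [
    xr.Dataset({"sst": ("time", [1.0])}, coords={"time": [0]}),
    xr.Dataset({"sst": ("time", [2.0])}, coords={"time": [1]}),
]
combined = open_virtual_mfdataset_sketch(
    vds_list, concat_dim="time", coords="minimal", compat="override"
)
```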

NASA datasets you want me to test?

  • There are many datasets with unique considerations, and the goal is to handle all NASA netCDF4/HDF5 datasets
  • So far I’ve tested and had success with MUR-JPL-L4-GLOB-v4.1 (netCDF), SWOT_SSH_2.0 (netCDF), and ATL-03 ICE-SAT (HDF5)
  • Feel free to test it out yourself, or add a comment with a dataset and I can take a look

Take a look at the virtualizarr parser PR and leave suggestions

  • Speed improvements?
  • The current bottleneck is creating the XML object (dmr++ files are XML) with ET.fromstring(dmr_str), since xml.etree.ElementTree needs to read the text, check that it is well-formed, and build a full element tree. I am looking into an event-based, non-validating parser like xml.parsers.expat (rough comparison below)
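
For illustration, a minimal comparison of the two approaches on a toy string. ElementTree already uses expat underneath and does not validate against a schema, so any saving would mainly come from skipping construction of the full tree by handling start-tag events directly; the dmr_str below is a made-up stand-in for a real dmr++ document:

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

# Toy dmr++-like XML, not a real dmr++ document.
dmr_str = '<Dataset><Float32 name="sst"/><Int32 name="time"/></Dataset>'

# Current approach: build a full element tree up front, then walk it.
root = ET.fromstring(dmr_str)
names_tree = [el.get("name") for el in root]

# Event-based alternative: expat fires a callback per start tag and never
# materializes a tree, avoiding the up-front tree-building cost.
names_expat = []

def start_element(tag, attrs):
    if "name" in attrs:
        names_expat.append(attrs["name"])

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(dmr_str, True)

assert names_tree == names_expat == ["sst", "time"]
```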
