Having a custom `engine` for `open_mfdatatree`
Hi @TomNicholas !
I am one of the core devs of satpy (https://github.com/pytroll/satpy), which makes use of xarray/dask to handle data from earth-observing satellites. In this context, we often have satellite data with different resolutions within the same dataset, so xarray's `Dataset` can't really be used, as the coords of the different variables don't match; `DataTree` makes a lot of sense for us.
The satellite data, more often than not, is in some binary format; we read it and convert it to `xarray.DataArray`s, and I have now started experimenting with placing them in a `DataTree` by hand. So it would be really nice if there were an interface for adding custom engines to read that data (multiple files). Have you already considered that? Do you maybe already have an idea of how this would work?
We have been wanting to stick closer to xarray's data model in our library, and datatree looks like something we could really use :) Let's hope we can contribute here, at least with ideas, in the future.
Hi @mraspaud , thanks so much for your interest!
> So it would be really nice if there was an interface for adding custom engines to read that data (multiple files).
Some initial thoughts:
- Can you already open one `DataArray`/`Dataset` by hand with `open_dataset`/`open_datatree` from your data format? Then you could pretty easily write your own `open` function to stack all of those into a tree. Given the prototype status of datatree, this might be the best option for now. (If not, then for an example of opening custom binary formats into xarray you might be interested in xmitgcm.)
- Can you already plug a custom engine for your data into `open_dataset`? Perhaps that interface can be extended to handle multiple files...
- The next big step for me with `DataTree` is to write a detailed design doc, and then get input from potential users like you, before rewriting datatree and eventually integrating it into xarray upstream. That would be a great point to really hash out the details of an interface for reading data from multiple files.
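To sketch the first option above: a minimal, hypothetical `open_mfdatatree`-style helper that opens each resolution group as a `Dataset` and collects them by group name. The loader here fabricates toy data instead of parsing a real binary format, and all file names and sizes are made up; with the datatree package the resulting dict maps directly onto `DataTree.from_dict`.

```python
import numpy as np
import xarray as xr

def open_resolution(path, resolution):
    # Hypothetical per-resolution loader: a real one would parse the
    # binary satellite format; here we fabricate a tiny Dataset instead.
    n = {"3000": 4, "1000": 12}[resolution]  # toy grid sizes
    data = np.zeros((n, n), dtype="uint16")
    return xr.Dataset(
        {"VIS006": (("y", "x"), data)},
        coords={"y": np.arange(n), "x": np.arange(n)},
    )

def open_mfdatatree(paths, resolutions):
    # One Dataset per resolution; each becomes a node of the tree.
    groups = {res: open_resolution(p, res)
              for p, res in zip(paths, resolutions)}
    # With the datatree package this dict maps directly onto a tree:
    #     from datatree import DataTree
    #     return DataTree.from_dict(groups)
    return groups

tree = open_mfdatatree(["seviri_hr.dat", "seviri.dat"], ["1000", "3000"])
```

Whether the grouping key is the resolution, the instrument, or something else is exactly the kind of convention question a design doc could pin down.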
Tagging @jhamman for his backends expertise too!
EDIT: Related to #51
- Yes, I have done that and it works fine. E.g.:

```
DataTree('root')
├── DataTree('3000')
│   Dimensions:  (y: 3712, x: 3712)
│   Coordinates:
│       crs      object PROJCRS["unknown",BASEGEOGCRS["unknown",DATUM["unknown",E...
│     * y        (y) float64 -5.566e+06 -5.563e+06 -5.56e+06 ... 5.566e+06 5.569e+06
│     * x        (x) float64 5.566e+06 5.563e+06 5.56e+06 ... -5.566e+06 -5.569e+06
│   Data variables:
│       VIS006   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       VIS008   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_016   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_039   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       WV_062   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       WV_073   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_087   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_097   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_108   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_120   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_134   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│   Attributes:
│       SatelliteStatus:              {'SatelliteDefinition': {'SatelliteId': 324...
│       ImageAcquisition:             {'PlannedAcquisitionTime': {'TrueRepeatCycl...
│       CelestialEvents:              {'CelestialBodiesPosition': {'PeriodTimeSta...
│       ImageDescription:             {'ProjectionDescription': {'TypeOfProjectio...
│       RadiometricProcessing:        {'RPSummary': {'RadianceLinearization': arr...
│       GeometricProcessing:          {'OptAxisDistances': {'E-WFocalPlane': arra...
│       15TrailerVersion:             0
│       ImageProductionStats:         {'SatelliteId': 324, 'ActualScanningSummary...
│       NavigationExtractionResults:  {'ExtractedHorizons': {'HorizonId': array([...
│       RadiometricQuality:           {'L10RadQuality': {'FullImageMinimumCount':...
│       GeometricQuality:             {'AbsoluteAccuracy': {'QualityInfoValidity'...
│       TimelinessAndCompleteness:    {'Timeliness': {'MaxDelay': 20.589, 'MinDel...
└── DataTree('1000')
    Dimensions:  (y: 11136, x: 11136)
    Coordinates:
        crs      object PROJCRS["unknown",BASEGEOGCRS["unknown",DATUM["unknown",E...
      * y        (y) float64 -5.566e+06 -5.565e+06 -5.564e+06 ... 5.57e+06 5.571e+06
      * x        (x) float64 5.566e+06 5.565e+06 5.564e+06 ... -5.57e+06 -5.571e+06
    Data variables:
        HRV      (y, x) uint16 dask.array<chunksize=(464, 1804), meta=np.ndarray>
    Attributes:
        SatelliteStatus:              {'SatelliteDefinition': {'SatelliteId': 324...
        ImageAcquisition:             {'PlannedAcquisitionTime': {'TrueRepeatCycl...
        CelestialEvents:              {'CelestialBodiesPosition': {'PeriodTimeSta...
        ImageDescription:             {'ProjectionDescription': {'TypeOfProjectio...
        RadiometricProcessing:        {'RPSummary': {'RadianceLinearization': arr...
        GeometricProcessing:          {'OptAxisDistances': {'E-WFocalPlane': arra...
        15TrailerVersion:             0
        ImageProductionStats:         {'SatelliteId': 324, 'ActualScanningSummary...
        NavigationExtractionResults:  {'ExtractedHorizons': {'HorizonId': array([...
        RadiometricQuality:           {'L10RadQuality': {'FullImageMinimumCount':...
        GeometricQuality:             {'AbsoluteAccuracy': {'QualityInfoValidity'...
        TimelinessAndCompleteness:    {'Timeliness': {'MaxDelay': 20.589, 'MinDel...

Load time: 0:00:03.257307
```
- No, I haven't tested that, as most formats have multiple interdependent files, so I didn't investigate the single-file option yet.
- Sounds good, we'll be happy to provide feedback!
> - Can you already open one `DataArray`/`Dataset` by hand with `open_dataset`/`open_datatree` from your data format? Then you could pretty easily write your own `open` function to stack all of those into a tree. Given the prototype status of datatree, this might be the best option for now. (If not, then for an example of opening custom binary formats into xarray you might be interested in xmitgcm.)
+1 on this being the current recommendation. Hierarchical datasets conform to a number of semantic linking conventions and, at least at this point, I would recommend writing custom openers for each dataset/convention. I think we'll learn a lot from the implementation of these custom openers, and as @alexamici mentions in https://github.com/pydata/xarray/issues/1982, there are some emerging standards that we may be able to leverage in some generic openers.
Hey y'all (@TomNicholas), we have some custom engines for radar data in our xradar package, where we can read data using the following:
```python
import xarray as xr
import xradar

ds = xr.open_dataset("radar_file.nc", group='sweep_0', engine='cfradial1')
```
but we cannot use this engine with datatree directly yet, since it is not one of the registered engines:
```python
import datatree as dt

dt.open_datatree("radar_file.nc", engine='cfradial1')
```

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [6], line 1
----> 1 dt.open_datatree(filename, engine='cfradial1')

File ~/miniforge3/envs/xradar-dev/lib/python3.10/site-packages/datatree/io.py:60, in open_datatree(filename_or_obj, engine, **kwargs)
     58     return _open_datatree_netcdf(filename_or_obj, engine=engine, **kwargs)
     59 else:
---> 60     raise ValueError("Unsupported engine")

ValueError: Unsupported engine
```
What is the best way of adding our new engines so we can load these datasets into a datatree?
Here is a full example with our working functionality and API
Hi @mgrover1!
Quick Q: if the file is `.nc`, then what is your custom engine doing?
> What is the best way of adding our new engines so we can load these datasets into a datatree?
The most general way would be to extend xarray's backend entrypoint system to support open_datatree, but we can't do this until datatree is integrated in xarray upstream.
In the meantime I guess we could add another special case to datatree/io.py? Unless you have another suggestion?
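For context, this is roughly what plugging a custom engine into `open_dataset` looks like today via xarray's `BackendEntrypoint` API; extending it to `open_datatree` would mean adding an analogous hook. The toy backend below fabricates data rather than parsing a real file, and all names are illustrative:

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint

class ToyRadarBackend(BackendEntrypoint):
    """Illustrative backend; a real one would parse the on-disk format."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # Fabricate a small Dataset instead of reading filename_or_obj.
        ds = xr.Dataset({"reflectivity": (("range",), np.zeros(8))})
        if drop_variables:
            ds = ds.drop_vars(list(drop_variables))
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".toy")

# Third-party packages normally register such a class under the
# "xarray.backends" entry-point group in their package metadata;
# a BackendEntrypoint subclass can also be passed directly:
ds = xr.open_dataset("fake.toy", engine=ToyRadarBackend)
```

A datatree-aware version of this interface would presumably let a backend describe the group hierarchy as well, which is the part that still needs designing.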
@TomNicholas - though these files are netCDF, they are a specific flavor of netCDF (CfRadial) that has additional hierarchical metadata which we then use to parse into groups and such. Also, this is just one of the formats supported by the package; other readers include cfradial2 and odim_h5, and we plan on adding several more.