Having a custom `engine` for `open_mfdatatree`
Hi @TomNicholas !
I am one of the core devs of satpy (https://github.com/pytroll/satpy), which makes use of xarray/dask to handle data from earth-observing satellites. In this context, we often have satellite data with different resolutions within the same dataset, so xarray's `Dataset` can't really be used, as the coords of the different variables don't match; `DataTree` makes a lot of sense for us.
The satellite data, more often than not, is in some binary format; we read it and convert it to `xarray.DataArray`s, and I have now started experimenting with placing them in a `DataTree` by hand. So it would be really nice if there were an interface for adding custom engines to read that data (multiple files). Have you already considered that? Do you maybe already have an idea of how this would work?
We have been wanting to stick closer to xarray's data model in our library, and datatree looks like something we could really use :) Let's hope we can contribute here, at least with ideas, in the future.
Hi @mraspaud , thanks so much for your interest!
> So it would be really nice if there was an interface for adding custom engines to read that data (multiple files).
Some initial thoughts:
- Can you already open one `DataArray`/`Dataset` by hand with `open_dataset`/`open_datatree` from your data format? Then you could pretty easily write your own `open` function to stack all of those into a tree. Given the prototype status of datatree, this might be the best option for now. (If not, then for an example of opening custom binary formats into xarray you might be interested in xmitgcm.)
- Can you already plug a custom engine for your data into `open_dataset`? Perhaps that interface can be extended to handle multiple files...
- The next big step for me with `DataTree` is to write a detailed design doc, and then get input from potential users like you, before rewriting datatree and eventually integrating it into xarray upstream. That would be a great point to really hash out the details of an interface for reading data from multiple files.
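To sketch the first option above: a minimal, hypothetical `open_mfdatatree`-style helper that opens each resolution group as a `Dataset` and collects them by group name. The loader here fabricates toy data instead of parsing a real binary format, and all file names and sizes are made up; with the datatree package the resulting dict maps directly onto `DataTree.from_dict`.

```python
import numpy as np
import xarray as xr

def open_resolution(path, resolution):
    # Hypothetical per-resolution loader: a real one would parse the
    # binary satellite format; here we fabricate a tiny Dataset instead.
    n = {"3000": 4, "1000": 12}[resolution]  # toy grid sizes
    data = np.zeros((n, n), dtype="uint16")
    return xr.Dataset(
        {"VIS006": (("y", "x"), data)},
        coords={"y": np.arange(n), "x": np.arange(n)},
    )

def open_mfdatatree(paths, resolutions):
    # One Dataset per resolution; each becomes a node of the tree.
    groups = {res: open_resolution(p, res)
              for p, res in zip(paths, resolutions)}
    # With the datatree package this dict maps directly onto a tree:
    #     from datatree import DataTree
    #     return DataTree.from_dict(groups)
    return groups

tree = open_mfdatatree(["seviri_hr.dat", "seviri.dat"], ["1000", "3000"])
```

Whether the grouping key is the resolution, the instrument, or something else is exactly the kind of convention question a design doc could pin down.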
Tagging @jhamman for his backends expertise too!
EDIT: Related to #51
- Yes, I have done that and it works fine. E.g.:

```
DataTree('root')
├── DataTree('3000')
│   Dimensions:  (y: 3712, x: 3712)
│   Coordinates:
│       crs      object PROJCRS["unknown",BASEGEOGCRS["unknown",DATUM["unknown",E...
│     * y        (y) float64 -5.566e+06 -5.563e+06 -5.56e+06 ... 5.566e+06 5.569e+06
│     * x        (x) float64 5.566e+06 5.563e+06 5.56e+06 ... -5.566e+06 -5.569e+06
│   Data variables:
│       VIS006   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       VIS008   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_016   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_039   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       WV_062   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       WV_073   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_087   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_097   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_108   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_120   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│       IR_134   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
│   Attributes:
│       SatelliteStatus:              {'SatelliteDefinition': {'SatelliteId': 324...
│       ImageAcquisition:             {'PlannedAcquisitionTime': {'TrueRepeatCycl...
│       CelestialEvents:              {'CelestialBodiesPosition': {'PeriodTimeSta...
│       ImageDescription:             {'ProjectionDescription': {'TypeOfProjectio...
│       RadiometricProcessing:        {'RPSummary': {'RadianceLinearization': arr...
│       GeometricProcessing:          {'OptAxisDistances': {'E-WFocalPlane': arra...
│       15TrailerVersion:             0
│       ImageProductionStats:         {'SatelliteId': 324, 'ActualScanningSummary...
│       NavigationExtractionResults:  {'ExtractedHorizons': {'HorizonId': array([...
│       RadiometricQuality:           {'L10RadQuality': {'FullImageMinimumCount':...
│       GeometricQuality:             {'AbsoluteAccuracy': {'QualityInfoValidity'...
│       TimelinessAndCompleteness:    {'Timeliness': {'MaxDelay': 20.589, 'MinDel...
└── DataTree('1000')
    Dimensions:  (y: 11136, x: 11136)
    Coordinates:
        crs      object PROJCRS["unknown",BASEGEOGCRS["unknown",DATUM["unknown",E...
      * y        (y) float64 -5.566e+06 -5.565e+06 -5.564e+06 ... 5.57e+06 5.571e+06
      * x        (x) float64 5.566e+06 5.565e+06 5.564e+06 ... -5.57e+06 -5.571e+06
    Data variables:
        HRV      (y, x) uint16 dask.array<chunksize=(464, 1804), meta=np.ndarray>
    Attributes:
        SatelliteStatus:              {'SatelliteDefinition': {'SatelliteId': 324...
        ImageAcquisition:             {'PlannedAcquisitionTime': {'TrueRepeatCycl...
        CelestialEvents:              {'CelestialBodiesPosition': {'PeriodTimeSta...
        ImageDescription:             {'ProjectionDescription': {'TypeOfProjectio...
        RadiometricProcessing:        {'RPSummary': {'RadianceLinearization': arr...
        GeometricProcessing:          {'OptAxisDistances': {'E-WFocalPlane': arra...
        15TrailerVersion:             0
        ImageProductionStats:         {'SatelliteId': 324, 'ActualScanningSummary...
        NavigationExtractionResults:  {'ExtractedHorizons': {'HorizonId': array([...
        RadiometricQuality:           {'L10RadQuality': {'FullImageMinimumCount':...
        GeometricQuality:             {'AbsoluteAccuracy': {'QualityInfoValidity'...
        TimelinessAndCompleteness:    {'Timeliness': {'MaxDelay': 20.589, 'MinDel...

Load time: 0:00:03.257307
```
- No, I haven't tested that, as most formats have multiple interdependent files, so I didn't investigate the single-file option yet.
- Sounds good, we'll be happy to provide feedback!
> - Can you already open one `DataArray`/`Dataset` by hand with `open_dataset`/`open_datatree` from your data format? Then you could pretty easily write your own `open` function to stack all of those into a tree. Given the prototype status of datatree, this might be the best option for now. (If not, then for an example of opening custom binary formats into xarray you might be interested in xmitgcm.)
+1 on this being the current recommendation. Hierarchical datasets conform to a number of semantic linking conventions and, at least at this point, I would recommend writing custom openers for each dataset/convention. I think we'll learn a lot from the implementation of these custom openers, and as @alexamici mentions in https://github.com/pydata/xarray/issues/1982, there are some emerging standards that we may be able to leverage in some generic openers.
Hey y'all (@TomNicholas), we have some custom engines for radar data in our xradar package, where we can read data using the following:
```python
import xarray as xr
import xradar

ds = xr.open_dataset("radar_file.nc", group='sweep_0', engine='cfradial1')
```
but we cannot use this engine with datatree directly yet, since it is not one of the registered engines:
```python
import datatree as dt

dt.open_datatree("radar_file.nc", engine='cfradial1')
```

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [6], line 1
----> 1 dt.open_datatree(filename, engine='cfradial1')

File ~/miniforge3/envs/xradar-dev/lib/python3.10/site-packages/datatree/io.py:60, in open_datatree(filename_or_obj, engine, **kwargs)
     58     return _open_datatree_netcdf(filename_or_obj, engine=engine, **kwargs)
     59 else:
---> 60     raise ValueError("Unsupported engine")

ValueError: Unsupported engine
```
What is the best way of adding our new engines so we can load these datasets into a datatree?
Here is a full example with our working functionality and API
Hi @mgrover1!
Quick Q: if the file is `.nc`, then what is your custom engine doing?
> What is the best way of adding our new engines so we can load these datasets into a datatree?
The most general way would be to extend xarray's backend entrypoint system to support open_datatree, but we can't do this until datatree is integrated in xarray upstream.
In the meantime I guess we could add another special case to datatree/io.py? Unless you have another suggestion?
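For context, this is roughly what plugging a custom engine into `open_dataset` looks like today via xarray's `BackendEntrypoint` API; extending it to `open_datatree` would mean adding an analogous hook. The toy backend below fabricates data rather than parsing a real file, and all names are illustrative:

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint

class ToyRadarBackend(BackendEntrypoint):
    """Illustrative backend; a real one would parse the on-disk format."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # Fabricate a small Dataset instead of reading filename_or_obj.
        ds = xr.Dataset({"reflectivity": (("range",), np.zeros(8))})
        if drop_variables:
            ds = ds.drop_vars(list(drop_variables))
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".toy")

# Third-party packages normally register such a class under the
# "xarray.backends" entry-point group in their package metadata;
# a BackendEntrypoint subclass can also be passed directly:
ds = xr.open_dataset("fake.toy", engine=ToyRadarBackend)
```

A datatree-aware version of this interface would presumably let a backend describe the group hierarchy as well, which is the part that still needs designing.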
@TomNicholas - though these files are netCDF, they are a specific flavor of netCDF (CfRadial) that has additional hierarchical metadata which we then use to parse into groups and such. Also, this is just one of the formats supported by the package; other readers include cfradial2 and odim_h5, and we plan on adding several more.