
Idea/use case: cfgrib

Open blaylockbk opened this issue 2 years ago • 6 comments

Hi Tom,

I missed your AMS talk this week because of a conflict, but I looked through the slides (thanks for posting those). Maybe I'll run into you later at AMS.

Just about all numerical weather model data is distributed in the grib format. Xarray has an engine for reading grib and grib2 files (cfgrib) that works great. One limitation of cfgrib is that when a file has variables on multiple types of levels (e.g., temperature at 2 meters, at 500 mb, and at cloud top height), cfgrib can't read the data into a single dataset; instead it returns a list of datasets when you call cfgrib.open_datasets(gribfileName).

If I understand the basics of datatree correctly, it sounds like datatree would be the better way for cfgrib to handle reading this data.

Have you looked at cfgrib and grib data before?

blaylockbk avatar Jan 10 '23 20:01 blaylockbk

Hi Brian!

> I missed your AMS talk this week because of a conflict, but I looked through the slides (thanks for posting those). Maybe I'll run into you later at AMS.

No worries - are you coming to the pangeo workshop on Friday?


> Have you looked at cfgrib and grib data before?

I have never personally used grib data, but I would be happy to help you make it work in xarray!

> One limitation with cfgrib is that when a file has variables on multiple types of levels (e.g., temperature at 2 meters, at 500 mb, and at cloud top height) cfgrib can't read the data into a single dataset, so instead it reads the data and returns a list of datasets when you do cfgrib.open_datasets(gribfileName).

Do you know how you might organise this data in terms of nested groups / nodes? If those group names can be derived from your file then this should be pretty simple. You can see how datatree handles netCDF and Zarr here.

TomNicholas avatar Jan 10 '23 21:01 TomNicholas

Here's a brief snippet of code that could act as a starting point, given the one-level-deep organization of the datasets output by cfgrib (though it could likely be cleaned up if integrated directly into cfgrib, where it could use private functions):

import cfgrib
from datatree import DataTree

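def cfgrib_open_datatree(file, **kwargs):
    # cfgrib returns one dataset per group of compatible variables
    ds_list = cfgrib.open_datasets(file, **kwargs)
    ds_dict = {}
    for ds in ds_list:
        # data_vars.values() is a view, not an iterator, so wrap it in iter()
        first_var = next(iter(ds.data_vars.values()))
        type_of_level = first_var.attrs.get("GRIB_typeOfLevel", "undef")
        # NOTE: a repeated type_of_level overwrites the earlier dataset
        ds_dict[type_of_level] = ds
    return DataTree.from_dict(ds_dict)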
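
One caveat with keying the dictionary on GRIB_typeOfLevel alone: if two datasets in the list report the same level type, the later one silently replaces the earlier one. Assuming that can happen (for instance, fields at 2 m and at 10 m above ground coming back as separate datasets), a minimal sketch of a collision-safe grouping might look like the following. The helper names `level_type`, `group_by_level_type`, and `to_datatree`, and the `_0`, `_1` suffix scheme, are hypothetical and not part of cfgrib or xarray. `to_datatree` assumes a recent xarray that exposes `xarray.DataTree`; with the standalone datatree package, `datatree.DataTree.from_dict` would be the equivalent call.

```python
from collections import Counter


def level_type(ds, default="undef"):
    """Return GRIB_typeOfLevel from the first data variable, or a default."""
    first = next(iter(ds.data_vars.values()), None)
    if first is None:  # dataset without data variables
        return default
    return first.attrs.get("GRIB_typeOfLevel", default)


def group_by_level_type(datasets):
    """Map unique group names to datasets, numbering repeated level types."""
    names = [level_type(ds) for ds in datasets]
    totals = Counter(names)  # how often each level type occurs overall
    seen = Counter()         # how many of each repeated type were already named
    groups = {}
    for name, ds in zip(names, datasets):
        if totals[name] > 1:
            # Hypothetical naming scheme: append a running index on repeats.
            key = f"{name}_{seen[name]}"
            seen[name] += 1
        else:
            key = name
        groups[key] = ds
    return groups


def to_datatree(groups):
    """Build a tree from the groups (assumes xarray provides xarray.DataTree)."""
    import xarray as xr  # imported lazily; the grouping above does not need it

    return xr.DataTree.from_dict(groups)
```

With cfgrib installed, this could be used as `to_datatree(group_by_level_type(cfgrib.open_datasets("data.grib")))`. A more descriptive suffix, such as the level value itself, would probably make nicer group names, but that depends on which coordinates each dataset carries.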

jthielen avatar Jan 10 '23 21:01 jthielen

That looks pretty neat already @jthielen ! Could we just add something like that to cfgrib?

Ideally we want this to work:

dt = open_datatree("data.grib", engine="cfgrib")

but I'm not familiar enough with xarray's backend code to know if that can be done purely with changes to cfgrib or whether it requires changes to xarray (/integration of datatree in xarray). cc @jhamman ?

TomNicholas avatar Jan 10 '23 21:01 TomNicholas

My hunch is that we could easily add a cfgrib.open_datatree() function, based on what I had, to https://github.com/ecmwf/cfgrib/blob/master/cfgrib/xarray_plugin.py to supplement/replace the existing cfgrib.open_datasets(), but supporting the backend engine would take more work (though perhaps it would only entail adding the appropriate method to BackendEntrypoint?)

jthielen avatar Jan 10 '23 21:01 jthielen

Looks great @jthielen! And so quick.

@TomNicholas, unfortunately I won't be at AMS Friday for the pangeo workshop.

blaylockbk avatar Jan 11 '23 16:01 blaylockbk

I really like the idea of supporting open_datatree(engine=...) for certain backends. We already have open_dataarray and open_dataset so this will be a natural extension to include in xarray. We will need to do some dedicated design planning to figure out how to integrate with Xarray's backends. I'm thinking that sketching this out at the Pangeo meeting on Friday may be a good use of time.

jhamman avatar Jan 11 '23 16:01 jhamman

The integration of datatree into xarray's backend entrypoint system has now been done, so if anyone wants to try making their grib reader return xarray.core.datatree.DataTree objects they can! You might also be interested in the new open_groups function (https://github.com/pydata/xarray/issues/9137).

As xarray doesn't ship a grib reader, and this should now be possible in xarray upstream, I'm going to close this in favour of cfgrib tracking this enhancement to their package.

TomNicholas avatar Sep 07 '24 22:09 TomNicholas