pangeo-forge-recipes

How to handle data with mixtures of Grib 1 and Grib 2?

Open alxmrs opened this issue 2 years ago • 4 comments

I'm running the XarrayZarrRecipe on an internal ERA5 dataset. I just found out that it mixes the GRIB 1 and GRIB 2 standards within the same files. The simplest way I can see to convert the corpus to Zarr would involve filtering out some of the data (e.g. https://github.com/ecmwf/cfgrib/issues/2). Because of the way cfgrib works with xarray, getting all the variables means calling open_dataset on the same file several times with different filter_by_keys arguments.
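
For concreteness, here is a minimal sketch of that "one open_dataset call per filter" pattern. It assumes the two halves of a file can be separated on the `edition` key; that choice of key, the helper names, and the `EDITION_FILTERS` list are illustrative assumptions, not something confirmed for this dataset. The `opener` argument exists only so the loop can be exercised without a GRIB file at hand.

```python
from typing import Any, Callable, Dict, List


def _cfgrib_opener(path: str, **kwargs: Any) -> Any:
    # Imported lazily so the helper below can be tried without xarray/cfgrib.
    import xarray as xr

    return xr.open_dataset(path, engine="cfgrib", **kwargs)


def open_grib_subsets(
    path: str,
    filters: List[Dict[str, Any]],
    opener: Callable[..., Any] = _cfgrib_opener,
) -> List[Any]:
    """Open `path` once per filter and return one dataset per filter."""
    return [
        opener(path, backend_kwargs={"filter_by_keys": dict(f)})
        for f in filters
    ]


# Assumption: one pass per GRIB edition is enough to avoid conflicts.
EDITION_FILTERS = [{"edition": 1}, {"edition": 2}]
```

Calling `open_grib_subsets("some_file.grib", EDITION_FILTERS)` would then return two datasets that still have to be reconciled before writing to Zarr. Other keys (for example `typeOfLevel` or `shortName`) may be needed if the conflicts are not purely between editions.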

Is there a clean way to work with mixed-variable GRIB files in pangeo-forge today? If not, should we update the recipe to handle this use case?

xref:

  • https://github.com/pangeo-forge/staged-recipes/issues/92
  • https://github.com/pangeo-forge/staged-recipes/issues/22

CC: @rabernat @cisaacstern

alxmrs avatar Nov 24 '21 00:11 alxmrs

@alxmrs, as you're probably aware, XarrayZarrRecipe.xarray_open_kwargs allows passing arguments to open_dataset

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/e6fdf876bbb5e1b342b235510ec438599c96b8b8/pangeo_forge_recipes/recipes/xarray_zarr.py#L293-L297

but currently these kwargs are applied uniformly across all inputs. (So there is no way to vary filter_by_keys here, of course.)

I don't have first-hand experience with filter_by_keys for GRIB. But it seems like you might be able to achieve the same result by loading the whole dataset via open_dataset (so no filter_by_keys kwarg) and then conditionally dropping variables with XarrayZarrRecipe.process_input, which is a callable with the following signature

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/e6fdf876bbb5e1b342b235510ec438599c96b8b8/pangeo_forge_recipes/recipes/xarray_zarr.py#L657-L658

that is applied to every input

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/e6fdf876bbb5e1b342b235510ec438599c96b8b8/pangeo_forge_recipes/recipes/xarray_zarr.py#L305-L306

Do you think there's a way to get your desired filtering via something like

import xarray as xr

def filter_grib(ds: xr.Dataset, filename: str) -> xr.Dataset:
    vars_to_drop = dict(
        grib_1=[],  # iterable of vars to drop if input file is GRIB1 format
        grib_2=[],  # iterable of vars to drop if input file is GRIB2 format
    )
    # `some_grib_1_identifier` / `some_grib_2_identifier` are placeholders
    # for whatever attribute distinguishes the two formats.
    if some_grib_1_identifier in ds.attrs:
        ds = ds.drop_vars(vars_to_drop["grib_1"])
    elif some_grib_2_identifier in ds.attrs:
        ds = ds.drop_vars(vars_to_drop["grib_2"])
    else:
        raise ValueError("GRIB version not identifiable from `ds.attrs`")
    return ds  # process_input must return the (modified) dataset

recipe = XarrayZarrRecipe(..., process_input=filter_grib, ...)

?

Depending on how many inputs you have and/or the information encoded in their filenames, rather than inferring the GRIB version from ds.attrs, you may be able either to pass filter_grib an explicit mapping between GRIB versions and filenames, or just infer GRIB version at runtime based on the filename.
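
A rough sketch of the filename-based variant follows. The suffix convention (`.grib1`/`.grb1` versus `.grib2`/`.grb2`), the `VARS_TO_DROP` table, and the helper names are all hypothetical; substitute whatever your filenames actually encode. An explicit mapping, when given, takes precedence over the inferred edition.

```python
import re
from typing import Any, Dict, List, Mapping, Optional

# Hypothetical: variables to drop per GRIB edition (left empty on purpose).
VARS_TO_DROP: Dict[int, List[str]] = {1: [], 2: []}

# Hypothetical naming convention: the edition is the last character of the suffix.
_EDITION_SUFFIX = re.compile(r"\.gri?b([12])$", re.IGNORECASE)


def grib_edition(filename: str, explicit: Optional[Mapping[str, int]] = None) -> int:
    """Return 1 or 2, from an explicit mapping if given, else from the suffix."""
    if explicit and filename in explicit:
        return explicit[filename]
    match = _EDITION_SUFFIX.search(filename)
    if match is None:
        raise ValueError(f"GRIB edition not identifiable from {filename!r}")
    return int(match.group(1))


def filter_grib(
    ds: Any,
    filename: str,
    explicit: Optional[Mapping[str, int]] = None,
    vars_to_drop: Mapping[int, List[str]] = VARS_TO_DROP,
) -> Any:
    """process_input-style callable: drop variables based on the filename."""
    edition = grib_edition(filename, explicit)
    return ds.drop_vars(list(vars_to_drop[edition]))
```

Since process_input is called with only the dataset and the filename, the extra arguments would need to be bound first, e.g. with `functools.partial(filter_grib, explicit=my_mapping)`.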

cisaacstern avatar Nov 29 '21 16:11 cisaacstern

Quick note:

> you might be able to achieve the same result by loading the whole dataset

Some datasets cannot be loaded at all, because the different parts conflict in their coordinates definitions. Maybe that doesn't apply in this case, but I've certainly seen it.

martindurant avatar Nov 29 '21 16:11 martindurant

The PR I just started in #245 should allow you to handle this use case by providing a custom "Opener" which would dispatch the correct options depending on the filename or any other information passed from the FilePattern.
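
To illustrate the dispatch idea only: the sketch below is not the interface from #245, whose API is not described in this thread. `make_dispatching_opener`, the `Rule` tuple, and the glob patterns are hypothetical names chosen for the example. The first matching rule wins, so more specific patterns should come first.

```python
from fnmatch import fnmatch
from typing import Any, Callable, Dict, List, Tuple

# (filename glob, keyword arguments to pass to the underlying open function)
Rule = Tuple[str, Dict[str, Any]]


def make_dispatching_opener(
    rules: List[Rule], open_fn: Callable[..., Any]
) -> Callable[[str], Any]:
    """Build an opener that picks its kwargs from the first matching rule."""

    def opener(filename: str) -> Any:
        for pattern, kwargs in rules:
            if fnmatch(filename, pattern):
                return open_fn(filename, **kwargs)
        raise ValueError(f"no opener rule matches {filename!r}")

    return opener
```

With xarray, `open_fn` could be `functools.partial(xr.open_dataset, engine="cfgrib")`, and each rule's kwargs could carry a different `backend_kwargs={"filter_by_keys": ...}`.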

rabernat avatar Nov 29 '21 16:11 rabernat

> Some datasets cannot be loaded at all, because the different parts conflict in their coordinates definitions.

That's exactly the case that I'm running into, and it is common with GRIB. process_input can't address this, since it assumes we've already opened the data with xarray.

#245 would definitely solve this issue! With that, we could prevent these kinds of errors by using cfgrib directly instead of going through xarray, for example.
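
As a sketch of what "using cfgrib directly" might look like: `cfgrib.open_datasets` splits a file into a list of internally consistent datasets, which can then be grouped before any merging is attempted. Grouping on the `GRIB_edition` attribute is an assumption about what cfgrib records on each dataset, and `open_by_edition` is a hypothetical helper, so check `ds.attrs` on your own files first.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


def _cfgrib_open_datasets(path: str) -> List[Any]:
    # Imported lazily so the grouping logic can be tried without cfgrib.
    import cfgrib

    return cfgrib.open_datasets(path)


def open_by_edition(
    path: str,
    open_parts: Callable[[str], List[Any]] = _cfgrib_open_datasets,
    attr: str = "GRIB_edition",
) -> Dict[Any, List[Any]]:
    """Group the datasets found in `path` by their GRIB edition attribute."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for ds in open_parts(path):
        # Datasets lacking the attribute end up under the key None.
        groups[ds.attrs.get(attr)].append(ds)
    return dict(groups)
```

Each group could then be merged or written separately, which avoids ever asking xarray to hold conflicting coordinate definitions in one dataset.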

alxmrs avatar Nov 29 '21 23:11 alxmrs