pangeo-forge-recipes

Alternative to dropping attributes that vary between datasets

Open jbusecke opened this issue 2 months ago • 2 comments

When we merge dataset schemas here, we currently drop every attribute whose value is not identical across all of them.

Example:

from pangeo_forge_recipes.aggregation import _combine_xarray_schemas, dataset_to_schema, schema_to_template_ds
import xarray as xr

ds_a = xr.Dataset(attrs={'something_same':'a', 'something_different':'a'})
ds_b = xr.Dataset(attrs={'something_same':'a', 'something_different':'b'})

schemas = [dataset_to_schema(ds) for ds in [ds_a, ds_b]]
combined_schema = _combine_xarray_schemas(*schemas)
ds_new = schema_to_template_ds(combined_schema)
ds_new

gives

<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    something_same:  a

I would like a way to preserve the values of something_different on each dataset. Perhaps we could add an option to just make a list of the differing items?

<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    something_same:  a
    something_different: [a, b]
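
As a rough sketch of the merge rule I have in mind, here is a version that works on plain attribute dictionaries. The helper name `merge_attrs_keep_differing` is hypothetical and not part of pangeo-forge-recipes, and it assumes attribute values are simple scalars or strings that can be compared with `==`.

```python
def merge_attrs_keep_differing(*attrs_dicts):
    """Merge attribute dicts without dropping conflicting values.

    Hypothetical helper, for illustration only: values that agree across
    all inputs are kept as-is, values that differ are collected in a list.
    """
    # Collect keys in first-seen order across all inputs.
    keys = []
    for attrs in attrs_dicts:
        for key in attrs:
            if key not in keys:
                keys.append(key)

    merged = {}
    for key in keys:
        # Gather the distinct values for this key, preserving input order.
        values = []
        for attrs in attrs_dicts:
            if key in attrs and attrs[key] not in values:
                values.append(attrs[key])
        # A single distinct value stays a scalar; several become a list.
        merged[key] = values[0] if len(values) == 1 else values
    return merged


attrs_a = {"something_same": "a", "something_different": "a"}
attrs_b = {"something_same": "a", "something_different": "b"}
merged = merge_attrs_keep_differing(attrs_a, attrs_b)
print(merged)
# {'something_same': 'a', 'something_different': ['a', 'b']}
```

In this sketch, an attribute that appears on only some datasets is kept with its single value; whether that is the right choice is an open design question.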

This is motivated by a real-world use case. For CMIP6, each file has a unique tracking_id that can be used to find issues with a specific file (which would then affect the entire resulting concatenated dataset). Currently my pangeo-forge-recipes-based workflow drops this important information completely.

Happy to help with a PR, but I am not sure what the best way to expose this behavior to the user would be.

Would this be a keyword argument to StoreToZarr?

jbusecke, May 06 '24 21:05

FYI this is a hard problem in general, and we normally recommend promoting unique_tracking_id to be an actual coordinate variable so that it has specific rules for propagation.

https://github.com/pydata/xarray/issues/1614

TomNicholas, May 07 '24 15:05

Interesting. It would be great to have this implemented at the xarray level, but AFAICT that would still not solve the issue here, since we are not using xarray to generate much of the schema.

jbusecke, May 09 '24 18:05