xarray Support for globs

Is your feature request related to a problem?

When working with larger datasets or arrays with multiple dimensions, where the variable or dimension names are largely similar and differ only by, e.g., a suffix, it would be nice to do operations using globs.

On top of that, operations on datasets where not all variables share the same dimensions often require additional loops over the variables.

In our code base we often have a dataset where each variable has its own dimensions that follow the same logic. E.g. a variable called "lambda_nir" with a dimension "spectral_nir", and many more variables where the "_nir" suffix is replaced with a different one.

Then we often have to loop over variables to call apply_ufunc with the explicit dimension. Here, it would be nice to also call apply_ufunc(..., input_core_dims=[["spectral_*"]], ...).

Describe the solution you'd like

E.g. instead of writing ds[["img_a", "img_b", "img_c"]] I would like to be able to write ds["img_*"].

Similarly, I would like to be able to replace ds.mean(["dim_a", "dim_b", "dim_c"]) with ds.mean("dim_*").

Describe alternatives you've considered

In some operations you can already pass callables, but this is usually more work and less readable than just passing the list of dimensions.

Additional context

This feature is quite a breaking change, so it would probably require a long deprecation phase, with a warning for users whose names contain a *.

Jun 21 '24 09:06 headtr1ck

Interesting suggestion. Here are some disconnected thoughts.

For a related problem, we already have filter_by_attrs.

Perhaps we should generalize to Dataset.filter_by that takes a name, attrs, dims and does the filtering.
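To make the idea concrete, here is a minimal sketch of what such a filter might do. It is not an xarray API: filter_by, the Var stand-in class, and the argument semantics (a glob for the name, a glob that at least one dimension must match, exact attribute values) are all assumptions chosen for illustration, and a plain dict stands in for the Dataset.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Var:
    """Stand-in for a dataset variable: only its dimension names and attrs."""
    dims: tuple
    attrs: dict = field(default_factory=dict)


def filter_by(variables, name=None, dims=None, attrs=None):
    """Return the names of variables that satisfy every criterion given.

    Hypothetical helper, not part of xarray.
    name:  glob pattern matched against the variable name
    dims:  glob pattern that at least one dimension name must match
    attrs: mapping of attribute name -> required value
    """
    selected = []
    for var_name, var in variables.items():
        if name is not None and not fnmatchcase(var_name, name):
            continue
        if dims is not None and not any(fnmatchcase(d, dims) for d in var.dims):
            continue
        if attrs is not None and any(var.attrs.get(k) != v for k, v in attrs.items()):
            continue
        selected.append(var_name)
    return selected


# Made-up variables following the naming scheme from the issue description.
variables = {
    "lambda_nir": Var(("spectral_nir",), {"units": "nm"}),
    "lambda_vis": Var(("spectral_vis",), {"units": "nm"}),
    "img_a": Var(("y", "x"), {"units": "counts"}),
}

by_name = filter_by(variables, name="lambda_*")
by_dims = filter_by(variables, dims="spectral_*")
by_attrs = filter_by(variables, attrs={"units": "counts"})
```

Criteria are combined with a logical AND here; whether a real filter_by should do that, or match dimensions differently, is an open design question.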

I did something similar on selection by attribute values in cf-xarray. For example, ds.cf[["air_temperature"]] picks all variables with air_temperature in the attrs.

Re ds.mean("dim_*"): ds.mean([dim for dim in ds.dims if dim.startswith("dim_")]) seems OK.

In general this reminds me of https://github.com/pydata/xarray/issues/6053 where we might do ds.broadcasting.sum() and ds.glob.sum()?

Jun 21 '24 15:06 dcherian

Something like ds.filter_by(glob=...) or ds.glob seems reasonable!

My initial reaction is to vote against having xarray look through to the names of objects and process them. It makes that very fundamental operation more complicated, and I think without a compelling case. This is Python, so a quick list comprehension is always possible.

Jun 21 '24 19:06 max-sixty

I agree that simple operations like ds.sum are not the problem here. What I have always disliked is the explicit for loop over variables when calling apply_ufunc.

Jun 21 '24 19:06 headtr1ck

Check out the fnmatch module from the standard library, specifically the filter function.

In the above examples, thanks to the dict-like interface of Datasets, you should be able to do the following (with filter imported from fnmatch, not the builtin): ds[filter(ds, "img_*")] or ds.mean(filter(ds.dims, "dim_*")).

A literal * is matched by wrapping it in square brackets: ds[filter(ds, "C[*]_*")]. (Aside: C* is a method for tracking anthropogenic carbon, and maybe someone named a variable this way.)
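A small, self-contained illustration of the fnmatch behaviour described above. A plain list of made-up names stands in for the variable names of a Dataset, so no xarray is needed to run it.

```python
import fnmatch

# Made-up variable names; "C*_obs" contains a literal asterisk.
names = ["img_a", "img_b", "img_c", "mask", "C*_obs", "Cx_obs"]

# "*" matches any run of characters, so every "img_" variable is selected.
images = fnmatch.filter(names, "img_*")

# Unescaped, the "*" after "C" is a wildcard and matches both C-names.
wildcard = fnmatch.filter(names, "C*_*")

# "[*]" is a character class containing only "*", i.e. a literal asterisk.
literal = fnmatch.filter(names, "C[*]_*")

print(images)    # ['img_a', 'img_b', 'img_c']
print(wildcard)  # ['C*_obs', 'Cx_obs']
print(literal)   # ['C*_obs']
```

Note that fnmatch.filter normalizes case according to the operating system, so matching is case-insensitive on Windows; fnmatch.fnmatchcase is the case-sensitive variant if that matters.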

Jun 21 '24 20:06 DocOtak

Now that I think about it, I would simply like to have the possibility to define per-variable dimensions in apply_ufunc.

Maybe we could allow passing list[dict[str, list[str]] | list[str]] instead of only list[list[str]]. That would help already.
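As a rough sketch of how such a per-variable mapping could be built today, the helper below resolves a glob against each variable's dimension names. The name core_dims_by_variable and the example variables are hypothetical, and a plain dict of dimension tuples stands in for the Dataset.

```python
from fnmatch import fnmatchcase


def core_dims_by_variable(var_dims, pattern):
    """Map each variable name to those of its dims that match `pattern`.

    Hypothetical helper, not part of xarray. Variables without any
    matching dimension are left out of the result.
    """
    resolved = {}
    for name, dims in var_dims.items():
        matches = [d for d in dims if fnmatchcase(d, pattern)]
        if matches:
            resolved[name] = matches
    return resolved


# Dimension names per variable, as one might collect with
# {name: var.dims for name, var in ds.data_vars.items()}.
var_dims = {
    "lambda_nir": ("time", "spectral_nir"),
    "lambda_vis": ("time", "spectral_vis"),
    "quality": ("time",),
}

per_variable = core_dims_by_variable(var_dims, "spectral_*")
print(per_variable)
# {'lambda_nir': ['spectral_nir'], 'lambda_vis': ['spectral_vis']}
```

With the current API, each entry would still have to be passed to apply_ufunc one variable at a time; the suggestion above would let a dict of this shape be passed directly.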

Jul 09 '24 17:07 headtr1ck