
Remote access patterns using xarray.

Open betolink opened this issue 5 months ago • 8 comments

I'm not sure if this will fit in the upcoming (potential) SciPy tutorial or somewhere else, but I think it would be helpful to include a mini-guide on access patterns for remote storage. One of xarray's key strengths is also, in a way, a weakness: its abstractions are so powerful when opening multi-file datasets that they can hide the nuances of the different back-end storage types.

When a new user sees this and they get a data cube, it's like magic!

ds = xr.open_dataset(reference, engine="zarr")

Although this is the cloud-native way, a considerable amount of data is still in archival formats or only available through a service like OPeNDAP. In an ideal world, users shouldn't have to care about the format or location of their data, but I've run into multiple cases where the problem is not that xarray isn't doing its job, but that the data is in HDF on a slow server a continent away.

Sometimes there are workarounds, from using different sources (e.g. Planetary Computer, GEE) that serve the same data in a cloud-optimized format, to using Kerchunk or clever caching strategies. I feel that some of these topics are buried in GitHub threads and not necessarily exposed in the documentation.
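As one illustration of the caching idea, here is a minimal sketch that uses fsspec's `simplecache` protocol to download a remote file once and then open the local copy. The helper names (`simplecache_url`, `open_cached`), the default cache directory, and the example URL are hypothetical, and whole-file caching is only one of several possible strategies; it assumes the file is small enough to download in full.

```python
from pathlib import Path


def simplecache_url(url: str) -> str:
    """Prefix a URL with fsspec's whole-file caching protocol (hypothetical helper)."""
    if url.startswith("simplecache::"):
        # Already chained with simplecache; leave it untouched.
        return url
    return f"simplecache::{url}"


def open_cached(url: str, cache_dir: str = "./xr_cache"):
    """Download `url` once into `cache_dir`, then open the local copy with xarray."""
    # Imported lazily so the URL helper above works without these packages.
    import fsspec
    import xarray as xr

    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    # open_local fetches the file on first use and returns the cached local path.
    local_path = fsspec.open_local(
        simplecache_url(url),
        simplecache={"cache_storage": cache_dir},
    )
    # A plain local path lets xarray pick whichever installed engine fits the file.
    return xr.open_dataset(local_path)
```

Something like `ds = open_cached("https://example.org/data/file.nc")` (a placeholder URL) would pay the transfer cost on the first call only; later calls read from `cache_dir`.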

The idea would be to quickly illustrate what xarray does when I have files of type X and this access pattern:

import fsspec
import xarray as xr
file_set = [fsspec.open(f).open() for f in files]  # .open() turns each OpenFile into a file-like object
ds = xr.open_mfdataset(file_set)

What happens if my files are HDF4, NetCDF, or HDF5? What are steps 1, 2, 3...? Can we make it faster, and how? What if the data is behind OPeNDAP? And so on.
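To make the question concrete, here is a rough sketch of the kind of lookup such a guide could spell out: given a URL, which xarray engine and access strategy might apply. The function name `suggest_access`, the strategy labels, and the URL heuristics are all assumptions for illustration, not xarray behavior. In particular, a `.nc` suffix may also be a classic NetCDF3 file that `h5netcdf` cannot read, and the HDF4 branch assumes that HDF4 readers generally need a local file.

```python
from urllib.parse import urlparse

# Suffix heuristics only (assumption): real files should be checked by content.
HDF5_SUFFIXES = (".nc", ".nc4", ".h5", ".hdf5", ".he5")
HDF4_SUFFIXES = (".hdf", ".hdf4")


def suggest_access(url: str) -> tuple:
    """Return a hypothetical (engine, strategy) pair for a remote URL."""
    path = urlparse(url).path.lower().rstrip("/")
    if path.endswith(".zarr"):
        # Zarr stores are chunked objects; chunks are fetched on demand.
        return ("zarr", "direct-url")
    if "/dodsc/" in path or "opendap" in url.lower():
        # OPeNDAP endpoints are passed to xr.open_dataset as a plain URL;
        # the server subsets the data, so no fsspec layer is involved.
        return ("netcdf4", "direct-url")
    if path.endswith(HDF4_SUFFIXES):
        # Assumption: download the file first, then open it locally.
        return (None, "download-first")
    if path.endswith(HDF5_SUFFIXES):
        # HDF5-based files can be read through an fsspec file-like object.
        return ("h5netcdf", "file-like")
    return (None, "unknown")
```

For example, `suggest_access("https://example.org/data/granule.h5")` (a placeholder URL) yields `("h5netcdf", "file-like")`, which corresponds to the `fsspec` plus `open_mfdataset` pattern shown earlier in this post.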

I also wonder if this information is already out there in the docs and just needs to be compiled into a single notebook. I volunteer to start one if it is not.

betolink avatar Feb 16 '24 01:02 betolink

I volunteer to start one if is not.

Yes please! This would be a really really great notebook to add.

The docs are here: https://docs.xarray.dev/en/stable/user-guide/io.html but they need some reorganization.

dcherian avatar Feb 16 '24 16:02 dcherian