
Add an asynchronous load method?

Open TomNicholas opened this issue 6 months ago • 1 comments

Is your feature request related to a problem?

Currently all xarray .load() calls are blocking, so the only way to concurrently load data for a bunch of different xarray objects is to use dask. This comes up when loading data from high-latency backends such as Zarr on remote object storage.

Describe the solution you'd like

But now that zarr v3 has async get methods, it should be possible to add an async version of the .load() method that could be used like this:

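import asyncio

async def load_many_dataarrays_concurrently(dataarrays):
    # async_load() is the method proposed in this issue, not an existing xarray API
    tasks = [da.async_load() for da in dataarrays]
    # gather awaits all loads concurrently and returns results in input order
    return await asyncio.gather(*tasks)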

For N zarr stores pointing to remote object storage, each with a latency of ~1s, this code could in theory take only ~1s in total, whereas the blocking equivalent (i.e. return [da.load() for da in dataarrays]) would take at least ~N seconds.
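The timing argument can be illustrated without xarray or zarr at all. The following is only a sketch: FakeRemoteArray, LATENCY, and the helper functions are hypothetical names invented for the demo, and the latency is shortened from ~1s to 50 ms so it runs quickly.

```python
import asyncio
import time

LATENCY = 0.05  # hypothetical stand-in for the ~1s store latency


class FakeRemoteArray:
    """Hypothetical stand-in for a DataArray backed by a high-latency store."""

    def __init__(self, value):
        self.value = value

    def load(self):
        # Blocking load: the calling thread waits out the full latency.
        time.sleep(LATENCY)
        return self.value

    async def async_load(self):
        # Async load: yields to the event loop while "waiting on the network".
        await asyncio.sleep(LATENCY)
        return self.value


def load_sequentially(arrays):
    return [a.load() for a in arrays]


async def load_concurrently(arrays):
    return await asyncio.gather(*(a.async_load() for a in arrays))


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


arrays = [FakeRemoteArray(i) for i in range(5)]
seq_result, seq_time = timed(lambda: load_sequentially(arrays))
con_result, con_time = timed(lambda: asyncio.run(load_concurrently(arrays)))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

With N = 5, the sequential path should take roughly N × LATENCY while the concurrent path should take roughly one LATENCY, which is the scaling described above.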

(Note this suggestion is not the same as #8965, which is about concurrently loading multiple variables behind the scenes, rather than exposing an async interface to the user.)

The new method could be da.async_load(), or even use an accessor namespace like da.async.load().

To make this work we would need to add an async version of BackendArray.get_duck_array

https://github.com/pydata/xarray/blob/c8affb3c17769121a3a9895f8cfad6ed137a6e0f/xarray/backends/common.py#L273

and plumb that down through to zarr's AsyncArray methods somehow.
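As a rough sketch of what that plumbing might look like: the classes below are hypothetical stand-ins, not xarray's BackendArray or zarr's AsyncArray, and the real get_duck_array signature is simplified here. The idea is that the async method holds the real logic and the blocking method simply drives it to completion.

```python
import asyncio


class FakeAsyncArray:
    """Hypothetical stand-in for an array type exposing an async getter."""

    def __init__(self, data):
        self._data = data

    async def getitem(self, selection):
        # Simulate a high-latency read from remote storage.
        await asyncio.sleep(0.01)
        return self._data[selection]


class SketchBackendArray:
    """Hypothetical backend wrapper offering both sync and async access."""

    def __init__(self, async_array):
        self._async_array = async_array

    async def async_get_duck_array(self):
        # Async path: await the underlying store directly.
        return await self._async_array.getitem(slice(None))

    def get_duck_array(self):
        # Blocking path: run the coroutine on a fresh event loop.
        # Note asyncio.run() raises if a loop is already running in this thread.
        return asyncio.run(self.async_get_duck_array())


backend = SketchBackendArray(FakeAsyncArray([1, 2, 3]))
print(backend.get_duck_array())
```

Whether the sync method should wrap the async one like this, or the two should be separate code paths, is a design question this sketch does not settle.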

Describe alternatives you've considered

Using dask adds massive overhead and additional complexity. There may be some other way to do this that I'm not aware of.
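One possible stopgap, offered only as an assumption and not something proposed in this issue, is to push the existing blocking .load() calls onto worker threads with the standard library's asyncio.to_thread (Python 3.9+). BlockingLoader below is a hypothetical stand-in for an xarray object; whether a given backend is thread-safe or releases the GIL while waiting on I/O is not addressed here.

```python
import asyncio
import time


class BlockingLoader:
    """Hypothetical stand-in for an object with a blocking .load() method."""

    def __init__(self, value):
        self.value = value

    def load(self):
        time.sleep(0.05)  # simulate a blocking, high-latency read
        return self.value


async def load_many_in_threads(objects):
    # Each blocking load runs in the default thread pool; gather keeps input order.
    return await asyncio.gather(*(asyncio.to_thread(obj.load) for obj in objects))


results = asyncio.run(load_many_in_threads([BlockingLoader(i) for i in range(4)]))
print(results)
```

This keeps the event loop free, but it costs one thread per in-flight load, which a native async path would avoid.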

Additional context

This is a desired-enough feature that other people have done it before in 3rd-party libraries, e.g. https://github.com/jeliashi/xarray-async. That particular implementation also targeted zarr, but predates the async get methods now available in zarr v3.

cc @dcherian @rabernat @jhamman @ianhi

TomNicholas, May 16 '25 02:05

Actually the accessor syntax idea of ds.async.load() is not possible, because async is a reserved keyword in Python, so ds.async raises a SyntaxError. It would have to be one of:

ds.async_.load()
ds.async_load()
ds.load_async()

or something like that.
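The keyword restriction can be checked without any xarray object, since the failure happens at parse time. The helper name is_valid_expression is made up for this sketch; keyword.iskeyword and compile are standard library.

```python
import keyword


def is_valid_expression(source):
    """Return True if `source` parses as a Python expression."""
    try:
        compile(source, "<sketch>", "eval")
    except SyntaxError:
        return False
    return True


print(keyword.iskeyword("async"))               # async is a reserved keyword
print(is_valid_expression("ds.async.load()"))   # rejected by the parser
print(is_valid_expression("ds.async_.load()"))  # trailing underscore parses
print(is_valid_expression("ds.async_load()"))
print(is_valid_expression("ds.load_async()"))
```

Only parsing is tested here, so ds never needs to exist.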

TomNicholas, May 17 '25 21:05