
Implement async support for open_datatree

Open aladinor opened this issue 4 months ago • 8 comments

  • [X] Closes #10579 and #12
  • [X] Tests added
  • [X] User visible changes (including notable bug fixes) are documented in whats-new.rst

aladinor avatar Sep 14 '25 14:09 aladinor

This looks great! Would it be possible to make the sync path reuse the async methods internally? This would help reduce duplication, increase test coverage and speed up sync workflows.

shoyer avatar Sep 14 '25 18:09 shoyer

Thanks for the suggestion @shoyer! I explored having the sync path reuse the async code via a universal coroutine runner. The main challenge is handling environments where an event loop is already running (such as Jupyter notebooks): there, asyncio.run() fails with "cannot be called from a running event loop", so the runner has to spawn a background thread to drive the coroutine.
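
For illustration, a minimal sketch of such a runner (not the code in this PR, just the pattern described above):

# Minimal sketch of a "universal coroutine runner": run a coroutine from
# sync code, falling back to a worker thread when an event loop is already
# running (e.g. inside Jupyter).
import asyncio
import concurrent.futures

def run_sync(coro):
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running: asyncio.run is safe here.
        return asyncio.run(coro)
    # A loop is already running, so asyncio.run would raise
    # "cannot be called from a running event loop"; run the coroutine
    # on its own loop in a background thread instead.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()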

However, this approach raises some design concerns:

  • Threading implications: The sync API would internally spawn threads in Jupyter environments, which conflicts with xarray's general avoidance of hidden threading. This can make debugging harder, affect resource management, and surprise users who expect predictable sync behavior.
  • Maintenance burden: We'd need to maintain the threading utility, handle edge cases across different environments, and ensure thread safety.
  • User experience: Some users prefer explicit control over when async/threading is used, especially in performance-critical applications.
  • Alternative benefits: The current approach still provides the main wins: users get significant performance improvements by explicitly choosing open_datatree_async() (see the usage sketch after this list), and testing the async path covers the core logic.
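
For reference, explicit use of the async entry point would look roughly like this (a sketch only; the exact signature of open_datatree_async isn't shown in this thread, so I'm assuming it mirrors open_datatree):

# Hypothetical usage of the explicit async path (signature assumed to
# mirror open_datatree; not taken verbatim from this PR).
import asyncio
import xarray as xr

async def main():
    dt = await xr.open_datatree_async(
        "s3://bucket/data.zarr", engine="zarr"
    )
    return dt

dt = asyncio.run(main())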

The tradeoff is code deduplication versus user control and predictable behavior. Other major Python libraries (like httpx and requests-async) often keep separate sync/async implementations for similar reasons.

What's your take on the threading tradeoff vs. the deduplication benefits?

CC @TomNicholas

aladinor avatar Sep 15 '25 15:09 aladinor

I'm pretty sure Zarr v3 uses async internally to implement sync methods. It may be worth taking a look at how Zarr does things, especially given the strong overlap in the contributor communities.

Launching a few threads is not particularly resource-intensive, so I'm not worried about that. Thread safety is a potential concern, but we do already take care to ensure that Xarray is thread safe internally, especially for IO backends.

I think we can safely say that the vast majority of Xarray users are not familiar with async programming models, so I think they could really benefit from having this work by default. This is quite different from the user base for the web programming libraries you mention.

shoyer avatar Sep 15 '25 16:09 shoyer

@shoyer did you see https://github.com/pydata/xarray/issues/10622? I raised that issue to discuss the general problem of how these libraries interact with each other when it comes to concurrency.

> I'm pretty sure Zarr v3 uses async internally to implement sync methods. It may be worth taking a look at how Zarr does things, especially given the strong overlap in the contributor communities.

Yes zarr manages its own threadpool.

TomNicholas avatar Sep 15 '25 17:09 TomNicholas

OK, let's try to reach some initial resolution about the async strategy for Xarray over in #10622 first!

shoyer avatar Sep 15 '25 20:09 shoyer

Hey @TomNicholas and @shoyer,

I've updated the async DataTree implementation based on our previous discussions. Key changes:

User-facing API remains synchronous (no await needed): users just call the normal sync API, e.g. dt = xr.open_datatree("s3://bucket/data.zarr", engine="zarr")

How it works internally:

  • The zarr backend's open_datatree() now uses zarr.core.sync.sync() (aliased as zarr_sync) to execute async code from the sync context
  • Internally, _open_datatree_from_stores_async() opens all groups and creates indexes concurrently using asyncio.gather() (see the sketch after this list)
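
Roughly, the internal flow looks like the following sketch (apart from zarr.core.sync.sync, asyncio.gather, and _open_datatree_from_stores_async, which are named above, the helper names and signatures here are illustrative placeholders, not the exact code in this PR):

# Illustrative sketch of the sync-over-async flow in the zarr backend.
import asyncio
from zarr.core.sync import sync as zarr_sync

async def _open_one_group_async(store, group_path):
    # Placeholder for the per-group work: read metadata, build the
    # Dataset and its indexes for this group.
    ...

async def _open_datatree_from_stores_async(store, group_paths):
    # Open all groups and create their indexes concurrently.
    datasets = await asyncio.gather(
        *(_open_one_group_async(store, path) for path in group_paths)
    )
    return dict(zip(group_paths, datasets))

def open_datatree(store, group_paths):
    # Sync entry point: drive the async implementation via zarr's own
    # coroutine runner, so users never have to await anything.
    return zarr_sync(_open_datatree_from_stores_async(store, group_paths))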

Please let me know your thoughts on this.

aladinor avatar Dec 12 '25 21:12 aladinor

> OK, let's try to reach some initial resolution about the async strategy for Xarray over in https://github.com/pydata/xarray/issues/10622 first!

My understanding of that issue is that people thought that it should be zarr's responsibility to offer API that xarray could use (e.g. open_many_groups_async). But OTOH @aladinor's implementation looks great, and it's all internal, so shall we just get this merged?

@aladinor have you benchmarked this at scale? Creating a graph like this one would be really interesting.

TomNicholas avatar Dec 13 '25 09:12 TomNicholas

Hi @TomNicholas,

I ran the same benchmark as shown here https://github.com/pydata/xarray/issues/10579#issue-3270790283 and these are the results: [benchmark_async_datatree plot]

I also tested it with real data on OSN S3 using the following code:


import xarray as xr
import icechunk as ic
from time import time

# Anonymous, read-only access to the KLOT-RT icechunk repo in the
# nexrad-arco bucket on OSN
storage = ic.s3_storage(
    bucket='nexrad-arco',
    prefix='KLOT-RT',
    endpoint_url='https://umn1.osn.mghpcc.org',
    anonymous=True,
    force_path_style=True,
    region='us-east-1',
)

repo = ic.Repository.open(storage)
session = repo.readonly_session("main")

# Open the full tree lazily through the zarr backend
dtree_region = xr.open_datatree(
    session.store,
    zarr_format=3,
    consolidated=False,
    chunks={},
    engine="zarr",
)
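
For the timings below, the open call is wrapped with a simple timer using the time import above (a minimal sketch, not the exact benchmark script):

# Time the same open_datatree call as in the snippet above
start = time()
dtree_region = xr.open_datatree(
    session.store,
    zarr_format=3,
    consolidated=False,
    chunks={},
    engine="zarr",
)
print(f"tree creation took {time() - start:.2f} s")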

And I got the following results:

sequential tree creation (8.5 secs) [screenshot: sequential]

async tree creation (1.4 secs) [screenshot: async]

This benchmark showed a ~6x speedup (8.55 s → 1.43 s), which is even better than the synthetic benchmark because real cloud I/O has additional overhead (TCP, TLS, HTTP) that benefits more from concurrent connections.

aladinor avatar Dec 13 '25 17:12 aladinor