zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

WIP: Concurrent block reads and writes

Open llllllllll opened this issue 5 years ago • 1 comments

This commit adds support for optional concurrent executors to allow concurrently reading or writing blocks to the underlying store. For stores where the IO is expensive, for example an S3Map, allowing concurrent reads can be a massive performance improvement.

The API is designed around Executor objects to allow safe composition. An executor may be threaded to all of the places where concurrency is desired, both inside of Zarr itself and in the user's code, to ensure that a shared thread pool is used and no more than the desired number of threads are launched at the same time. Executors would allow users to safely read or write blocks concurrently from different Zarr Array objects at the same time, without worrying about accidentally spawning too many threads.

I am opening this PR early before adding tests or more documentation to get feedback this proposed API, and discuss the implications of allowing concurrent access to blocks in a zarr array. I didn't want to invest too much time before getting your feedback to make sure this is something that the zarr developers are interested in.

This API doesn't necessarily allow users to do something new, because they could add concurrency on top of the existing interfaces, but adding this interface makes it much easier to align your reads and writes on chunk boundaries without needing to know as much about the implementation of Zarr. This is especially true for integer indexers and boolean mask indexers.

One major downside of this API is that it makes it easier for a user to attempt to access a non thread safe store in a multithreaded context. A user may naively use a ThreadPoolExecutor when it is not safe to do so and then get unexpected crashes. This could cause hard to debug (or even notice) data corruption depending on the given store. Another issue is that users may open many duplicate issues about this which adds a maintenance burden for the zarr developers.

TODO:

  • [ ] Add unit tests and/or doctests in docstrings
  • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [ ] New/modified features documented in docs/tutorial.rst
  • [ ] Changes documented in docs/release.rst
  • [ ] AppVeyor and Travis CI passes
  • [ ] Test coverage is 100% (Coveralls passes)

llllllllll avatar Feb 04 '20 02:02 llllllllll

I left some comments in #547 about a use case that may be useful to consider when implementing this. The gist is that it'd be very nice to allow stores to make their own concurrency or other optimizations, but that requires getting them more information.

I think my specific need is implementable with what you have, but it may not be ideal in many ways.

bilts avatar Mar 22 '20 17:03 bilts

This has obviously gone stale (appreciate the effort here @llllllllll) but I would point those following it to take a look at #1583 where we layout a design for concurrent actions across the library.

jhamman avatar Dec 07 '23 22:12 jhamman