pystac icon indicating copy to clipboard operation
pystac copied to clipboard

Add async I/O methods [WIP]

Open duckontheweb opened this issue 3 years ago • 5 comments

Related Issue(s)

  • Closes #274 (pending additional changes)
  • Closes #609 (pending additional changes)

Description

Implements asynchronous I/O operations by adding async equivalents to the following StacIO methods:

  • read_text / read_text_async (abstract methods)
  • write_text / write_text_async (abstract methods)
  • read_json / read_json_async
  • save_json / save_json_async
  • read_stac_object / read_stac_object_async

This PR also creates a new DefaultStacIOAsync class with default implementations of the async methods using aiofiles (for local file I/O) and httpx (for reading via HTTP). In order to take advantage of these async methods, users must install PySTAC using the aiofiles and httpx extras or create their own implementations of the async abstract methods using their libraries of choice (e.g. aiohttp).

I added a Catalog.save_async method that takes advantage of these new async I/O methods as an example of how me might implement async I/O throughout the code base. Feedback would be appreciated on the approach of having sync methods call their equivalent async method, and on the naming conventions used.

There is still quite a bit of work to do on this before we can consider the linked issues to be closed, but I would like feedback on the changes to the Catalog and StacIO APIs before I do too much more work on the rest of the code. I've added some additional TODOs in the PR checklist.

This PR represents a breaking change, as it requires developers to implement the abstract read_text_async and write_text_async methods in any custom StacIO implementations. However, users who are using the default implementation should not need to make any changes.

PR Checklist

  • [ ] Code is formatted (run pre-commit run --all-files)
  • [ ] Tests pass (run scripts/test)
  • [ ] Documentation has been updated to reflect changes, if applicable
  • [ ] This PR maintains or improves overall codebase code coverage.
  • [ ] Changes are added to the CHANGELOG. See the docs for information about adding to the changelog.
  • [ ] Test performance of synchronous StacIO methods calling async methods vs. using separately maintained synchronous code.
  • [ ] Implement asynchronous versions of remaining methods that involve I/O operations (e.g. Catalog.walk)
  • [ ] Update tutorials and other docs to use async methods
  • [ ] Merge changes from #748 and add benchmarks to test I/O performance improvements/degradations

duckontheweb avatar Feb 14 '22 21:02 duckontheweb

Changed the base branch to 2.0 since this introduces breaking changes.

duckontheweb avatar Feb 15 '22 14:02 duckontheweb

I've never used asyncio, so I don't have a lot to offer in review.

schwehr avatar Feb 16 '22 22:02 schwehr

Thanks @gadomski!

It feels a little awkward to have all of the RuntimeError checks inside the save method, but if that's how it has to be, so it goes.

I agree. I was going to abstract this into a separate _run_coroutine method, but I didn't want to add that complexity until there was consensus on the general approach.

  • Go "async-first" and make save (and friends) async, and use "blocking" to describe non-async methods (e.g. read_blocking). This is probably a bad idea because PySTAC is so foundational and we don't want people to have to go async to just use it.

I thought about this, too, and was wary for the same reasons. I suspect that the majority of our users are not as familiar with async in Python and might be at least intimidated by seeing the methods they are used to using switched to async.

  • Use two different classes for Async and Sync objects, e.g. AsyncItem. I think this probably breaks a lot of the type inference stuff used to detect item types?

I had toyed with this idea a bit, too. I think the general recommendation for async programming is to put sync and async I/O operations in different classes, but as you mentioned that might cause problems with some of our STAC object type sniffing. It also would make it a little trickier to take advantage of async operations within synchronous methods (like here) for parallelizing operations.

That said, I'm open to other ways of structuring this. One thought would be to have an AsyncItem class as you suggested but only have it contain the I/O operations and make it available within the normal Item class (maybe in an Item.async_io property). That way our normal STAC object detection methods could still return the normal classes, and we could isolate the async operations to a separate class. If that seems worth pursuing I can put in an alternate PR with that structure.

  • Is there any way to "store" asyncio information on StacIO to avoid all those try/excepts? Probably not...just something that I thought of while reading tokio's tutorial on bridging w/ sync code: https://tokio.rs/tokio/topics/bridging.

I had originally structured the StacIO changes by having a separate AsyncStacIO class that was available as StacIO.async_io in the normal StacIO instances and could be used by StacIO to do async operations. I ran into a some problems with making the sync and async methods available within the methods on STACObject, Catalog, etc., but those might be alleviated if we separate out the async logic as suggested above. This makes me think it might be worth a separate PR to explore structuring the the async code that way so we have a good point of comparison.

duckontheweb avatar Feb 28 '22 14:02 duckontheweb

  • just something that I thought of while reading tokio's tutorial on bridging w/ sync code: https://tokio.rs/tokio/topics/bridging.

This is an interesting idea. It looks like their approach is to wrap the async functions to create blocking versions, which is different from the httpx approach of having completely separate async and blocking code. It might be worth pursuing this approach to see if it simplifies the code at all.

duckontheweb avatar Feb 28 '22 14:02 duckontheweb

When I was updating Planet's stactools plugin I noticed that stactools already depends on aiohttp. Maybe it would be worth considering using that instead of httpx? So that a user's stactools install doesn't contain more web clients than it needs.

sgillies avatar Feb 28 '22 15:02 sgillies