pystac
pystac copied to clipboard
Add async I/O methods [WIP]
Related Issue(s)
- Closes #274 (pending additional changes)
- Closes #609 (pending additional changes)
Description
Implements asynchronous I/O operations by adding async
equivalents to the following StacIO
methods:
-
read_text
/read_text_async
(abstract methods) -
write_text
/write_text_async
(abstract methods) -
read_json
/read_json_async
-
save_json
/save_json_async
-
read_stac_object
/read_stac_object_async
This PR also creates a new DefaultStacIOAsync
class with default implementations of the async methods using aiofiles
(for local file I/O) and httpx
(for reading via HTTP). In order to take advantage of these async methods, users must install PySTAC using the aiofiles
and httpx
extras or create their own implementations of the async abstract methods using their libraries of choice (e.g. aiohttp
).
I added a Catalog.save_async
method that takes advantage of these new async I/O methods as an example of how me might implement async I/O throughout the code base. Feedback would be appreciated on the approach of having sync methods call their equivalent async method, and on the naming conventions used.
There is still quite a bit of work to do on this before we can consider the linked issues to be closed, but I would like feedback on the changes to the Catalog
and StacIO
APIs before I do too much more work on the rest of the code. I've added some additional TODOs in the PR checklist.
This PR represents a breaking change, as it requires developers to implement the abstract read_text_async
and write_text_async
methods in any custom StacIO
implementations. However, users who are using the default implementation should not need to make any changes.
PR Checklist
- [ ] Code is formatted (run
pre-commit run --all-files
) - [ ] Tests pass (run
scripts/test
) - [ ] Documentation has been updated to reflect changes, if applicable
- [ ] This PR maintains or improves overall codebase code coverage.
- [ ] Changes are added to the CHANGELOG. See the docs for information about adding to the changelog.
- [ ] Test performance of synchronous
StacIO
methods calling async methods vs. using separately maintained synchronous code. - [ ] Implement asynchronous versions of remaining methods that involve I/O operations (e.g.
Catalog.walk
) - [ ] Update tutorials and other docs to use async methods
- [ ] Merge changes from #748 and add benchmarks to test I/O performance improvements/degradations
Changed the base branch to 2.0
since this introduces breaking changes.
I've never used asyncio, so I don't have a lot to offer in review.
Thanks @gadomski!
It feels a little awkward to have all of the
RuntimeError
checks inside thesave
method, but if that's how it has to be, so it goes.
I agree. I was going to abstract this into a separate _run_coroutine
method, but I didn't want to add that complexity until there was consensus on the general approach.
- Go "async-first" and make
save
(and friends) async, and use "blocking" to describe non-async methods (e.g.read_blocking
). This is probably a bad idea because PySTAC is so foundational and we don't want people to have to go async to just use it.
I thought about this, too, and was wary for the same reasons. I suspect that the majority of our users are not as familiar with async in Python and might be at least intimidated by seeing the methods they are used to using switched to async.
- Use two different classes for Async and Sync objects, e.g.
AsyncItem
. I think this probably breaks a lot of the type inference stuff used to detect item types?
I had toyed with this idea a bit, too. I think the general recommendation for async programming is to put sync and async I/O operations in different classes, but as you mentioned that might cause problems with some of our STAC object type sniffing. It also would make it a little trickier to take advantage of async operations within synchronous methods (like here) for parallelizing operations.
That said, I'm open to other ways of structuring this. One thought would be to have an AsyncItem
class as you suggested but only have it contain the I/O operations and make it available within the normal Item
class (maybe in an Item.async_io
property). That way our normal STAC object detection methods could still return the normal classes, and we could isolate the async operations to a separate class. If that seems worth pursuing I can put in an alternate PR with that structure.
- Is there any way to "store" asyncio information on
StacIO
to avoid all thosetry/excepts
? Probably not...just something that I thought of while reading tokio's tutorial on bridging w/ sync code: https://tokio.rs/tokio/topics/bridging.
I had originally structured the StacIO
changes by having a separate AsyncStacIO
class that was available as StacIO.async_io
in the normal StacIO
instances and could be used by StacIO
to do async operations. I ran into a some problems with making the sync and async methods available within the methods on STACObject
, Catalog
, etc., but those might be alleviated if we separate out the async logic as suggested above. This makes me think it might be worth a separate PR to explore structuring the the async code that way so we have a good point of comparison.
- just something that I thought of while reading tokio's tutorial on bridging w/ sync code: https://tokio.rs/tokio/topics/bridging.
This is an interesting idea. It looks like their approach is to wrap the async functions to create blocking versions, which is different from the httpx approach of having completely separate async and blocking code. It might be worth pursuing this approach to see if it simplifies the code at all.
When I was updating Planet's stactools plugin I noticed that stactools already depends on aiohttp. Maybe it would be worth considering using that instead of httpx? So that a user's stactools install doesn't contain more web clients than it needs.