stravalib
Move to async network requests
Currently, network requests in https://github.com/stravalib/stravalib/blob/master/stravalib/protocol.py are made using the requests library, which blocks while data is being downloaded.
This isn't a problem when the library is being used to bulk-download data, but it is more of an issue when I've attempted to use stravalib to build interactive applications and want to keep the rest of the UI responsive (e.g. showing a progress display, or having a way to cancel a long request and move somewhere else).
It looks like there's a library, httpx, that supports async requests and is pretty much a drop-in replacement for requests: https://www.python-httpx.org/compatibility/
I'm happy to give it a go in a branch and make a PR, but I'm opening this first as requested in the contribution guideline, and to get feedback from the rest of the maintainers. Does anyone think it's a bad idea to make this library use async network requests? Anything that could trip us up while trying to do it?
Allowing async requests would be a great improvement; thanks for bringing this up!
Anything that could trip us up while trying to do it?
Off the top of my head (I may be overlooking stuff):
- From an API perspective, what would be the behavior and return type of the stravalib Client methods? E.g., what would get_activity() return? An actual Activity object or a future? Will this client method itself need to be declared as async?
- Would we need a separate AsyncClient API?
- There's a BatchedResultsIterator for dealing with paginated responses. When using it in an async fashion, do we need to preserve the order of the individual responses/pages?
- The current integration test suite is written using the responses library for mocking HTTP requests. This would need to be rewritten using a mocking library compatible with httpx or aiohttp (see the sketch below).
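For illustration, respx is one library that can mock httpx requests (pytest-httpx is another option). Here's a rough, untested sketch of what such a test could look like; the endpoint and payload below are placeholders and not taken from the actual test suite:

import httpx
import respx

@respx.mock
def test_get_athlete():
    # Placeholder endpoint/payload; a real test would exercise stravalib's Client.
    respx.get("https://www.strava.com/api/v3/athlete").mock(
        return_value=httpx.Response(200, json={"id": 123, "firstname": "Jo"})
    )
    response = httpx.get("https://www.strava.com/api/v3/athlete")
    assert response.json()["id"] == 123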
Please let us know what you think about the above considerations.
If you need a short-term workaround for your particular problem, you may consider using "plain" multithreading, see e.g. https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor.
Thanks Jonatan - all good questions, and I don't have an answer yet as I'm new to this part of Python. But I'm willing to give it a try and do some research to see how other libraries have done similar things.
You're right, for now I can work around the issue as you suggested, but it would be nicer to make my code cleaner in the future 😄
It would be great if you have the time and resources to research this, as it is not entirely trivial. I'm happy to help if you have questions about the stravalib code base.
Hi folks -- I'll offer my $0.02 here, since I've spent a fair bit of time with both blocking (i.e. requests-based) and non-blocking / asyncio (httpx-based) frameworks.
@futureshape , you're right that the requests library is a blocking library. The way to achieve higher-throughput / responsiveness would be to use multiple threads. You're also right, though, that for an application like this an asyncio framework would make a lot of sense. I do not know, though, whether an asyncio framework will actually be faster than a well-designed multi-threaded application; however, the concurrency syntax provided by asyncio would make it easier to implement this. (But there are also some caveats and it's easy for people who don't understand asyncio to essentially make everything single-threaded.)
The way this would typically work is that you would use async def ..., so that methods like get_activity() would be coroutines. Under the hood they would be returning futures, but when called with await they would resolve to the actual Activity. Concurrency can be achieved by using functions like asyncio.gather() to collect results from a whole bunch of these coroutines concurrently. Additionally, it would be fairly natural to use an httpx-based solution in a framework like FastAPI / Starlette, which is asyncio under the hood.
import asyncio

class Client:
    async def get_activity(self, id: str) -> Activity:
        ...

activity = await client.get_activity(id)  # resolves the Activity before proceeding

# Or, gather activities concurrently:
coro1 = client.get_activity(id1)
coro2 = client.get_activity(id2)
activity1, activity2 = await asyncio.gather(coro1, coro2)
The challenges here are:
- asyncio and blocking code aren't trivial to mix together. If someone wants to use stravalib in a framework that is not inherently asyncio (e.g. Flask), they'd need to wrap any coroutine calls. I'd suggest it's a bit easier to go the other way (wrapping blocking code) using asyncio.to_thread() or similar (see the sketch after this list).
- Some of the testing tools are also a little awkward w/ asyncio. I think pytest has decent support now, but it hasn't always been that way.
- Asyncio is often misunderstood. Especially in python where the tradition has been blocking code, sometimes people don't understand that if you're using an asyncio framework and you call out to a blocking function your entire application will block (in a threaded web framework, for example, this wouldn't be an issue as every request would typically be handled by a separate thread).
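To illustrate that wrapping direction, here's a minimal sketch; fetch_activity is a hypothetical helper, and asyncio.to_thread() requires Python 3.9+:

import asyncio

async def fetch_activity(client, activity_id):
    # Run the existing blocking stravalib call in a worker thread so the
    # asyncio event loop stays free while the request is in flight.
    return await asyncio.to_thread(client.get_activity, activity_id)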
So, I don't think it's a bad idea, but it's not a simple choice -- and last I checked there weren't great options for libraries that wanted to support both paradigms. I do like asyncio and much of my work currently uses async frameworks, so that fits well. And I'd say in the last few years the level of support in python has reached a threshold where it's becoming a good default choice for many frameworks like this.
If we change to async, do we need to still support sync mode?
If we change to async, do we need to still support sync mode?
I guess so. I wouldn't want to force users to deal with the added complexity of async requests, they should be able to choose.
If we did try to support both, to Hans' point above, the implementation (our API) would potentially become more complex? As would testing.
Unless we wrapped some functionality around a plain multithread solution, as Jonatan suggested above. Is that an option too, or is that approach inferior?
Yeah, I haven't seen an elegant way to support both sync and async. (I've seen libraries that support multiple asyncio frameworks, but even that can get fairly complex.) Probably the path to supporting both means two different client classes with different underpinnings. Unfortunately, if models are set up to support lazily loading things over the request, they also would need to be specific. That sounds tangled. (This, incidentally, is similar to why SQLAlchemy has such challenges w/ asyncio, despite the db drivers supporting asyncio for a while -- the lazy loading of attrs means that the entire ORM framework needs to pick to be either sync or async and that's a tall order.)
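Very roughly, supporting both would probably end up looking something like this; the names and shapes here are illustrative only, not a concrete proposal for the stravalib API:

from stravalib.model import Activity

class Client:
    def get_activity(self, activity_id: int) -> Activity:
        ...  # blocking implementation (requests or httpx)

class AsyncClient:
    async def get_activity(self, activity_id: int) -> Activity:
        ...  # awaitable implementation (e.g. httpx.AsyncClient under the hood)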
Obviously this is not my project anymore (I'm eternally grateful to you all for breathing new life into this and for taking this on!), but personally, I don't think there's any clear mandate to support asyncio: I think it's easy enough to fix any performance issues by using multiple threads. And folks that want to use it in the context of asyncio have options like asyncio.to_thread() or loop.run_in_executor(). I don't see stravalib as supporting the true use-case of asyncio where you might have thousands of concurrent requests (thousands of pthreads would be expensive!); Strava's API rate limits would typically preclude that usage pattern. I think the reason to move to asyncio would just be if the developers felt strongly opinionated that it was the right way to go. (I think, though, in the context of data science applications, Jupyter notebooks, etc., asyncio is definitely a less popular paradigm.)
Unless we wrapped some functionality around a plain multithread solution, as Jonatan suggested above
I wasn't entirely clear, but my suggestion was to use multithreading on the user application side, not as part of stravalib. E.g., when you have a bunch of activity ids, and you want to fetch their streams as fast as possible (within your rate limits), you could use something like
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as executor:
    streams = executor.map(client.get_activity_streams, ids)
which would prevent each call of get_activity_streams() from waiting on the completion of the previous call. Instead of map(), you could also use submit() to immediately return control without waiting on the result.
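A minimal sketch of the submit() variant (again assuming an authenticated client and a list of ids in your own code):

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=10)
# submit() returns a Future immediately, so the caller keeps control
# and can collect results later (or attach callbacks).
futures = [executor.submit(client.get_activity_streams, i) for i in ids]
# ... later, when the results are actually needed:
streams = [f.result() for f in futures]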