Multiple providers blocking the async event loop
🐛 Describe the bug
Llama Stack uses FastAPI, which dispatches requests to all async request handlers on a single event loop. If that event loop gets blocked, for example by a blocking operation inside a request handler, then all request handling stops until the blocking operation completes. So it's imperative that we never block the main request event loop.
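To make the failure mode concrete, here's a minimal standalone sketch (hypothetical handlers, not Llama Stack code) showing that a blocking call inside a coroutine serializes everything on the loop, while an awaited call lets requests overlap:

```python
import asyncio
import time

async def bad_handler():
    time.sleep(0.2)  # blocking: freezes the event loop for every other request
    return "done"

async def good_handler():
    await asyncio.sleep(0.2)  # non-blocking: other coroutines keep running
    return "done"

async def main():
    start = time.monotonic()
    await asyncio.gather(good_handler(), good_handler())
    concurrent = time.monotonic() - start  # ~0.2s: the sleeps overlap

    start = time.monotonic()
    await asyncio.gather(bad_handler(), bad_handler())
    serialized = time.monotonic() - start  # ~0.4s: each sleep blocks the loop
    return concurrent, serialized

concurrent, serialized = asyncio.run(main())
```

The same effect applies to FastAPI: one handler doing blocking disk or network I/O stalls every in-flight request on the server.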
Today, many of our providers have async request handlers that are actually performing blocking operations. So, we're regularly blocking the event loop for disk I/O, network operations, compute-intensive tasks, and related things. Here's an inventory of all the provider implementations today that appear to be doing blocking operations in async methods:
- providers/inline/datasetio/localfs/datasetio.py
  - blocking file operations
- providers/inline/post_training/torchtune/recipes/lora_finetuning_single_device.py
  - blocking torch operations in event loop
- providers/inline/safety/prompt_guard/prompt_guard.py
  - blocking tokenization and torch operations
- providers/inline/scoring/braintrust/braintrust.py
  - likely blocking calls to braintrust evaluators
- providers/inline/tool_runtime/code_interpreter/code_interpreter.py
  - blocking calls to python code execution
- providers/inline/vector_io/faiss/faiss.py
  - likely blocking calls to faiss index search
- providers/inline/vector_io/sqlite_vec/sqlite_vec.py
  - blocking database operations (query, insert, etc)
- providers/remote/datasetio/huggingface/huggingface.py
  - blocking network calls to huggingface
- providers/remote/inference/bedrock/bedrock.py
  - blocking network calls via bedrock client
- providers/remote/inference/databricks/databricks.py
  - blocking calls to OpenAI client
- providers/remote/inference/fireworks/fireworks.py
  - likely some blocking calls in _stream_completion and embeddings
- providers/remote/inference/passthrough/passthrough.py
  - blocking calls to passthrough LlamaStackClient
- providers/remote/inference/runpod/runpod.py
  - blocking calls to OpenAI client
- providers/remote/inference/sambanova/sambanova.py
  - blocking calls to OpenAI client
- providers/remote/inference/together/together.py
  - blocking calls to Together client
- providers/remote/safety/bedrock/bedrock.py
  - blocking network calls via bedrock client
- providers/remote/tool_runtime/bing_search/bing_search.py
  - blocking network calls
- providers/remote/tool_runtime/brave_search/brave_search.py
  - blocking network calls
- providers/remote/tool_runtime/tavily_search/tavily_search.py
  - blocking network calls
- providers/remote/tool_runtime/wolfram_alpha/wolfram_alpha.py
  - blocking network calls
- providers/remote/vector_io/chroma/chroma.py
  - blocking calls when using local chroma client
- providers/remote/vector_io/milvus/milvus.py
  - blocking calls with milvus client
- providers/remote/vector_io/pgvector/pgvector.py
  - blocking SQL calls
- providers/remote/vector_io/weaviate/weaviate.py
  - blocking calls with weaviate client
- providers/utils/kvstore/mongodb/mongodb.py
  - blocking calls with MongoClient
- providers/utils/kvstore/postgres/postgres.py
  - blocking calls with postgres client
- providers/utils/inference/embedding_mixin.py
  - blocking loading and usage of embedding model
- providers/utils/inference/litellm_openai_mixin.py
  - blocking calls to litellm
This list was compiled from a quick scan through the code, and I may have missed some. All of these need to be rewritten either to use async operations directly or to move their blocking operations into separate threads or processes, with async operations that wait on those threads or processes to complete.
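For providers whose underlying library has no async API, the "separate thread" option above can be sketched with `asyncio.to_thread` (Python 3.9+). This is an illustrative pattern, not code from any of the providers listed; `load_rows` is a hypothetical stand-in for a blocking call like `pandas.read_csv` in a datasetio provider:

```python
import asyncio

def load_rows(path: str) -> list[str]:
    # Blocking disk I/O: must not run directly on the event loop.
    with open(path) as f:
        return f.read().splitlines()

async def iterrows(path: str) -> list[str]:
    # Offload to the default thread pool; the event loop stays free to
    # serve other requests while the worker thread does the blocking read.
    return await asyncio.to_thread(load_rows, path)
```

The same shape works for blocking network clients (boto3, pymongo, psycopg2, etc.); CPU-heavy work such as torch inference may instead want a process pool via `loop.run_in_executor`.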
Expected behavior
We should not be blocking the event loop so that a single Llama Stack server can handle a reasonable amount of concurrent requests.
This is a large list, and something we'll have to tackle over time, provider by provider. So, consider this issue a tracker for the overall problem as it exists today. If we'd prefer to turn this into a checklist, create sub-issues (if that's enabled for this repo?), or open separate issues for each item, we can do that as well. It may be a good chance to enlist the larger community, including the original provider contributors, to see if we can split up the work here.
Fixed for torchtune here: https://github.com/meta-llama/llama-stack/pull/1437
I'm going to adopt the same pattern for more providers once this first consumer of the async scheduler is merged.
Will try to pick up some of these, to fix the built-in tools' blocking network calls.
- Made the Faiss search call non-blocking as well.
- Converted the synchronous Mongo client to an async one
#1509
Just adding a bit more detail after some investigations
The routes are called by create_dynamic_typed_route in llama_stack/distribution/server/server.py, which awaits them with 'await' if the function is defined as a coroutine and otherwise calls them synchronously. But this only keeps the Python interpreter happy: coroutines still block the event loop if they contain a blocking call, and synchronous calls block that same event loop directly.
e.g. we can see this in datasetio/localfs.py in 'async def iterrows', which blocks the entire server when loading data from CSV (pandas.read_csv).
According to the FastAPI docs https://fastapi.tiangolo.com/async/#very-technical-details: "When you declare a path operation function with normal def instead of async def, it is run in an external threadpool that is then awaited, instead of being called directly (as it would block the server)."
But declaring iterrows as a normal def doesn't trigger this behavior, due (I think) to the way create_dynamic_typed_route invokes the function: it doesn't start a new thread, it simply calls the function without 'await'.
I think a reasonable approach would be to
- Update places where we know blocking is occurring (i.e. the list Ben identifies above), but also
- Update create_dynamic_typed_route to run any synchronous functions in a thread, as a fallback if needed
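A hedged sketch of what that fallback could look like. The names here are illustrative (the real route factory in llama_stack/distribution/server/server.py is more involved): coroutines are awaited as before, while plain functions are offloaded to a worker thread, mirroring FastAPI's own threadpool behavior for `def` path operations:

```python
import asyncio
import inspect
from functools import partial

async def dispatch(func, *args, **kwargs):
    """Call a provider function without blocking the event loop."""
    if inspect.iscoroutinefunction(func):
        # Coroutines are awaited directly (they must still avoid blocking calls).
        return await func(*args, **kwargs)
    # Plain functions run in the default executor's thread pool,
    # so a blocking body no longer stalls the loop.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, partial(func, *args, **kwargs))
```

Note this fallback only helps the synchronous case; blocking calls inside `async def` handlers still need to be fixed in the providers themselves.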
providers/utils/inference/embedding_mixin.py blocking loading and usage of embedding model
providers/utils/inference/litellm_openai_mixin.py blocking calls to litellm
providers/remote/inference/runpod/runpod.py blocking calls to OpenAI client
Pr to address these ones is #1645
providers/inline/vector_io/sqlite_vec/sqlite_vec.py blocking database operations (query, insert, etc)
PR to address this is https://github.com/meta-llama/llama-stack/pull/1762
code_interpreter was covered here https://github.com/meta-llama/llama-stack/pull/1654
This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
@bbrowning maybe we can have a checklist be part of this bug since it seems to be being worked on piecemeal?
providers/utils/inference/embedding_mixin.py
- blocking loading and usage of embedding model
https://github.com/llamastack/llama-stack/pull/3335