Multiple providers blocking the async event loop

Open bbrowning opened this issue 1 year ago • 11 comments

🐛 Describe the bug

Llama Stack uses FastAPI and an async event loop. FastAPI dispatches requests to all async request handlers on a single event loop. If that event loop gets blocked, for example by performing a blocking operation inside a request handler, then all request handling in the server stops until the blocking operation completes. So, it's imperative that we never block the main request event loop.
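A tiny self-contained demonstration of the problem (not llama-stack code): five "concurrent" handlers that call time.sleep serialize on the event loop, while the same work offloaded via asyncio.to_thread overlaps:

```python
import asyncio
import time

async def blocking_handler():
    # time.sleep blocks the whole event loop: no other coroutine can run.
    time.sleep(0.2)

async def nonblocking_handler():
    # asyncio.to_thread moves the blocking call into a worker thread,
    # letting the event loop keep serving other requests.
    await asyncio.to_thread(time.sleep, 0.2)

async def main():
    # Five blocking handlers run one after another (~1.0s total);
    # five offloaded handlers overlap (~0.2s total).
    start = time.monotonic()
    await asyncio.gather(*(blocking_handler() for _ in range(5)))
    blocked = time.monotonic() - start

    start = time.monotonic()
    await asyncio.gather(*(nonblocking_handler() for _ in range(5)))
    offloaded = time.monotonic() - start
    return blocked, offloaded

blocked, offloaded = asyncio.run(main())
print(f"blocking: {blocked:.2f}s, offloaded: {offloaded:.2f}s")
```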

Today, many of our providers have async request handlers that are actually performing blocking operations. So, we're regularly blocking the event loop for disk I/O, network operations, compute-intensive tasks, and related things. Here's an inventory of all the provider implementations today that appear to be doing blocking operations in async methods:

  • providers/inline/datasetio/localfs/datasetio.py
    • blocking file operations
  • providers/inline/post_training/torchtune/recipes/lora_finetuning_single_device.py
    • blocking torch operations in event loop
  • providers/inline/safety/prompt_guard/prompt_guard.py
    • blocking tokenization and torch operations
  • providers/inline/scoring/braintrust/braintrust.py
    • likely blocking calls to braintrust evaluators
  • providers/inline/tool_runtime/code_interpreter/code_interpreter.py
    • blocking calls to python code execution
  • providers/inline/vector_io/faiss/faiss.py
    • likely blocking calls to faiss index search
  • providers/inline/vector_io/sqlite_vec/sqlite_vec.py
    • blocking database operations (query, insert, etc)
  • providers/remote/datasetio/huggingface/huggingface.py
    • blocking network calls to huggingface
  • providers/remote/inference/bedrock/bedrock.py
    • blocking network calls via bedrock client
  • providers/remote/inference/databricks/databricks.py
    • blocking calls to OpenAI client
  • providers/remote/inference/fireworks/fireworks.py
    • likely some blocking calls in _stream_completion and embeddings
  • providers/remote/inference/passthrough/passthrough.py
    • blocking calls to passthrough LlamaStackClient
  • providers/remote/inference/runpod/runpod.py
    • blocking calls to OpenAI client
  • providers/remote/inference/sambanova/sambanova.py
    • blocking calls to OpenAI client
  • providers/remote/inference/together/together.py
    • blocking calls to Together client
  • providers/remote/safety/bedrock/bedrock.py
    • blocking network calls via bedrock client
  • providers/remote/tool_runtime/bing_search/bing_search.py
    • blocking network calls
  • providers/remote/tool_runtime/brave_search/brave_search.py
    • blocking network calls
  • providers/remote/tool_runtime/tavily_search/tavily_search.py
    • blocking network calls
  • providers/remote/tool_runtime/wolfram_alpha/wolfram_alpha.py
    • blocking network calls
  • providers/remote/vector_io/chroma/chroma.py
    • blocking calls when using local chroma client
  • providers/remote/vector_io/milvus/milvus.py
    • blocking calls with milvus client
  • providers/remote/vector_io/pgvector/pgvector.py
    • blocking SQL calls
  • providers/remote/vector_io/weaviate/weaviate.py
    • blocking calls with weaviate client
  • providers/utils/kvstore/mongodb/mongodb.py
    • blocking calls with MongoClient
  • providers/utils/kvstore/postgres/postgres.py
    • blocking calls with postgres client
  • providers/utils/inference/embedding_mixin.py
    • blocking loading and usage of embedding model
  • providers/utils/inference/litellm_openai_mixin.py
    • blocking calls to litellm

This list was compiled from a quick scan through the code, and I may have missed some. All of these need to be rewritten to either use async operations directly or move their blocking operations into separate threads or processes, with async wrappers that wait for those threads or processes to complete.
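As a rough sketch of the rewrite pattern (blocking_search here is a made-up stand-in for any synchronous client call, not a real provider method), the typical fix is to run the synchronous call in the loop's default thread pool:

```python
import asyncio
from functools import partial

def blocking_search(query: str) -> dict:
    # Stand-in for a synchronous client call (network, disk, or compute).
    return {"query": query, "results": ["a", "b"]}

async def search(query: str) -> dict:
    # Run the synchronous call in the default executor's thread pool so
    # the event loop stays free to serve other requests while it runs.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, partial(blocking_search, query))

result = asyncio.run(search("llama"))
print(result["results"])  # → ['a', 'b']
```

On Python 3.9+, `await asyncio.to_thread(blocking_search, query)` is an equivalent shorthand.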

Expected behavior

We should not block the event loop, so that a single Llama Stack server can handle a reasonable number of concurrent requests.

bbrowning avatar Mar 07 '25 21:03 bbrowning

This is a large list, and something we'll have to tackle over time with each provider. So, consider this issue a tracker for the overall problem as it exists today. If we prefer, we can turn this into a checklist, create sub-issues (if that's enabled for this repo?), or open separate issues for each item. It may be a good chance to enlist the larger community, including the original provider contributors, to see if we can split up the work here.

bbrowning avatar Mar 07 '25 21:03 bbrowning

Fixed for torchtune here: https://github.com/meta-llama/llama-stack/pull/1437

I'm going to adopt the same pattern for more providers once this first consumer of the async scheduler is merged.
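As a rough illustration of that idea (this JobTracker is hypothetical, not the llama-stack scheduler API from the PR above), the pattern is to run the long blocking job in a worker thread while the event loop only ever performs quick async status lookups:

```python
import asyncio
import threading
import time
import uuid

class JobTracker:
    """Hypothetical sketch: run a long blocking job, e.g. a fine-tuning
    recipe, in a worker thread; the event loop only does fast lookups."""

    def __init__(self):
        self._status = {}

    def _run(self, job_id, fn):
        self._status[job_id] = "running"
        fn()  # the blocking work happens entirely off the event loop
        self._status[job_id] = "completed"

    async def submit(self, fn):
        job_id = str(uuid.uuid4())
        threading.Thread(target=self._run, args=(job_id, fn), daemon=True).start()
        return job_id

    async def status(self, job_id):
        return self._status.get(job_id, "unknown")

async def demo():
    tracker = JobTracker()
    job_id = await tracker.submit(lambda: time.sleep(0.1))
    while await tracker.status(job_id) != "completed":
        await asyncio.sleep(0.02)  # the loop stays responsive while the job runs
    return await tracker.status(job_id)

print(asyncio.run(demo()))  # → completed
```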

booxter avatar Mar 07 '25 23:03 booxter

Will try to pick up some of these to solve the built-in tools' blocking network calls.

  • Made the Faiss search call non-blocking as well.
  • Converted the sync Mongo client to async

#1509

cheesecake100201 avatar Mar 09 '25 07:03 cheesecake100201

Just adding a bit more detail after some investigations

The routes are called by create_dynamic_typed_route in llama_stack/distribution/server/server.py, which awaits them if the handler is defined as a coroutine and otherwise calls them synchronously. But this only keeps the Python interpreter happy: coroutines still block the event loop if they contain a blocking call, and synchronous calls block that same event loop.

e.g. we can see this in datasetio/localfs.py in 'async def iterrows', which blocks the entire server when loading data from CSV (pandas.read_csv)

According to the FastAPI docs https://fastapi.tiangolo.com/async/#very-technical-details: "When you declare a path operation function with normal def instead of async def, it is run in an external threadpool that is then awaited, instead of being called directly (as it would block the server)."

But declaring iterrows as a normal def doesn't do this, due to (I think) the way create_dynamic_typed_route calls the function: it doesn't start a new thread, it just calls the function without 'await'.

I think a reasonable approach would be to

  • Update places where we know blocking is occurring (i.e. the list Ben identifies above) but also
  • Update create_dynamic_typed_route to run any synchronous functions in a thread, as a fallback if needed
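As a rough sketch of that fallback (make_endpoint is a hypothetical stand-in for create_dynamic_typed_route, ignoring request parsing and response typing):

```python
import asyncio
import inspect

def make_endpoint(func):
    # Hypothetical stand-in for create_dynamic_typed_route: coroutine
    # functions are awaited directly; plain sync functions are pushed to
    # a worker thread so they can no longer block the event loop.
    async def endpoint(*args, **kwargs):
        if inspect.iscoroutinefunction(func):
            return await func(*args, **kwargs)
        return await asyncio.to_thread(func, *args, **kwargs)
    return endpoint

def sync_iterrows():
    # Imagine a blocking pandas.read_csv here.
    return [1, 2, 3]

rows = asyncio.run(make_endpoint(sync_iterrows)())
print(rows)  # → [1, 2, 3]
```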

derekhiggins avatar Mar 19 '25 11:03 derekhiggins

providers/utils/inference/embedding_mixin.py blocking loading and usage of embedding model

providers/utils/inference/litellm_openai_mixin.py blocking calls to litellm

providers/remote/inference/runpod/runpod.py blocking calls to OpenAI client

Pr to address these ones is #1645

jaideepr97 avatar Mar 20 '25 17:03 jaideepr97

providers/inline/vector_io/sqlite_vec/sqlite_vec.py blocking database operations (query, insert, etc)

PR to address this is https://github.com/meta-llama/llama-stack/pull/1762

franciscojavierarceo avatar Mar 22 '25 03:03 franciscojavierarceo

code_interpreter was covered here https://github.com/meta-llama/llama-stack/pull/1654

derekhiggins avatar Mar 24 '25 12:03 derekhiggins

This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

github-actions[bot] avatar May 24 '25 00:05 github-actions[bot]

@bbrowning maybe we can have a checklist as part of this bug, since it seems to be worked on piecemeal?

nathan-weinberg avatar Jul 14 '25 14:07 nathan-weinberg

providers/utils/inference/embedding_mixin.py

  • blocking loading and usage of embedding model

https://github.com/llamastack/llama-stack/pull/3335

derekhiggins avatar Sep 04 '25 13:09 derekhiggins

This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

github-actions[bot] avatar Nov 15 '25 00:11 github-actions[bot]