
Multiple providers blocking the async event loop

bbrowning opened this issue 7 months ago

🐛 Describe the bug

Llama Stack uses FastAPI and an async event loop. FastAPI dispatches all requests to async request handlers on a single event loop. If that event loop gets blocked - for example, by performing a blocking operation inside a request handler - then all request handling in the server stops until the blocking operation completes. So, it's imperative that we never block the main request event loop.
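
A minimal standalone sketch (not taken from the Llama Stack codebase) of the difference. While `/blocking` is in its `time.sleep()`, every other request to the server is stalled; `/non_blocking` yields control back to the loop, so other requests keep being served:

```python
import asyncio
import time

from fastapi import FastAPI

app = FastAPI()


@app.get("/blocking")
async def blocking_handler():
    # time.sleep() blocks the whole event loop; no other request is handled
    # until these 5 seconds elapse.
    time.sleep(5)
    return {"status": "done"}


@app.get("/non_blocking")
async def non_blocking_handler():
    # asyncio.sleep() suspends only this handler and hands control back to
    # the event loop, so other requests continue to be handled concurrently.
    await asyncio.sleep(5)
    return {"status": "done"}
```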

Today, many of our providers have async request handlers that actually perform blocking operations, so we regularly block the event loop with disk I/O, network calls, compute-intensive work, and the like. Here's an inventory of the provider implementations that appear to be doing blocking operations in async methods:

  • providers/inline/datasetio/localfs/datasetio.py
    • blocking file operations
  • providers/inline/post_training/torchtune/recipes/lora_finetuning_single_device.py
    • blocking torch operations in event loop
  • providers/inline/safety/prompt_guard/prompt_guard.py
    • blocking tokenization and torch operations
  • providers/inline/scoring/braintrust/braintrust.py
    • likely blocking calls to braintrust evaluators
  • providers/inline/tool_runtime/code_interpreter/code_interpreter.py
    • blocking calls to python code execution
  • providers/inline/vector_io/faiss/faiss.py
    • likely blocking calls to faiss index search
  • providers/inline/vector_io/sqlite_vec/sqlite_vec.py
    • blocking database operations (query, insert, etc)
  • providers/remote/datasetio/huggingface/huggingface.py
    • blocking network calls to huggingface
  • providers/remote/inference/bedrock/bedrock.py
    • blocking network calls via bedrock client
  • providers/remote/inference/databricks/databricks.py
    • blocking calls to OpenAI client
  • providers/remote/inference/fireworks/fireworks.py
    • likely some blocking calls in _stream_completion and embeddings
  • providers/remote/inference/passthrough/passthrough.py
    • blocking calls to passthrough LlamaStackClient
  • providers/remote/inference/runpod/runpod.py
    • blocking calls to OpenAI client
  • providers/remote/inference/sambanova/sambanova.py
    • blocking calls to OpenAI client
  • providers/remote/inference/together/together.py
    • blocking calls to Together client
  • providers/remote/safety/bedrock/bedrock.py
    • blocking network calls via bedrock client
  • providers/remote/tool_runtime/bing_search/bing_search.py
    • blocking network calls
  • providers/remote/tool_runtime/brave_search/brave_search.py
    • blocking network calls
  • providers/remote/tool_runtime/tavily_search/tavily_search.py
    • blocking network calls
  • providers/remote/tool_runtime/wolfram_alpha/wolfram_alpha.py
    • blocking network calls
  • providers/remote/vector_io/chroma/chroma.py
    • blocking calls when using local chroma client
  • providers/remote/vector_io/milvus/milvus.py
    • blocking calls with milvus client
  • providers/remote/vector_io/pgvector/pgvector.py
    • blocking SQL calls
  • providers/remote/vector_io/weaviate/weaviate.py
    • blocking calls with weaviate client
  • providers/utils/kvstore/mongodb/mongodb.py
    • blocking calls with MongoClient
  • providers/utils/kvstore/postgres/postgres.py
    • blocking calls with postgres client
  • providers/utils/inference/embedding_mixin.py
    • blocking loading and usage of embedding model
  • providers/utils/inference/litellm_openai_mixin.py
    • blocking calls to litellm

This list was compiled from a quick scan through the code, and I may have missed some. All of these need to be rewritten to either use truly async operations, or to move their blocking work into separate threads or processes and await its completion from the event loop.
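
Roughly speaking, the fixes would follow one of two patterns. Here is a hedged sketch, where `search_index` is a hypothetical stand-in for any of the blocking calls above (a faiss search, a synchronous DB query, an embedding model forward pass, etc.), not an actual Llama Stack function:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def search_index(query: str) -> list[str]:
    # Placeholder for a blocking operation (e.g. a synchronous DB query or a
    # faiss index search); imagine this takes hundreds of milliseconds.
    return [query]


# Pattern 1: offload blocking (mostly I/O-bound) work to a worker thread and
# await it, keeping the event loop free.
async def query_index(query: str) -> list[str]:
    return await asyncio.to_thread(search_index, query)


# Pattern 2: offload CPU-heavy work to a separate process and await it, since
# threads don't help when the work holds the GIL.
_process_pool = ProcessPoolExecutor()


async def query_index_cpu_heavy(query: str) -> list[str]:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_process_pool, search_index, query)
```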

Expected behavior

We should not block the event loop, so that a single Llama Stack server can handle a reasonable number of concurrent requests.

bbrowning · Mar 07 '25