
Experiment with FastAPI (async architecture by default)

Open · willydouhard opened this issue 1 year ago · 8 comments

TLDR: This proposal aims to transition from a synchronous Python runtime (Flask + threads) to an asynchronous one (FastAPI + event loop) for improved maintainability, compatibility with the broader Python ecosystem, and performance. Check out the PR on the cookbook repo to see the differences in real-life Chainlit apps.

Motivation

Currently, Chainlit relies on a synchronous Python server to execute the developer's code and update the UI state accordingly. Since it processes requests one at a time, users must wait for their turn if another user's LangChain agent takes several seconds to complete.

To alleviate this issue, we use threads for each request. However, this approach has limitations:

  1. It requires patching and can cause compatibility issues with other packages.
  2. Threads remain synchronous, which is problematic for I/O-intensive LLM apps, as they have to wait for one API call to finish before starting another.
  3. It does not scale well.

Additionally, the current design prevents the use of async Python packages (OpenAI and LangChain have async implementations).

Solution

I propose switching from a synchronous runtime to an asynchronous one (FastAPI + asyncio). If you're unfamiliar with this concept, you can learn more here.

This change implies that we would use the async and await keywords by default.

OpenAI example

    response = openai.Completion.create(
        model=model_name, prompt=formatted_prompt, **settings
    )

Becomes

    response = await openai.Completion.acreate(
        model=model_name, prompt=formatted_prompt, **settings
    )

LangChain example

    llm = OpenAI(temperature=0)
    chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(prompt_template))
    result = chain("hello world")

Becomes

    llm = OpenAI(temperature=0)
    chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(prompt_template))
    result = await chain.acall("hello world")

Running synchronous code

At times, an async implementation may not be available (e.g., for some LangChain tools). In such cases, we create a thread as a fallback:

    res = agent("2+2")

Becomes

    res = await asyncify(agent.__call__)("2+2")

To summarize, we leverage async implementations wherever possible and fall back to threads when only a synchronous implementation is available. This approach improves performance and compatibility, and it supports both async and sync calls.
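
For readers curious about what asyncify does under the hood, here is a minimal, hypothetical sketch. The actual chainlit.sync.asyncify implementation may differ (for example, it could build on a third-party helper), but the idea is the same: offload a synchronous callable to a worker thread and await the result.

    import asyncio
    import functools
    from typing import Any, Callable

    def asyncify(func: Callable[..., Any]) -> Callable[..., Any]:
        """Wrap a synchronous callable so it can be awaited without
        blocking the event loop (hypothetical sketch, not the real code)."""

        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            loop = asyncio.get_running_loop()
            # None selects the default ThreadPoolExecutor; the blocking
            # call runs there while the event loop keeps serving requests.
            return await loop.run_in_executor(
                None, functools.partial(func, *args, **kwargs)
            )

        return wrapper

With a helper like this, await asyncify(agent.__call__)("2+2") runs the blocking agent call in a thread while the event loop keeps serving other users.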

Developer experience

Simplicity continues to be Chainlit's primary goal. While this proposal is a step in the right direction, I recognize that not everyone may be familiar with the async/await pattern. The intent of this proposal is to make sure the community is comfortable with this transition from a developer experience standpoint.

All APIs, classes, and methods remain backward compatible, except for the use of the async/await pattern.

The rule of thumb is to always use async implementations when available. When an async implementation is not available, or for CPU-bound/long-running tasks, use asyncify to run the code in a thread.

To provide context, here is the PR on the cookbook repo, showcasing the steps to migrate an existing Chainlit app using this proposal.

Finally, here are examples of Chainlit implementations using this proposal.

Synchronous @cl.langchain_run

The LangChain integration was initially hacky and required patching the LangChain library. Since then, LangChain has significantly improved, providing an opportunity to remove those hacks. If you are calling the LangChain agent yourself, pass the Chainlit callback as shown below:

    import chainlit as cl
    from chainlit.sync import asyncify

    @cl.langchain_run
    async def run(agent, input_str):
        res = await asyncify(agent.__call__)(
            "2+2", callbacks=[cl.ChainlitCallbackHandler()]
        )
        await cl.Message(content=res["text"]).send()

Async @cl.langchain_run

Similarly, for the async implementation of LangChain, pass the Chainlit callback as shown below:

    @cl.langchain_run
    async def run(agent, input_str):
        res = await agent.acall("2+2", callbacks=[cl.AsyncChainlitCallbackHandler()])
        await cl.Message(content=res["text"]).send()

If you want to test this right now:

  1. Download the wheel
  2. Run pip uninstall chainlit
  3. Run pip install PATH_TO_WHEEL

willydouhard, Jun 06 '23 15:06

It seems that in most cases there would be only one asyncify call, right? If the person creates a custom chain that uses other chains inside then there could be more than one.

Have you tested this with VectorStores or other LangChain modules that do not have an implementation for async calls?

ogabrielluiz, Jun 08 '23 12:06

> It seems that in most cases there would be only one asyncify call, right? If the person creates a custom chain that uses other chains inside then there could be more than one.
>
> Have you tested this with VectorStores or other LangChain modules that do not have an implementation for async calls?

Yes, only one asyncify call; every sub-call from the chain will run in the same thread and therefore will not block the event loop.
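
For illustration, a hedged sketch of that point: the agent is assumed to be built elsewhere (for example via a Chainlit factory) around a hypothetical synchronous slow_lookup tool, and a single top-level asyncify call is enough because the tool executes inside that same worker thread.

    import chainlit as cl
    from chainlit.sync import asyncify

    def slow_lookup(query: str) -> str:
        # Hypothetical synchronous tool (e.g. a blocking VectorStore query).
        # It runs inside the worker thread created by asyncify below,
        # so it never blocks the event loop.
        return "result for " + query

    @cl.langchain_run
    async def run(agent, input_str):
        # One asyncify call: the agent and every synchronous sub-call it
        # makes (tools, sub-chains, retrievers) share the same thread.
        res = await asyncify(agent.__call__)(
            input_str, callbacks=[cl.ChainlitCallbackHandler()]
        )
        await cl.Message(content=res["text"]).send()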

I have tested with LangChain modules but not Vector DBs yet. I'll try that asap!

willydouhard, Jun 08 '23 13:06

Hi @willydouhard, this is quite a nice solution. Just please note whether streaming will work too in this scenario. I see that you have created a callback handler for Chainlit; make sure it can accept a stream as input and generate tokens.

hooman-bayer, Jun 08 '23 17:06

> Hi @willydouhard, this is quite a nice solution. Just please note whether streaming will work too in this scenario. I see that you have created a callback handler for Chainlit; make sure it can accept a stream as input and generate tokens.

Thanks! Streaming should work. The callback you mention is cl.AsyncChainlitCallbackHandler(). It will handle streaming fully asynchronously if the LLM and LangChain agent you use support it.

If you do not use LangChain, you can use the openai package directly:

    response = await openai.Completion.acreate(
        model=model_name, prompt=formatted_prompt, stream=True, **settings
    )
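
To sketch how those streamed tokens could then be forwarded to the UI (assuming this runs inside an async Chainlit handler and that cl.Message exposes a stream_token helper; both are assumptions for illustration):

    msg = cl.Message(content="")

    # Each chunk arrives asynchronously, so the event loop stays free
    # to serve other users while tokens stream in.
    stream = await openai.Completion.acreate(
        model=model_name, prompt=formatted_prompt, stream=True, **settings
    )
    async for chunk in stream:
        # stream_token is assumed to append the token to the UI message
        await msg.stream_token(chunk["choices"][0]["text"])

    await msg.send()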

willydouhard, Jun 08 '23 17:06

@willydouhard I can also test tomorrow morning once I'm next to my laptop! But this is really amazing! Congrats again, this is going to bring Chainlit to the commercial level, potentially scaling to hundreds of users.

hooman-bayer, Jun 08 '23 17:06

> @willydouhard I can also test tomorrow morning once I'm next to my laptop! But this is really amazing! Congrats again, this is going to bring Chainlit to the commercial level, potentially scaling to hundreds of users.

Ty, would love to have your feedback after testing it!

willydouhard, Jun 08 '23 19:06

Hi Willy, thanks for asking for community feedback. In most cases I think devs will just take the sample code and adapt it. The only jarring call I can see is this one: res = await asyncify(agent.__call__)("2+2"). It seems to reach into internals.

If that's the only way, I think it's OK, as the goal is a good one: better stability and performance.

derekcheungsa, Jun 09 '23 21:06

> Hi Willy, thanks for asking for community feedback. In most cases I think devs will just take the sample code and adapt it. The only jarring call I can see is this one: res = await asyncify(agent.__call__)("2+2"). It seems to reach into internals.
>
> If that's the only way, I think it's OK, as the goal is a good one: better stability and performance.

The call could be rewritten like this:

    @cl.langchain_run
    async def run(agent, input_str):
        @cl.asyncify
        def run_agent_sync():
            res = agent(input_str, callbacks=[cl.ChainlitCallbackHandler()])
            return res

        res = await run_agent_sync()
        await cl.Message(content=res["text"]).send()

This already works and does the same thing; it's just syntactic sugar!

willydouhard, Jun 10 '23 09:06