
Update to real Async and Streaming

Open hooman-bayer opened this issue 1 year ago • 19 comments

This is amazing work! Props to you! A lot of the ideas are really forward-looking, such as asking the user for input or an action!

I was looking into the examples and it seems like the current implementation is not really using asynchronous endpoints. For instance:

  1. The OpenAI Python SDK offers openai.ChatCompletion.acreate, which returns an async generator when called with stream=True
  2. LangChain offers AsyncCallbackHandler

This is especially helpful for agents that can take a long time to run and might otherwise clog the backend; a rough sketch of the async OpenAI call follows below.
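
As a rough illustration of the first point, here is a minimal sketch of token-by-token streaming with openai.ChatCompletion.acreate. It assumes the pre-1.0 openai Python SDK (where ChatCompletion exists) and an OPENAI_API_KEY in the environment; the stream_chat name is just for illustration.

import openai

async def stream_chat(prompt: str):
    # with stream=True, acreate returns an async generator of chunks
    response = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in response:
        # each chunk carries a delta holding zero or more characters of content
        token = chunk["choices"][0]["delta"].get("content", "")
        if token:
            print(token, end="", flush=True)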

Cheers

hooman-bayer avatar May 26 '23 15:05 hooman-bayer

You are correct, we are not leveraging async implementations at the moment. The main reason is that most codebases are not written in the async paradigm, and it is quite hard, and not always possible, to transition from sync to async.

To mitigate this, we currently run agents in different threads, so that at least a single agent will not block the whole app.

As we move forward I would love to see Chainlit support async implementations :)

willydouhard avatar May 26 '23 16:05 willydouhard

As for streaming, Chainlit already supports it with OpenAI, LangChain and any Python code. See https://docs.chainlit.io/concepts/streaming/python :)

willydouhard avatar May 26 '23 16:05 willydouhard

@willydouhard happy to contribute at some point in the near future in case it becomes part of your roadmap. I have been building an app that fully extends LangChain to async, including tools (using their class signature that offers arun). But you are 100% right that most libraries only offer sync APIs.

hooman-bayer avatar May 26 '23 16:05 hooman-bayer

As for streaming, Chainlit already supports it with OpenAI, LangChain and any Python code. See https://docs.chainlit.io/concepts/streaming/python :)

Correct, I saw it, but it is again kind of faking it :) the client still needs to wait until the response is completed by the OpenAI endpoint, which might not be desired. For instance, openai.ChatCompletion.acreate opens an SSE stream and passes the response to the client token by token as it is generated by ChatGPT, so the latency is much smaller.

Also, imagine the case of your WebSocket for action agents: this could bring a much better experience for the user.

hooman-bayer avatar May 26 '23 16:05 hooman-bayer

Interesting. In my understanding, openai.ChatCompletion.create does not wait for the whole response to be generated before it starts streaming tokens. Do you happen to have a link to a resource covering that in more detail?

willydouhard avatar May 28 '23 16:05 willydouhard

To add to the conversation, I tried a few different chain classes and couldn't get streaming to work on any of them (the response only appeared on screen once it was complete).

segevtomer avatar May 30 '23 17:05 segevtomer

For LangChain, only the intermediary steps are streamed at the moment. If you configure your LLM with streaming=True, you should see the intermediary steps being streamed when you unfold them in the UI (click on the Using... button).

I will take a look at how to also stream the final response!
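
For reference, here is a minimal sketch of configuring the LLM with streaming=True, assuming the pre-0.3.0 @cl.langchain_factory API used elsewhere in this thread and LangChain's ChatOpenAI; the prompt is just an example.

import chainlit as cl
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate, LLMChain

template = "Question: {question}\nAnswer:"

@cl.langchain_factory
def factory():
    # streaming=True makes LangChain emit tokens through its callback system,
    # which Chainlit surfaces in the intermediary steps of the UI
    llm = ChatOpenAI(temperature=0, streaming=True)
    prompt = PromptTemplate(template=template, input_variables=["question"])
    return LLMChain(prompt=prompt, llm=llm, verbose=True)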

willydouhard avatar May 30 '23 18:05 willydouhard

@willydouhard see this issue from the openai python sdk for more details. In general, if you want to keep the tool as a simple POC for just a few users, I think it is great as is with sync, but what if we want to scale to 100 users or so? Running all of this on different threads is not very realistic or modern (the user experience with sync also won't be great); async is probably the way to go.

hooman-bayer avatar May 30 '23 18:05 hooman-bayer

Thank you for the link @hooman-bayer. I pretty much agree with you, and we also want to see where the community wants Chainlit to go: staying a rapid prototyping tool, or deploying to production and scaling.

As for streaming final responses in LangChain @segevtomer I found this interesting issue https://github.com/hwchase17/langchain/issues/2483. I'll dig more into it!

willydouhard avatar May 30 '23 18:05 willydouhard

@willydouhard that is using the AsyncCallbackHandler I mentioned above. With it you get access to on_llm_new_token(self, token: str, **kwargs: Any) -> None, which you can customize to return the output token by token to the client.
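
To make the idea concrete, here is a rough sketch of such a handler, assuming LangChain's AsyncCallbackHandler base class; the send_token_to_client coroutine is a hypothetical stand-in for whatever transport (WebSocket, SSE) the app uses.

from typing import Any
from langchain.callbacks.base import AsyncCallbackHandler

class StreamingHandler(AsyncCallbackHandler):
    def __init__(self, send_token_to_client):
        self.send_token_to_client = send_token_to_client

    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        # called once per generated token; push it straight to the client
        await self.send_token_to_client(token)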

hooman-bayer avatar May 30 '23 19:05 hooman-bayer

@segevtomer For clarity, all the intermediary steps are already streamed, including the last one, which is the final response. Then the final response is sent as a standalone message (not an intermediary step) in the UI without any overhead so the user can see it.

What I am saying here is that the only improvement we can make is to stream the last tokens, the ones after your stop sequence (usually Final answer:), directly without waiting for the completion to end. This is what https://github.com/hwchase17/langchain/issues/2483 does.

While this would be a win for the user experience, the actual time gain will be very limited, since it only impacts a few tokens at the very end of the whole process.
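
To illustrate the approach from the linked issue, here is a rough sketch of a handler that buffers tokens until the agent's answer prefix appears and then streams everything after it. The "Final Answer:" prefix and the emit coroutine are assumptions for the sketch, not part of Chainlit's or LangChain's API.

from typing import Any
from langchain.callbacks.base import AsyncCallbackHandler

class FinalAnswerStreamingHandler(AsyncCallbackHandler):
    def __init__(self, emit, answer_prefix: str = "Final Answer:"):
        self.emit = emit  # coroutine that sends a token to the UI
        self.answer_prefix = answer_prefix
        self.buffer = ""
        self.streaming_final = False

    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        if self.streaming_final:
            await self.emit(token)
            return
        self.buffer += token
        if self.answer_prefix in self.buffer:
            # start streaming from the text that follows the prefix
            self.streaming_final = True
            tail = self.buffer.split(self.answer_prefix, 1)[1]
            if tail:
                await self.emit(tail)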

willydouhard avatar May 31 '23 10:05 willydouhard

Thanks for the update @willydouhard. I wouldn't say it's "very limited"; I agree that there will still be a delay because we have to wait for the final prompt in the chain to run, but it is still very valuable to stream the response. Say the final response is over 1k tokens long: streaming it will still make a significant difference for UX.

segevtomer avatar May 31 '23 21:05 segevtomer

I have two problems with this:

  • There is absolutely no indication that the LLM is doing something, not even a wait cursor.
  • If the text generation runs longer than a few seconds, the UI loses connection to the server, and the message is never displayed.

So what I would need is either streaming of the final result, or a configurable timeout before the UI loses connection to the server plus some spinner to indicate that something is happening. Preferably both.

Banbury avatar Jun 03 '23 16:06 Banbury

@Banbury what is your setup? Are you running open source models like gpt4all locally, or are you using the openai api?

willydouhard avatar Jun 03 '23 20:06 willydouhard

I have been trying to run Vicuna locally with langchain. It does work more or less, but only for short texts.

Banbury avatar Jun 03 '23 20:06 Banbury

So I have seen issues with local models and we are investigating them. For API-based models everything should work fine. It would be helpful if you could share a code snippet so I can try to reproduce.

willydouhard avatar Jun 03 '23 20:06 willydouhard

This is the code I have been working on. It's simple enough.

import chainlit as cl
from llama_cpp import Llama
from langchain.llms import LlamaCpp
from langchain.embeddings import LlamaCppEmbeddings
from langchain import PromptTemplate, LLMChain

# local llama.cpp model; streaming=True enables token-by-token output
llm = LlamaCpp(model_path="Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_0.bin", seed=0, n_ctx=2048, max_tokens=512, temperature=0.1, streaming=True)

template = """
### Instruction: 
{message}
### Response:
"""

@cl.langchain_factory
def factory():
    prompt = PromptTemplate(template=template, input_variables=["message"])
    llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)

    return llm_chain

I have been using the same model with llama.cpp and llama-cpp-python without problems.

Banbury avatar Jun 03 '23 20:06 Banbury

Thank you, I am going to prioritize this!

willydouhard avatar Jun 04 '23 09:06 willydouhard

Here is the proposal to move chainlit to async by default https://github.com/Chainlit/chainlit/pull/40. Feedback wanted!

willydouhard avatar Jun 08 '23 16:06 willydouhard

This should be fixed in the latest version, 0.3.0. Please note that it contains breaking changes. We prepared a migration guide to make the transition easy for everyone.
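
For reference, a minimal sketch of what a handler looks like after the move to async, assuming the 0.3.x API where handlers are coroutines and sending a message is awaited (see the migration guide for the exact signatures):

import chainlit as cl

@cl.on_message
async def main(message):
    # handlers are now coroutines and sends are awaited,
    # so long-running work no longer blocks the server
    await cl.Message(content=f"Received: {message}").send()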

willydouhard avatar Jun 13 '23 16:06 willydouhard

Hi @willydouhard, thanks for your clarification on intermediary streaming. I agree with @segevtomer that streaming the final answer to the UI would be valuable for both long generations and simple chains. Not sure if this is the right place to ask, but since the related issues have all been closed: is final answer streaming still on the roadmap, or should a feature request be made?

xleven avatar Jul 02 '23 06:07 xleven

It is still on the roadmap but I was waiting for LangChain to come up with a solution for it. This looks promising!

willydouhard avatar Jul 03 '23 10:07 willydouhard

Just came across a new callback handler for streaming the final iterator. Not sure how related it is, but I hope it helps.

xleven avatar Jul 03 '23 11:07 xleven

Hi, does anyone have an answer? I am stuck and posted the issue here: https://github.com/langchain-ai/langchain/issues/10316. Can someone help me?

Serge9744 avatar Feb 06 '24 15:02 Serge9744