
Gradio LangChain streaming Example

Open taoari opened this issue 1 year ago • 19 comments

  • [x] I have searched to see if a similar issue already exists.

Is your feature request related to a problem? Please describe.

It would be great if there could be a gradio langchain example with streaming support.

There is a LangChain example in the Guide (https://www.gradio.app/guides/creating-a-chatbot-fast#a-langchain-example), but it has no streaming support. LangChain supports streaming via callbacks (https://python.langchain.com/docs/modules/model_io/models/chat/streaming), but the official example only streams to stdout. How can we stream a LangChain LLM's output to Gradio Chatbot messages?

Describe the solution you'd like

A Gradio LangChain example with streaming support is added to the guide at https://www.gradio.app/guides/creating-a-chatbot-fast#a-langchain-example.


taoari avatar Aug 25 '23 18:08 taoari

Yeah this is popular enough that we could consider adding this. @yvrjsharma would you like to take this on?

abidlabs avatar Aug 25 '23 18:08 abidlabs

Can I do this?

Saigenix avatar Aug 25 '23 18:08 Saigenix

Go for it @Saigenix! We'd welcome a contribution

abidlabs avatar Aug 25 '23 18:08 abidlabs

Go for it @Saigenix! We'd welcome a contribution

Hello, I don't have a paid OpenAI account, which is why I can't check whether this works. Can you check out this code?

A LangChain example with streaming support

This is the same as the example above, but with streaming support added. Some chat models provide a streaming response, which means that instead of waiting for the entire response to be returned, you can start processing it as soon as it is available. This is useful if you want to display the response to the user as it is being generated, or process it while it is being generated.


from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    HumanMessage,AIMessage
)
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
import gradio as gr


#  os.environ["OPENAI_API_KEY"] = ""  # Replace with your key

def predict(message, history):
    history_langchain_format = []
    for human, ai in history:
        history_langchain_format.append(HumanMessage(content=human))
        history_langchain_format.append(AIMessage(content=ai))
    history_langchain_format.append(HumanMessage(content=message))
    gpt_response = ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0,openai_api_key="...")
    resp = gpt_response(history_langchain_format)
    return resp.content

gr.ChatInterface(predict).launch()

Saigenix avatar Aug 25 '23 19:08 Saigenix

@Saigenix This does not work. For Gradio streaming, the predict function should be a generator function. The complex part is that LangChain does not return a generator even with streaming=True.
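
For reference, here is a minimal sketch (no LangChain involved) of what Gradio expects: the chat function should be a generator that yields progressively longer strings.

import time
import gradio as gr


def predict(message, history):
    # Fake "LLM" output; Gradio re-renders the chatbot message on every yield,
    # so each yield should be the full text produced so far.
    response = "This reply is streamed word by word."
    partial = ""
    for word in response.split():
        partial += word + " "
        time.sleep(0.1)  # simulate token latency
        yield partial


gr.ChatInterface(predict).launch()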

taoari avatar Aug 25 '23 23:08 taoari

@Saigenix This does not work. For Gradio streaming, the predict function should be a generator function. The complex part is that LangChain does not return a generator even with streaming=True.

wait i will try another way

Saigenix avatar Aug 26 '23 05:08 Saigenix

@Saigenix This does not work. For Gradio streaming, the predict function should be a generator function. The complex part is that LangChain does not return a generator even with streaming=True.

Hey, I tried this:

# Callbacks support token-wise streaming
from langchain.callbacks.base import BaseCallbackHandler

class StreamingStdOutCallbackHandler(BaseCallbackHandler):
    def __init__(self, initial_text=""):
        self.text = initial_text

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # "/" is just a marker to make the token boundaries visible;
        # you don't need it
        self.text += token + "/"

Do you know of any way to update the chatbot content when the on_llm_new_token() function gets called?
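
One rough idea I have not been able to test (no paid API key): run the blocking LLM call in a background thread and have the Gradio function poll the handler's accumulated text. The handler name and polling interval below are only placeholders:

import time
from threading import Thread

from langchain.callbacks.base import BaseCallbackHandler
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
import gradio as gr


class TextAccumulatorHandler(BaseCallbackHandler):
    """Collects streamed tokens into a single string."""

    def __init__(self):
        self.text = ""

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.text += token


def predict(message, history):
    handler = TextAccumulatorHandler()
    llm = ChatOpenAI(streaming=True, callbacks=[handler], temperature=0)

    finished = []

    def task():
        # Blocking call; tokens arrive through the handler as they are generated.
        llm([HumanMessage(content=message)])
        finished.append(True)

    Thread(target=task).start()

    # Poll the handler and yield whatever text has accumulated so far.
    while not finished:
        time.sleep(0.1)
        yield handler.text
    yield handler.text


gr.ChatInterface(predict).launch()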

Saigenix avatar Aug 26 '23 07:08 Saigenix

@Saigenix I do not know of a simple way to achieve this. I have seen examples using subprocess or websockets, but that code is quite difficult to understand, so I am wondering whether this can be implemented at all. LangChain has built-in Streamlit and stdout callback handlers, and streaming works for both, so I do not know why LangChain does not have a built-in Gradio callback handler. Is this really hard to implement?

taoari avatar Aug 27 '23 04:08 taoari

@taoari Yes, this would be much easier if they provided a callback handler for Gradio.

Saigenix avatar Aug 27 '23 04:08 Saigenix

I couldn't find a simple way to do it, but I found a pragmatic solution. I'm not saying this is the recommended way; I just needed to run a demo session with a client, and I wanted to use streaming with my architecture. This is how I did it, adapting this code to use a Queue and a generator function (thanks to this guy). Define the callback:

from langchain.callbacks.base import BaseCallbackHandler

class QueueCallback(BaseCallbackHandler):
    """Callback handler for streaming LLM responses to a queue."""

    def __init__(self, q):
        self.q = q

    def on_llm_new_token(self, token: str, **kwargs: any) -> None:
        self.q.put(token)

    def on_llm_end(self, *args, **kwargs: any) -> None:
        return self.q.empty()

The stream function:

from queue import Queue, Empty
from threading import Thread
from typing import Generator

from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain


def stream(input_text) -> Generator:
    # Create a Queue
    q = Queue()
    job_done = object()

    # Logic for loading the chain you want to use should go here
    # (PROMPT is defined elsewhere in my setup).
    llm = ChatOpenAI(
        streaming=True,
        model='gpt-3.5-turbo-0613',
        callbacks=[QueueCallback(q)],
        temperature=0
    )

    conversation = ConversationChain(
        prompt=PROMPT,
        llm=llm,
        verbose=True
    )

    # Create a function to call - this will run in a thread
    def task():
        conversation.run(input_text)
        q.put(job_done)

    # Create a thread and start the function
    t = Thread(target=task)
    t.start()

    content = ""

    # Get each new token from the queue and yield it from our generator
    while True:
        try:
            next_token = q.get(True, timeout=1)
            if next_token is job_done:
                break
            content += next_token
            yield next_token, content
        except Empty:
            continue

and finally calling it from ChatInterface:

def ask_llm(message, history):
    for next_token, content in stream(message):
        yield content

chatInterface = gr.ChatInterface(
    fn=ask_llm,
...

Hope it helps, good luck!

damigarcia avatar Aug 30 '23 12:08 damigarcia

It's very helpful, thank you!

yy306525121 avatar Sep 28 '23 03:09 yy306525121

This should now be possible using LangChain's Chain.stream() function, e.g. for a chatbot:

from operator import itemgetter
import os

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

# Initialize chat model
llm = ChatOpenAI(openai_api_key=os.environ["OPENAI_API_KEY"])

# Define a prompt template
template = """You are a helpful AI assistant. You give specialized advice on travel.
"""

chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ]
)

# Create conversation history store
memory = ConversationBufferMemory(memory_key="history", return_messages=True)

# Initialize chain
# chain = LLMChain(
#     llm=llm,
#     prompt=chat_prompt,
#     # verbose=True,
#     memory=memory,
# )
chain = (
    RunnablePassthrough.assign(
        history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    )
    | chat_prompt
    | llm
)


def stream_response(input, history):
    if input is not None:
        # ChatInterface struggles with rendering stream
        for response in chain.stream({"input": input}):
            print(response.content)
            yield response.content


# UI
import gradio as gr

gr.ChatInterface(stream_response).queue().launch()

From the print statement, you can see that the response is being generated correctly. Unfortunately, ChatInterface fails to display the results:

Screenshot 2023-12-08 at 2 14 24 PM

Does anyone know why this might be?

jason-trinidad avatar Dec 08 '23 21:12 jason-trinidad

I can debug this more deeply on my end, but have you already tried setting debug=True in launch() to see in the logs what error it is showing?

yvrjsharma avatar Dec 09 '23 03:12 yvrjsharma

Hi @yvrjsharma - thanks for the response. Just tried debug=True. I see no error:

Screenshot 2023-12-11 at 9 35 25 AM

I also tried stepping through each generation. I see now that each word is replacing the previous word, instead of adding to it. E.g.

Screenshot 2023-12-11 at 9 41 31 AM

This seems to happen in Chrome and Safari. Thoughts?

jason-trinidad avatar Dec 11 '23 16:12 jason-trinidad

ah, sweet, thanks for sharing this. I think we just need to return the full message from the stream_response function instead of a single word. I used your above repro to come up with a working solution below. Do you want to try it out and see if that works for you too?

from operator import itemgetter
import os

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

# Initialize chat model
llm = ChatOpenAI(openai_api_key="sk-your-key")

# Define a prompt template
template = """You are a helpful AI assistant. You give specialized advice on travel.
"""

chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ]
)

# Create conversation history store
memory = ConversationBufferMemory(memory_key="history", return_messages=True)

chain = (
    RunnablePassthrough.assign(
        history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    )
    | chat_prompt
    | llm
)


def stream_response(input, history):
    if input is not None:
        partial_message = ""
        # ChatInterface struggles with rendering stream
        for response in chain.stream({"input": input}):
            partial_message += response.content
            print(partial_message)
            yield partial_message 


# UI
import gradio as gr

gr.ChatInterface(stream_response).queue().launch(debug=True)

screenshot - image

yvrjsharma avatar Dec 11 '23 17:12 yvrjsharma

Ah I misunderstood the implementation. That works for me! Thank you 😁

jason-trinidad avatar Dec 12 '23 00:12 jason-trinidad

For others who come across this thread:

  • I can confirm that the streaming solution works. Thanks @yvrjsharma!
  • However, the chain memory in the implementation above is not updated and therefore you won't be able to continue the conversation.

Others may have better solutions, but one way to fix it is to update the stream_response function as follows:

def stream_response(message, history):
    print(f"Input: {message}. History: {history}\n")

    if history:
        human, ai = history[-1]
        memory.chat_memory.add_user_message(HumanMessage(content=human))
        memory.chat_memory.add_ai_message(AIMessage(content=ai))

    print(f"Memory in chain: \n{memory.chat_memory} \n")

    if message is not None:
        partial_message = ""
        # ChatInterface struggles with rendering stream
        for response in chain.stream({"input": message}):
            partial_message += response.content
            # print(partial_message)
            yield partial_message

bent-verbiage avatar Jan 05 '24 10:01 bent-verbiage

Thanks @bent-verbiage, I finished it without a memory store.

import os

from langchain_openai import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage
import gradio as gr

os.environ["OPENAI_API_KEY"] = "sk-xxx"

# Initialize chat model
llm = ChatOpenAI(temperature=0.7, model='gpt-4', streaming=True)


def stream_response(message, history):
    print(f"Input: {message}. History: {history}\n")

    history_langchain_format = []
    for human, ai in history:
        history_langchain_format.append(HumanMessage(content=human))
        history_langchain_format.append(AIMessage(content=ai))

    if message is not None:
        history_langchain_format.append(HumanMessage(content=message))
        partial_message = ""
        for response in llm.stream(history_langchain_format):
            partial_message += response.content
            yield partial_message


iface = gr.ChatInterface(
    stream_response,
    textbox=gr.Textbox(placeholder="Message ChatGPT...", container=False, scale=7),
)

iface.launch(share=True)

HamaWhiteGG avatar Feb 26 '24 06:02 HamaWhiteGG

How can I simulate a streaming output response in my code?

import logging
import sys
import torch
import requests
import subprocess
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms.huggingface import HuggingFaceLLM
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.legacy.embeddings.langchain import LangchainEmbedding
from llama_index.core.prompts.prompts import SimpleInputPrompt


logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Open the left folder icon menu and create a folder named sample and upload documents (pdfs)
documents = SimpleDirectoryReader("/content/G2").load_data()
system_prompt = "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided."
query_wrapper_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    model_kwargs={"torch_dtype": torch.float16 , "load_in_8bit":True}
)

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embed_model
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

import gradio as gr

# Define your query_index function here
def query_index(query, history):
    query_engine = index.as_query_engine()
    response = query_engine.query(query)
    return str(response)

demo = gr.ChatInterface(
    fn=query_index,
    title="G2Bot"
)

# Launch the Gradio Chat Interface
demo.launch(debug=True)
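
One direction I have not been able to verify (assuming this llama_index version supports streaming query engines via as_query_engine(streaming=True) and response_gen; the function name is only illustrative):

# Untested sketch: reuse `index` from above, ask the query engine to stream,
# and accumulate tokens so Gradio re-renders the growing message.
def query_index_streaming(query, history):
    query_engine = index.as_query_engine(streaming=True)
    streaming_response = query_engine.query(query)

    partial = ""
    for token in streaming_response.response_gen:
        partial += token
        yield partial


demo = gr.ChatInterface(fn=query_index_streaming, title="G2Bot")
demo.launch(debug=True)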

Would love help asap, thank you in advance.

YashaswiniIppili avatar Apr 12 '24 12:04 YashaswiniIppili

I looked into this, but the LangChain docs offer so many different ways to stream LLMs that I'm not sure which would be the best example to add to our docs. I'd recommend just using the OpenAI streaming example and modifying it as necessary: https://www.gradio.app/guides/creating-a-chatbot-fast#a-streaming-example-using-openai
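
For reference, a rough sketch of that pattern (not the guide's exact code; assumes the openai>=1.0 client and an OPENAI_API_KEY environment variable):

from openai import OpenAI
import gradio as gr

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict(message, history):
    messages = []
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    stream = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, stream=True
    )
    partial = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            partial += delta
            yield partial


gr.ChatInterface(predict).launch()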

If someone has concrete issues getting this to work, it's best to ask in our Discord server. (I'll close this issue.)

abidlabs avatar Jul 01 '24 13:07 abidlabs