Add stream method for HuggingFacePipeline Object
Probably use this:
https://huggingface.co/docs/transformers/main/en/generation_strategies#streaming
We probably want to expose the raw generator objects so that people can create streaming APIs w/ SSE
This would be great! TextStreamer just prints to stdout, so TextIteratorStreamer is probably a better choice: https://huggingface.co/docs/transformers/main/en/internal/generation_utils#transformers.TextIteratorStreamer
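For reference, the TextIteratorStreamer pattern from the linked docs looks roughly like the sketch below (gpt2 and the prompt are placeholders): generate() runs in a background thread while the streamer is consumed as a plain Python iterator, which is exactly the kind of raw generator an SSE endpoint could wrap.

# Minimal sketch of TextIteratorStreamer usage; model name and prompt are placeholders.
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("The quick brown fox", return_tensors="pt")

# generate() blocks until completion, so run it in a background thread
# and consume the streamer in the foreground as tokens arrive.
thread = Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=50),
)
thread.start()

for new_text in streamer:  # yields decoded text pieces as they are generated
    print(new_text, end="", flush=True)

thread.join()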
It seems there is already a pull request for this: https://github.com/hwchase17/langchain/pull/1222, which is waiting for review from @agola11.
Hi, @sam-h-bean! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you requested a stream method for the HuggingFacePipeline object to enable streaming generation strategies. There has been some discussion in the comments, including suggestions to expose raw generator objects for building streaming APIs and to use TextIteratorStreamer rather than TextStreamer. Additionally, there is a pull request waiting for review from @agola11.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!
Please don't close this issue; I'm facing the same problem: https://stackoverflow.com/questions/77197723/how-do-i-stream-huggingfacepipeline-output-to-a-langchain-dataframe-agent
@baskaryan Could you please help @PyroGenesis with the issue they are facing? They have indicated that the problem mentioned in the closed issue is still relevant and provided a link to a related Stack Overflow question. Thank you!
@PyroGenesis @sam-h-bean I created a pull request that implements streaming for the HuggingFacePipeline by adding an _astream method. It is still a bit hacky, I think, but it works very well. Here is the link to the PR: https://github.com/langchain-ai/langchain/pull/14090
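As a rough sketch of the general idea (not necessarily how the linked PR implements its _astream), a synchronous _stream override on a HuggingFacePipeline subclass could run the blocking pipeline call in a background thread, consume a TextIteratorStreamer, and yield GenerationChunks. The class name and the per-call streamer kwarg below are assumptions, not the PR's code:

# Hypothetical sketch, not the PR's implementation: stream tokens from a
# HuggingFacePipeline by running the blocking pipeline call in a thread.
from threading import Thread
from typing import Any, Iterator, List, Optional

from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_core.callbacks import CallbackManagerForLLMRun
from langchain_core.outputs import GenerationChunk
from transformers import TextIteratorStreamer


class StreamingHuggingFacePipeline(HuggingFacePipeline):
    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        streamer = TextIteratorStreamer(
            self.pipeline.tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        # The pipeline call blocks until generation finishes, so run it in a
        # background thread and consume tokens from the streamer as they arrive.
        thread = Thread(target=self.pipeline, args=(prompt,), kwargs={"streamer": streamer})
        thread.start()
        for token in streamer:
            chunk = GenerationChunk(text=token)
            if run_manager:
                run_manager.on_llm_new_token(token, chunk=chunk)
            yield chunk
        thread.join()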
I am currently using the streaming in a Chainlit app. It should work with other frontends, but I haven't tested that. Below is my Chainlit app:
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoTokenizer, TextIteratorStreamer
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.runnables.config import RunnableConfig
from langchain_core.runnables.base import Runnable
import transformers
import torch
import chainlit as cl
from chainlit.playground.config import add_llm_provider
from chainlit.playground.providers.langchain import LangchainGenericProvider

template = """
GPT4 Correct User: {question}<|end_of_turn|>GPT4 Correct Assistant:
"""


# Load model and tokenizer
@cl.cache
def load_llama():
    model_name = "berkeley-nest/Starling-LM-7B-alpha"
    model_path = "./model"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # TextIteratorStreamer exposes generated tokens as an iterator
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_path,
        tokenizer=tokenizer,
        trust_remote_code=True,
        device_map="auto",
        max_length=10000,
        do_sample=True,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )
    llm = HuggingFacePipeline(
        pipeline=pipeline,
        model_kwargs={"temperature": 0.8},
    )
    return llm


llm = load_llama()

# add_llm_provider(
#     LangchainGenericProvider(
#         id=llm._llm_type, name="starling7b", llm=llm, is_chat=False
#     )
# )


@cl.on_chat_start
async def on_chat_start():
    model = load_llama()
    # The template's input variable is {question}
    prompt = PromptTemplate(template=template, input_variables=["question"])
    runnable = prompt | model | StrOutputParser()
    cl.user_session.set("runnable", runnable)


@cl.on_message
async def on_message(message: cl.Message):
    runnable = cl.user_session.get("runnable")
    msg = cl.Message(content="")

    async for chunk in runnable.astream(
        {"question": message.content},
        config=RunnableConfig(callbacks=[cl.LangchainCallbackHandler()]),
    ):
        await msg.stream_token(chunk)

    await msg.send()
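To tie this back to the SSE idea from the original request, a hedged sketch of exposing the same chain over server-sent events could look like the following. FastAPI, the /stream route, and the reuse of load_llama() and template from the app above are my assumptions, and token-level streaming still depends on the LLM's astream actually yielding incrementally (e.g. with the PR applied):

# Hypothetical FastAPI SSE endpoint; reuses load_llama() and template from the app above.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers.string import StrOutputParser

app = FastAPI()
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = prompt | load_llama() | StrOutputParser()


@app.get("/stream")
async def stream(question: str):
    async def event_source():
        # Each chunk from astream becomes one SSE "data:" event.
        async for chunk in chain.astream({"question": question}):
            yield f"data: {chunk}\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")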
Thank you for your guidance. I've been experimenting with the code you provided and integrating it into a Chainlit app. However, I'm curious whether this refers to the output being streamed word by word as LangChain intermediate steps, or to the final answer being streamed word by word. In my trials, neither streaming scenario occurred in Chainlit.