Add stream method for HuggingFacePipeline Object
Probably use this:
https://huggingface.co/docs/transformers/main/en/generation_strategies#streaming
We probably want to expose the raw generator objects so that people can create streaming APIs w/ SSE
This would be great! TextStreamer just prints to stdout, so TextIteratorStreamer is probably a better choice: https://huggingface.co/docs/transformers/main/en/internal/generation_utils#transformers.TextIteratorStreamer
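For reference, the TextIteratorStreamer pattern from the linked docs looks roughly like the sketch below (gpt2 and the prompt are placeholders): generate() runs in a background thread while the streamer is consumed as a plain Python iterator, which is exactly the kind of raw generator an SSE endpoint could wrap.

# Minimal sketch of TextIteratorStreamer usage; model name and prompt are placeholders.
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("The quick brown fox", return_tensors="pt")

# generate() blocks until completion, so run it in a background thread
# and consume the streamer in the foreground as tokens arrive.
thread = Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=50),
)
thread.start()

for new_text in streamer:  # yields decoded text pieces as they are generated
    print(new_text, end="", flush=True)

thread.join()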
It seems there is already a pull request for this: https://github.com/hwchase17/langchain/pull/1222, which is waiting for review from @agola11.
Hi, @sam-h-bean! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you requested a stream method for the HuggingFacePipeline object to enable streaming generation strategies. There has been some discussion in the comments, including suggestions to expose raw generator objects for building streaming APIs and to use TextIteratorStreamer rather than TextStreamer. Additionally, there is a pull request waiting for review from @agola11.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!
Please don't close this issue; I'm facing the same problem: https://stackoverflow.com/questions/77197723/how-do-i-stream-huggingfacepipeline-output-to-a-langchain-dataframe-agent
@baskaryan Could you please help @PyroGenesis with the issue they are facing? They have indicated that the problem mentioned in the closed issue is still relevant and provided a link to a related Stack Overflow question. Thank you!
@PyroGenesis @sam-h-bean I created a pull request that implements streaming for the HuggingFacePipeline by adding an _astream method. It is still a bit hacky, I think, but it works very well. Here is the link to the PR: https://github.com/langchain-ai/langchain/pull/14090
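As a rough sketch of the general idea (not necessarily how the linked PR implements its _astream), a synchronous _stream override on a HuggingFacePipeline subclass could run the blocking pipeline call in a background thread, consume a TextIteratorStreamer, and yield GenerationChunks. The class name and the per-call streamer kwarg below are assumptions, not the PR's code:

# Hypothetical sketch, not the PR's implementation: stream tokens from a
# HuggingFacePipeline by running the blocking pipeline call in a thread.
from threading import Thread
from typing import Any, Iterator, List, Optional

from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_core.callbacks import CallbackManagerForLLMRun
from langchain_core.outputs import GenerationChunk
from transformers import TextIteratorStreamer


class StreamingHuggingFacePipeline(HuggingFacePipeline):
    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        streamer = TextIteratorStreamer(
            self.pipeline.tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        # The pipeline call blocks until generation finishes, so run it in a
        # background thread and consume tokens from the streamer as they arrive.
        thread = Thread(target=self.pipeline, args=(prompt,), kwargs={"streamer": streamer})
        thread.start()
        for token in streamer:
            chunk = GenerationChunk(text=token)
            if run_manager:
                run_manager.on_llm_new_token(token, chunk=chunk)
            yield chunk
        thread.join()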
I am currently using the streaming in a Chainlit app. It should work with other frontends, but I haven't tested that. Below is my Chainlit app:
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoTokenizer, TextIteratorStreamer
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.runnables.config import RunnableConfig
from langchain_core.runnables.base import Runnable
import transformers
import torch
import chainlit as cl
from chainlit.playground.config import add_llm_provider
from chainlit.playground.providers.langchain import LangchainGenericProvider

template = """
GPT4 Correct User: {question}<|end_of_turn|>GPT4 Correct Assistant:
"""


# Load model and tokenizer
@cl.cache
def load_llama():
    model_name = "berkeley-nest/Starling-LM-7B-alpha"
    model_path = "./model"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # TextIteratorStreamer exposes generated tokens as an iterator
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_path,
        tokenizer=tokenizer,
        trust_remote_code=True,
        device_map="auto",
        max_length=10000,
        do_sample=True,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )
    llm = HuggingFacePipeline(
        pipeline=pipeline,
        model_kwargs={"temperature": 0.8},
    )
    return llm


llm = load_llama()

# add_llm_provider(
#     LangchainGenericProvider(
#         id=llm._llm_type, name="starling7b", llm=llm, is_chat=False
#     )
# )


@cl.on_chat_start
async def on_chat_start():
    model = load_llama()
    # The template's input variable is {question}
    prompt = PromptTemplate(template=template, input_variables=["question"])
    runnable = prompt | model | StrOutputParser()
    cl.user_session.set("runnable", runnable)


@cl.on_message
async def on_message(message: cl.Message):
    runnable = cl.user_session.get("runnable")
    msg = cl.Message(content="")

    async for chunk in runnable.astream(
        {"question": message.content},
        config=RunnableConfig(callbacks=[cl.LangchainCallbackHandler()]),
    ):
        await msg.stream_token(chunk)

    await msg.send()
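To tie this back to the SSE idea from the original request, a hedged sketch of exposing the same chain over server-sent events could look like the following. FastAPI, the /stream route, and the reuse of load_llama() and template from the app above are my assumptions, and token-level streaming still depends on the LLM's astream actually yielding incrementally (e.g. with the PR applied):

# Hypothetical FastAPI SSE endpoint; reuses load_llama() and template from the app above.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers.string import StrOutputParser

app = FastAPI()
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = prompt | load_llama() | StrOutputParser()


@app.get("/stream")
async def stream(question: str):
    async def event_source():
        # Each chunk from astream becomes one SSE "data:" event.
        async for chunk in chain.astream({"question": question}):
            yield f"data: {chunk}\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")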
Thank you for your guidance. I've been experimenting with the code you provided and integrating it into a Chainlit app. However, I'm curious whether this refers to the output being streamed word by word as LangChain intermediate steps, or to the final answer being streamed word by word. In my trials, neither streaming scenario occurred in Chainlit.