
Streaming support for LLMs from Hugging Face

Open DanqingZ opened this issue 1 year ago

From the notebook, it says: "LangChain provides streaming support for LLMs. Currently, we support streaming for the OpenAI, ChatOpenAI, and Anthropic implementations, but streaming support for other LLM implementations is on the roadmap."

I am more interested in using the commercially usable open-source LLMs available on Hugging Face, such as Dolly V2. I am wondering whether LangChain plans to include streaming support for Hugging Face LLMs on its roadmap. Additionally, is there any timeline for its integration? Thank you.

DanqingZ avatar Apr 14 '23 22:04 DanqingZ

It seems to just work out of the box if you put a streamer in your pipeline:

from transformers import TextStreamer, pipeline
from langchain.llms import HuggingFacePipeline

streamer = TextStreamer(tokenizer)
pipe = pipeline(model=model,
                tokenizer=tokenizer,
                streamer=streamer)
llm = HuggingFacePipeline(pipeline=pipe)
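
With the streamer attached, simply calling the wrapped LLM already streams: each token is printed to stdout as it is decoded, while the full completion is still returned at the end. A minimal usage sketch, assuming `model` and `tokenizer` were loaded beforehand (the Dolly V2 checkpoint below is just an example):

# Assumes model and tokenizer were loaded earlier, e.g.:
#   tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
#   model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
# Tokens are printed to stdout as they are generated; the call still returns
# the complete text once generation finishes.
text = llm("Explain in one sentence what token streaming is.")
print("\nfull output:", text)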

jloganolson avatar Apr 20 '23 14:04 jloganolson

@jloganolson thank you so much Logan!

I just learnt about TextStreamer from you today. I did some research and found it was released two weeks ago by Hugging Face in the transformers package: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextStreamer, https://github.com/huggingface/transformers/blob/main/src/transformers/generation/streamers.py

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline

# Option 1: stream through a text-generation pipeline.
streamer = TextStreamer(tokenizer, skip_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model_fintuned,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    device=device,
    streamer=streamer
)
pipe(prompts[0])

# Option 2: call model.generate() directly with the streamer.
inputs = tokenizer(prompts[0], return_tensors="pt").to(device)
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model_fintuned.generate(**inputs, streamer=streamer, pad_token_id=tokenizer.eos_token_id, max_length=248, temperature=0.8, top_p=0.8,
                        repetition_penalty=1.25)

DanqingZ avatar Apr 21 '23 04:04 DanqingZ

Related issue: https://github.com/databrickslabs/dolly/issues/84

DanqingZ avatar Apr 21 '23 04:04 DanqingZ

Closing this issue, since it is solved thanks to @jloganolson.

DanqingZ avatar Apr 21 '23 04:04 DanqingZ

langchain+gradio chatbot, streaming output

import time
from threading import Thread

from transformers import TextIteratorStreamer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Body of a Gradio chat callback (a generator function): tokenizer, base_model,
# product_retriever, input_message, and history come from the surrounding app.
# TextIteratorStreamer puts decoded tokens on a queue and exposes them as an
# iterator, so this function can consume them while generation runs elsewhere.
streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    streamer=streamer
)
local_llm = HuggingFacePipeline(pipeline=pipe)
enhanced_rqa = RetrievalQA.from_chain_type(llm=local_llm, chain_type="stuff", retriever=product_retriever)

def run_enhanced_rqa(message):
    enhanced_rqa.run(message)

# Run the chain in a background thread so this function can iterate over the
# streamer and push partial output to the chatbot as it arrives.
t = Thread(target=run_enhanced_rqa, args=(input_message,))
t.start()

history[-1][1] = ""
for new_text in streamer:
    history[-1][1] += new_text
    time.sleep(0.05)
    yield history
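
For reference, a minimal sketch of the Gradio wiring that could drive a generator like the one above; the `bot` and `user` function names and the way the last user message is pulled out of `history` are assumptions for illustration, not part of the snippet:

import gradio as gr

def bot(history):
    # Hypothetical wrapper: the streaming snippet above becomes the body of this
    # generator, with input_message taken from the last user turn.
    input_message = history[-1][0]
    # ... pipeline / Thread / streamer loop from above, yielding history ...
    yield history

def user(message, history):
    # Append the user's turn with an empty assistant slot to fill while streaming.
    return "", history + [[message, None]]

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    msg.submit(user, [msg, chatbot], [msg, chatbot]).then(bot, chatbot, chatbot)

demo.queue().launch()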

DanqingZ avatar Apr 21 '23 23:04 DanqingZ

I am creating an indexer, and for that I want to use a CustomLLM. How can I use this streaming method with that type of object? Note: I can't use HuggingFacePipeline or any similar framework; my work is limited to CustomLLM.

ambiSk avatar May 24 '23 08:05 ambiSk

langchain+gradio chatbot, streaming output
(snippet quoted from the earlier comment)

This is not working for me; I'm getting a queue Empty error from the streaming thread. Could you please share the complete Gradio code?

NajiAboo avatar Jul 10 '23 02:07 NajiAboo

I use Llama 2.

Use a pipeline for later:

import torch
from transformers import pipeline, TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)

pipe = pipeline(
    model=model, tokenizer=tokenizer,
    torch_dtype=torch.bfloat16, device_map="auto",
    max_new_tokens=512, do_sample=True, top_k=10, num_return_sequences=1,
    streamer=streamer, eos_token_id=tokenizer.eos_token_id,
)

It is working for me!

dtthanh1971 avatar Jul 26 '23 16:07 dtthanh1971

It streams to stdout, not as a generator you can iterate over.
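
That is the expected behaviour of TextStreamer, which writes decoded tokens to stdout. To consume tokens programmatically, transformers also provides TextIteratorStreamer; a minimal sketch, where the checkpoint name and prompt are placeholders:

from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Placeholder checkpoint; swap in whatever model you are actually using.
model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("What is streaming?", return_tensors="pt")

# generate() blocks until it finishes, so run it in a background thread
# and iterate over the streamer in the foreground.
thread = Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=64))
thread.start()

for token_text in streamer:
    # Each chunk arrives as soon as it is decoded, instead of being printed.
    print(token_text, end="", flush=True)
thread.join()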

Stosan avatar Oct 23 '23 18:10 Stosan

Added a TextStreamer to HuggingFacePipeline, but it doesn't seem to change anything for this issue:

  • https://github.com/langchain-ai/langserve/issues/218

tigerinus avatar Nov 13 '23 09:11 tigerinus

Any new updates on this?

mfwz247 avatar Mar 18 '24 22:03 mfwz247

@NajiAboo Same here, have you solved it? I'm getting a _queue.Empty error.

Aillian avatar Apr 26 '24 07:04 Aillian

If the response contains im_start or im_end tokens and that bothers you, pass skip_special_tokens as a keyword argument to TextStreamer:

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

Shuntw6096 avatar Jun 14 '24 07:06 Shuntw6096

prompt = "How to make sandwich ?" streamer = TextStreamer(tokenizer,skip_prompt=True) This is my code, I want to stop

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer,
    max_length=512,
    min_length=30,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    encoder_repetition_penalty=0.3,
    num_return_sequences=1,
    repetition_penalty=1.2,
    length_penalty=0.5,
    streamer=streamer)
result = pipe(f"<s>[INST] {prompt} [/INST]")

The output stops almost instantly without completing the full sentence; I want at least a minimal complete response. Is there a parameter I'm missing? For example, the output ends like this:

Spread soft bread with mayonnaise or mustard, add your favorite meat and cheese, and enjoy! 2. What is the difference between

It just stops like that. I'm new to this.

gbs-ai avatar Jul 12 '24 12:07 gbs-ai

langchain+gradio chatbot, streaming output
(snippet quoted from the earlier comment)

How do I initialise the tokenizer with a chat_template here?

ShreyGanatra avatar Aug 02 '24 11:08 ShreyGanatra