
Streaming support for LLMs from Hugging Face

Open DanqingZ opened this issue 1 year ago

From the notebook, it says: "LangChain provides streaming support for LLMs. Currently, we support streaming for the OpenAI, ChatOpenAI, and Anthropic implementations, but streaming support for other LLM implementations is on the roadmap."

I am more interested in using the commercially usable open-source LLMs available on Hugging Face, such as Dolly V2. I am wondering whether LangChain plans to include streaming support for Hugging Face LLMs on its roadmap. Additionally, is there any timeline for its integration? Thank you.

DanqingZ avatar Apr 14 '23 22:04 DanqingZ

It seems to just work out of the box if you put a streamer in your pipeline:

from transformers import TextStreamer, pipeline
from langchain.llms import HuggingFacePipeline

streamer = TextStreamer(tokenizer)
pipe = pipeline(model=model,
                tokenizer=tokenizer,
                streamer=streamer)
llm = HuggingFacePipeline(pipeline=pipe)
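
With the streamer attached, simply calling the wrapped LLM already streams: each token is printed to stdout as it is decoded, while the full completion is still returned at the end. A minimal usage sketch, assuming `model` and `tokenizer` were loaded beforehand (the Dolly V2 checkpoint below is just an example):

# Assumes model and tokenizer were loaded earlier, e.g.:
#   tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
#   model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
# Tokens are printed to stdout as they are generated; the call still returns
# the complete text once generation finishes.
text = llm("Explain in one sentence what token streaming is.")
print("\nfull output:", text)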

jloganolson avatar Apr 20 '23 14:04 jloganolson

@jloganolson thank you so much Logan!

I just learnt about TextStreamer from you today. I did some research and found it was released two weeks ago by Hugging Face in the transformers package: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextStreamer, https://github.com/huggingface/transformers/blob/main/src/transformers/generation/streamers.py

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline

# Option 1: stream through a text-generation pipeline.
streamer = TextStreamer(tokenizer, skip_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model_fintuned,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    device=device,
    streamer=streamer
)
pipe(prompts[0])

# Option 2: call model.generate() directly with the streamer.
inputs = tokenizer(prompts[0], return_tensors="pt").to(device)
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model_fintuned.generate(**inputs, streamer=streamer, pad_token_id=tokenizer.eos_token_id, max_length=248, temperature=0.8, top_p=0.8,
                        repetition_penalty=1.25)

DanqingZ avatar Apr 21 '23 04:04 DanqingZ

Related issue: https://github.com/databrickslabs/dolly/issues/84

DanqingZ avatar Apr 21 '23 04:04 DanqingZ

Closing this issue, since it is solved thanks to @jloganolson.

DanqingZ avatar Apr 21 '23 04:04 DanqingZ

langchain+gradio chatbot, streaming output

import time
from threading import Thread

from transformers import TextIteratorStreamer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Body of a Gradio chat callback (a generator function): tokenizer, base_model,
# product_retriever, input_message, and history come from the surrounding app.
# TextIteratorStreamer puts decoded tokens on a queue and exposes them as an
# iterator, so this function can consume them while generation runs elsewhere.
streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    streamer=streamer
)
local_llm = HuggingFacePipeline(pipeline=pipe)
enhanced_rqa = RetrievalQA.from_chain_type(llm=local_llm, chain_type="stuff", retriever=product_retriever)

def run_enhanced_rqa(message):
    enhanced_rqa.run(message)

# Run the chain in a background thread so this function can iterate over the
# streamer and push partial output to the chatbot as it arrives.
t = Thread(target=run_enhanced_rqa, args=(input_message,))
t.start()

history[-1][1] = ""
for new_text in streamer:
    history[-1][1] += new_text
    time.sleep(0.05)
    yield history
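
For reference, a minimal sketch of the Gradio wiring that could drive a generator like the one above; the `bot` and `user` function names and the way the last user message is pulled out of `history` are assumptions for illustration, not part of the snippet:

import gradio as gr

def bot(history):
    # Hypothetical wrapper: the streaming snippet above becomes the body of this
    # generator, with input_message taken from the last user turn.
    input_message = history[-1][0]
    # ... pipeline / Thread / streamer loop from above, yielding history ...
    yield history

def user(message, history):
    # Append the user's turn with an empty assistant slot to fill while streaming.
    return "", history + [[message, None]]

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    msg.submit(user, [msg, chatbot], [msg, chatbot]).then(bot, chatbot, chatbot)

demo.queue().launch()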

DanqingZ avatar Apr 21 '23 23:04 DanqingZ

I am creating an indexer, and for that I want to use a CustomLLM. How can I use this streaming method with that type of object? Note: I can't use HuggingFacePipeline or any similar framework; my work is limited to CustomLLM.

ambiSk avatar May 24 '23 08:05 ambiSk

langchain+gradio chatbot, streaming output
(snippet quoted from the earlier comment)

This is not working for me; I'm getting a queue Empty error from the streaming thread. Could you please share the complete Gradio code?

NajiAboo avatar Jul 10 '23 02:07 NajiAboo

I use Llama 2.

Use a pipeline for later:

import torch
from transformers import pipeline, TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)

pipe = pipeline(
    model=model, tokenizer=tokenizer,
    torch_dtype=torch.bfloat16, device_map="auto",
    max_new_tokens=512, do_sample=True, top_k=10, num_return_sequences=1,
    streamer=streamer, eos_token_id=tokenizer.eos_token_id,
)

It is working for me!

dtthanh1971 avatar Jul 26 '23 16:07 dtthanh1971

It streams to stdout, not as a generator you can iterate over.
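
That is the expected behaviour of TextStreamer, which writes decoded tokens to stdout. To consume tokens programmatically, transformers also provides TextIteratorStreamer; a minimal sketch, where the checkpoint name and prompt are placeholders:

from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Placeholder checkpoint; swap in whatever model you are actually using.
model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("What is streaming?", return_tensors="pt")

# generate() blocks until it finishes, so run it in a background thread
# and iterate over the streamer in the foreground.
thread = Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=64))
thread.start()

for token_text in streamer:
    # Each chunk arrives as soon as it is decoded, instead of being printed.
    print(token_text, end="", flush=True)
thread.join()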

Stosan avatar Oct 23 '23 18:10 Stosan

Added a TextStreamer to HuggingFacePipeline, but it doesn't seem to change anything for this issue:

  • https://github.com/langchain-ai/langserve/issues/218

tigerinus avatar Nov 13 '23 09:11 tigerinus

Any new updates on this?

mfwz247 avatar Mar 18 '24 22:03 mfwz247

@NajiAboo Same here, have you solved it? I'm getting a _queue.Empty error.

Aillian avatar Apr 26 '24 07:04 Aillian

If the response contains im_start or im_end tokens and that bothers you, pass skip_special_tokens as a keyword argument to TextStreamer:

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

Shuntw6096 avatar Jun 14 '24 07:06 Shuntw6096

prompt = "How to make sandwich ?" streamer = TextStreamer(tokenizer,skip_prompt=True) This is my code, I want to stop

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer,
    max_length=512,
    min_length=30,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    encoder_repetition_penalty=0.3,
    num_return_sequences=1,
    repetition_penalty=1.2,
    length_penalty=0.5,
    streamer=streamer)
result = pipe(f"<s>[INST] {prompt} [/INST]")

The output stops almost instantly without completing the full sentence; I want at least a minimal complete response. Is there a parameter I'm missing? For example, the output ends like this:

Spread soft bread with mayonnaise or mustard, add your favorite meat and cheese, and enjoy! 2. What is the difference between

It just stops like that. I'm new to this.

gbs-ai avatar Jul 12 '24 12:07 gbs-ai

langchain+gradio chatbot, streaming output
(snippet quoted from the earlier comment)

How do I initialise the tokenizer with a chat_template here?

ShreyGanatra avatar Aug 02 '24 11:08 ShreyGanatra