Streaming support for LLMs from Hugging Face
From the notebook, it says: "LangChain provides streaming support for LLMs. Currently, we support streaming for the OpenAI, ChatOpenAI, and Anthropic implementations, but streaming support for other LLM implementations is on the roadmap."
I am more interested in using the commercially usable open-source LLMs available on Hugging Face, such as Dolly V2. I am wondering whether LangChain plans to include streaming support for Hugging Face LLMs on its roadmap. Additionally, is there any timeline for its integration? Thank you.
It seems to just work out of the box if you put a streamer in your pipeline:

streamer = TextStreamer(tokenizer)
pipe = pipeline(
    model=model,
    tokenizer=tokenizer,
    streamer=streamer,
)
llm = HuggingFacePipeline(pipeline=pipe)
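For anyone trying to reproduce this end to end, here is a sketch of the surrounding setup; the model id is only an example, and in newer releases HuggingFacePipeline lives under langchain_community.llms instead:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline
from langchain.llms import HuggingFacePipeline

model_id = "databricks/dolly-v2-3b"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

streamer = TextStreamer(tokenizer)  # prints tokens to stdout as they are generated
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128, streamer=streamer)
llm = HuggingFacePipeline(pipeline=pipe)

llm("Explain what a streamer does, in one sentence.")  # tokens appear incrementally on stdout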
@jloganolson thank you so much Logan!
I just learned about TextStreamer from you today. I did some research and found it was released two weeks ago by Hugging Face in the transformers package: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextStreamer, https://github.com/huggingface/transformers/blob/main/src/transformers/generation/streamers.py
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline

streamer = TextStreamer(tokenizer, skip_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model_fintuned,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    device=device,
    streamer=streamer,
)
pipe(prompts[0])
inputs = tokenizer(prompts[0], return_tensors="pt").to(device)
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model_fintuned.generate(
    **inputs,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
    max_length=248,
    temperature=0.8,
    top_p=0.8,
    repetition_penalty=1.25,
)
related issues: https://github.com/databrickslabs/dolly/issues/84
Closing this issue, since it is solved thanks to @jloganolson.
langchain+gradio chatbot, streaming output
import time
from threading import Thread

from transformers import TextIteratorStreamer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# TextIteratorStreamer exposes generated text as an iterator instead of printing to stdout.
streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    streamer=streamer,
)
local_llm = HuggingFacePipeline(pipeline=pipe)
enhanced_rqa = RetrievalQA.from_chain_type(llm=local_llm, chain_type="stuff", retriever=product_retriever)

# Run the chain in a background thread so the main thread can consume the streamer.
def run_enhanced_rqa(message):
    enhanced_rqa.run(message)

t = Thread(target=run_enhanced_rqa, args=(input_message,))
t.start()

# Append each new token to the last chat turn and yield the updated history to Gradio.
history[-1][1] = ""
for new_text in streamer:
    history[-1][1] += new_text
    time.sleep(0.05)
    yield history
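For context, that last loop is meant to run inside a Gradio generator callback. A minimal sketch of the surrounding wiring, assuming a tuple-style gr.Chatbot and gr.Textbox UI (the user/bot function names and layout are illustrative assumptions, not the original poster's code):

import time
from threading import Thread

import gradio as gr

def user(message, history):
    # Add the user turn; the bot reply will be streamed into the empty slot.
    return "", history + [[message, ""]]

def bot(history):
    input_message = history[-1][0]
    Thread(target=run_enhanced_rqa, args=(input_message,)).start()
    history[-1][1] = ""
    for new_text in streamer:  # consume tokens as the chain produces them
        history[-1][1] += new_text
        time.sleep(0.05)
        yield history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(bot, chatbot, chatbot)

demo.queue().launch()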
I am creating an indexer, and for that I want to use a CustomLLM. How can I use this streaming method with that type of object? Note: I can't use HuggingFacePipeline or any similar framework; my work is limited to CustomLLM.
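Not an official answer, but a minimal sketch of what a streaming CustomLLM might look like: subclass LangChain's LLM base class and implement _stream with a TextIteratorStreamer, running generate in a background thread. The import paths follow recent langchain-core (older versions expose LLM under langchain.llms.base and GenerationChunk under langchain.schema.output), and the hf_model/hf_tokenizer field names and max_new_tokens value are just placeholders:

from threading import Thread
from typing import Any, Iterator, List, Optional

from langchain_core.callbacks import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.outputs import GenerationChunk
from transformers import TextIteratorStreamer

class StreamingCustomLLM(LLM):
    """Sketch of a custom LLM that streams tokens from a local Hugging Face model."""

    hf_model: Any      # e.g. an AutoModelForCausalLM instance (placeholder field name)
    hf_tokenizer: Any  # the matching tokenizer

    @property
    def _llm_type(self) -> str:
        return "streaming_custom_llm"

    def _call(self, prompt: str, stop: Optional[List[str]] = None,
              run_manager: Optional[CallbackManagerForLLMRun] = None, **kwargs: Any) -> str:
        # Non-streaming path: just join the streamed chunks.
        return "".join(chunk.text for chunk in self._stream(prompt, stop, run_manager, **kwargs))

    def _stream(self, prompt: str, stop: Optional[List[str]] = None,
                run_manager: Optional[CallbackManagerForLLMRun] = None,
                **kwargs: Any) -> Iterator[GenerationChunk]:
        streamer = TextIteratorStreamer(self.hf_tokenizer, skip_prompt=True, skip_special_tokens=True)
        inputs = self.hf_tokenizer(prompt, return_tensors="pt").to(self.hf_model.device)
        # generate() blocks, so run it in a thread and read tokens from the streamer here.
        Thread(target=self.hf_model.generate,
               kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512)).start()
        for text in streamer:
            chunk = GenerationChunk(text=text)
            if run_manager:
                run_manager.on_llm_new_token(text, chunk=chunk)
            yield chunk

With something like this in place, llm.stream(prompt) should yield text chunks as they are generated.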
langchain+gradio chatbot, streaming output
This is not working for me; I'm getting a thread empty error. Could you please share the complete Gradio code?
I use Llama 2.
# Use a pipeline for later
import torch
from transformers import pipeline, TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_new_tokens=512,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
)
It is working for me!
It streams to stdout, though, not as a generator variable.
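If you want the tokens back as an iterator instead of printed to stdout, TextIteratorStreamer can be swapped in and the pipeline call moved to a background thread. A minimal sketch, reusing the model/tokenizer from the snippet above (the prompt and max_new_tokens are just examples):

from threading import Thread

from transformers import TextIteratorStreamer, pipeline

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    streamer=streamer,
)

# Run generation in a thread so the main thread can consume the streamer as a generator.
thread = Thread(target=pipe, args=("Explain streaming in one sentence.",))
thread.start()
for token_text in streamer:
    print(token_text, end="", flush=True)
thread.join()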
Added TextStreamer for HuggingFacePipeline, but it doesn't seem to change anything for this issue:
- https://github.com/langchain-ai/langserve/issues/218
Any new updates on this?
@NajiAboo same, have you solved it?
I'm getting a _queue.Empty error.
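A likely (but unconfirmed) cause: TextIteratorStreamer's timeout argument is passed to the underlying queue, so if the model needs longer than that to produce the next token (e.g. a large model on CPU), the consuming loop raises queue.Empty. Raising the timeout, or leaving it as None so the iterator blocks until text arrives, usually avoids this:

# timeout=None makes the iterator block instead of raising queue.Empty
streamer = TextIteratorStreamer(
    tokenizer,
    timeout=None,
    skip_prompt=True,
    skip_special_tokens=True,
)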
If the response contains im_start or im_end tokens and that bothers you, pass skip_special_tokens as a keyword argument to TextStreamer:
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
prompt = "How to make sandwich ?" streamer = TextStreamer(tokenizer,skip_prompt=True) This is my code, I want to stop
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer,max_length=512,
min_length = 30,
temperature=0.6,
pad_token_id=tokenizer.eos_token_id,
top_p=0.95,
encoder_repetition_penalty = 0.3,
num_return_sequences=1,
repetition_penalty=1.2,
length_penalty= 0.5,
streamer=streamer)
result = pipe(f"<s>[INST] {prompt} [/INST]")
the output stops at instantly without completing the full sentence, I want it as minimum response, Is there any parameter I'm missing, for example:
Spread soft bread with mayonnaise or mustard, add your favorite meat and cheese, and enjoy! 2. What is the difference between It stops like this
I'm new to this.
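Not a definitive fix, but a couple of things may help: max_length counts the prompt tokens too, so max_new_tokens/min_new_tokens are usually the parameters you want for controlling response length, and length_penalty only takes effect with beam search. A sketch of the same pipeline with those changes (the specific values are just examples):

prompt = "How to make sandwich ?"
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,   # budget for generated tokens only, independent of prompt length
    min_new_tokens=64,    # force at least this many new tokens before EOS can end generation
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)
result = pipe(f"<s>[INST] {prompt} [/INST]")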
langchain+gradio chatbot, streaming output
How do I initialise the tokenizer with a chat_template here?
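In case it helps, recent transformers tokenizers expose apply_chat_template, so you can render the conversation yourself before calling the pipeline. A minimal sketch (the model id is only an example, and whether a chat_template is bundled depends on the checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # example checkpoint

messages = [
    {"role": "user", "content": "How do I make a sandwich?"},
]
# Render the conversation into the prompt format the model was trained on.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = pipe(prompt)  # pass the rendered prompt to the text-generation pipeline defined earlier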